Connect Databricks Catalog from Dataiku API Designer

sugata
sugata Registered Posts: 4 ✭✭
edited October 2024 in Using Dataiku

I have created a connector to query Databricks catalogs from Dataiku. It works fine when I test it inside a Python notebook, but from the API Designer it does not work and asks for a project key.

from dataiku import SQLExecutor2

executor = SQLExecutor2(connection="NAME")

sql_query = f"""select * from catalog.bronze.table_name"""

data = executor.query_to_df(sql_query)

Since the notebook and the API Designer use the same Python environment, do I need to pass the project key? If yes, how do I pass it? Or is there a different way of calling the executor inside the API Designer?

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,591 Neuron

    The API node has no concept of a project. When you connect to the Designer and/or Automation node, you need to specify your project key. Please post your full code in a code block (the </> icon on the toolbar), in particular how you create your DSS client in your API node.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,591 Neuron
  • sugata
    sugata Registered Posts: 4 ✭✭
    from dataiku.apinode.predict.predictor import ClassificationPredictor
    from dataiku import SQLExecutor2

    import dataiku
    import dataikuapi

    if dataiku.default_project_key() is None:
        dataiku.set_default_project_key("<project key>")

    def perform_lookup(panda_df):
        query_for_additioanl_features = f"""
        select rawdata from <databricks catalog>
        where stationId = '{ARRIVAL_ICAO_AIRPORT_CODE}'
        order by ABS(timestampdiff(minute,startTime,'{SCHEDULED_ARRIVAL_TIME_LOCAL}')) limit 1
        """
        executor = SQLExecutor2(connection="BGS_DAS_ADMS_UNITY_WORKSPACE")
        df_data = executor.query_to_df(query_for_additioanl_features)

    class DynamicModelPredictor(ClassificationPredictor):
        def __init__(self, managed_folder_dir=None):
            client = dataiku.api_client()
            project_key = "<project key>"
            project = client.get_project(project_key)
            self.managed_folder = managed_folder_dir

        def predict(self, input_df):
            input_additional_df = perform_lookup(input_df)
            # perform prediction

    The executor is configured properly, as I get the desired output when running from the notebook. I have tried to provide the project key in multiple ways (as above), but every time the API Designer test queries throw an error saying they are unable to locate the project key.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,591 Neuron

    Again, I need to come back to my previous comment. You need to understand that the API node has no concept of a project key; it's merely an infrastructure layer to run API endpoints. It doesn't have projects, datasets, etc. Your code works in a notebook because that notebook is running inside Dataiku. You will get the same error if you try to run the same code outside of Dataiku on a remote system, even with the Dataiku packages installed. The recommended way of doing queries in an API endpoint is to use query enrichments in the API endpoint. Have a look again at the links I posted. You need to follow the deployment steps as well to create the connection in the API node. It is also possible to connect to the Designer or Automation node and execute a query via them. But this is not desirable, since it makes the API endpoint depend on your Designer or Automation node being available to serve API calls.
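For the last option mentioned above (running the query through the Designer or Automation node), a minimal sketch might look like the following. It assumes the public `dataikuapi` client is installed in the endpoint's code environment and that the node is reachable from the API node. The helper names `make_client` and `run_lookup` are hypothetical, and the assumption that `get_schema()` returns a list of column dicts with a `"name"` key should be checked against your DSS version.

```python
def make_client(host, api_key):
    """Build a client for a Designer/Automation node (host and key are yours to supply)."""
    # Imported lazily so the rest of the module can be loaded without the package.
    import dataikuapi
    return dataikuapi.DSSClient(host, api_key)


def run_lookup(client, sql, connection, project_key):
    """Run `sql` on the remote node and return the rows as a list of dicts.

    `client` is expected to be a dataikuapi.DSSClient (or any object exposing
    the same `sql_query` method).
    """
    streamed = client.sql_query(sql, connection=connection, project_key=project_key)
    # Assumption: each schema entry is a dict carrying the column name under "name".
    columns = [col["name"] for col in streamed.get_schema()]
    rows = [dict(zip(columns, row)) for row in streamed.iter_rows()]
    # verify() raises if the query failed part-way through streaming.
    streamed.verify()
    return rows


# Example call, with placeholder values to replace with your own:
# client = make_client("https://<designer or automation node>", "<api key>")
# rows = run_lookup(client, "select 1", "<connection name>", "<project key>")
```

Creating the client once (for example in the predictor's `__init__`) rather than per request would avoid repeated setup cost, but the dependency on the other node remains.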

  • sugata
    sugata Registered Posts: 4 ✭✭

    Thanks for confirming the same @Turribeach

    I am trying enrichment, but the problem is that the actual data source is in Databricks and gets updated every minute, so I need to load the delta into the enrichment every minute. Also, querying enrichment datasets takes a huge amount of time.

    The requirement is that the API caller sends 4 features, based on which I need to look up the database and get additional features. Then, with all the features combined, I need to call a predictive model. What would be the best way to handle this delta table while responding to an API request?

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,591 Neuron

    An API is meant to return a quick, near real-time response. If your response takes longer than that, you should be looking at batch scoring or at doing a reverse API call when the API scoring finishes (i.e. the submitter only calls the DSS API, and you then call that system back on another API endpoint at their end to submit the result when it is ready). In any case, it seems your issue is with the Databricks delta table. Why does it take so long to pull the additional features from Databricks? That's your issue really. You should only be pulling back the data you need for that API call to be scored, not the whole delta table.

  • sugata
    sugata Registered Posts: 4 ✭✭

    One more ask @Turribeach

    My input data contains a datetime field, and I want to do a lookup in the enrichment dataset with this field and retrieve the nearest information if the exact value is not there. For example, the input data has 2024-10-21T18:26:30 but the enrichment dataset only has information for 2024-10-21T18:22:42, which is fine for me. With SQL this can be done easily; how do I handle this with enrichments? Basically, joining input and enrichment will not always be a column-to-column join: we need to perform many SQL-like operations in the retrieval, joining conditions, etc. How do I handle that?

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,591 Neuron

    It looks like the enrichment step only allows simple criteria to be used, so you will need to execute the query via the Designer or Automation node.
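As a rough illustration of the nearest-timestamp lookup discussed above, the query text can be built so that it returns only the single closest row, reusing the `ABS(timestampdiff(...))` ordering from the code posted earlier in the thread. This is only a sketch: `build_nearest_query` is a hypothetical helper, the 4-character check on the station code is an assumption about the input, and the table name is assumed to come from trusted configuration rather than from the API caller.

```python
import re
from datetime import datetime

# Assumption: station IDs are 4-character ICAO-style codes (uppercase letters/digits).
_STATION_RE = re.compile(r"[A-Z0-9]{4}")


def build_nearest_query(table, station_id, scheduled_local):
    """Return SQL that selects the row whose startTime is closest to the input time."""
    # Validate caller-supplied values before interpolating them into SQL text.
    if not _STATION_RE.fullmatch(station_id):
        raise ValueError("unexpected station id: %r" % (station_id,))
    # Parsing rejects anything that is not an ISO timestamp, then we re-format it.
    ts = datetime.fromisoformat(scheduled_local)
    ts_literal = ts.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"select rawdata from {table} "
        f"where stationId = '{station_id}' "
        f"order by ABS(timestampdiff(minute, startTime, '{ts_literal}')) "
        "limit 1"
    )


query = build_nearest_query("catalog.bronze.table_name", "EGLL", "2024-10-21T18:26:30")
print(query)
```

Because of the `limit 1`, only one row travels back per API call, which keeps the lookup in line with the earlier advice not to pull the whole delta table.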
