Connect Databricks Catalog from Dataiku API Designer

sugata
sugata Registered Posts: 4 ✭✭
edited October 18 in Using Dataiku

I have created a connector to query Databricks catalogs from Dataiku. This works fine when I test it inside a Python notebook, but from the API Designer it does not work and asks for a project key.

from dataiku import SQLExecutor2

executor = SQLExecutor2(connection="NAME")

sql_query = f"""select * from catalog.bronze.table_name"""

data = executor.query_to_df(sql_query)

Since the notebook and the API Designer are in the same Python environment, do I need to pass the project key? If yes, how do I pass it? Or is there a different way of calling the executor inside the API Designer?

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,024 Neuron

    The API node has no concept of a project; when you connect to the Designer and/or Automation node you need to specify your project key. Please post your full code in a code block (the </> icon on the toolbar), in particular how you create your DSS client in your API node.
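
    For example, a minimal sketch of connecting from API node code back to a Designer or Automation node; the host URL, API key and project key below are placeholders you would replace:

    import dataikuapi

    # The API node has no current project, so the target node and the project key
    # must both be given explicitly.
    client = dataikuapi.DSSClient("https://your-designer-node:11200", "YOUR_API_KEY")
    project = client.get_project("YOUR_PROJECT_KEY")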

  • sugata
    sugata Registered Posts: 4 ✭✭
    from dataiku.apinode.predict.predictor import ClassificationPredictor
    from dataiku import SQLExecutor2
    import dataiku
    import dataikuapi

    if dataiku.default_project_key() is None:
        dataiku.set_default_project_key("<project key>")

    def perform_lookup(panda_df):
        # ARRIVAL_ICAO_AIRPORT_CODE and SCHEDULED_ARRIVAL_TIME_LOCAL are taken from the input
        query_for_additional_features = f"""
        select rawdata from <databricks catalog>
        where stationId = '{ARRIVAL_ICAO_AIRPORT_CODE}'
        order by ABS(timestampdiff(minute, startTime, '{SCHEDULED_ARRIVAL_TIME_LOCAL}')) limit 1
        """
        executor = SQLExecutor2(connection="BGS_DAS_ADMS_UNITY_WORKSPACE")
        df_data = executor.query_to_df(query_for_additional_features)
        return df_data

    class DynamicModelPredictor(ClassificationPredictor):
        def __init__(self, managed_folder_dir=None):
            client = dataiku.api_client()
            project_key = "<project key>"
            project = client.get_project(project_key)
            self.managed_folder = managed_folder_dir

        def predict(self, input_df):
            input_additional_df = perform_lookup(input_df)
            # perform prediction

    The executor is configured properly, as I get the desired output when running from the notebook. I have tried providing the project key in multiple ways (as above), but every time the API Designer test queries throw an error saying it is unable to locate the project key.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,024 Neuron

    Again I need to come back to my previous comment. You need to understand that the API node has no concept of a project key; it's merely an infrastructure layer to run API endpoints. It doesn't have projects, datasets, etc. Your code works in a notebook because that notebook is running inside Dataiku. You would get the same error if you ran the same code outside of Dataiku on a remote system, even with the Dataiku packages installed. The recommended way of doing queries in an API endpoint is to use query enrichments in the API endpoint. Have a look again at the links I posted; you also need to follow the deployment steps to create the connection on the API node. It is also possible to connect to the Designer or Automation node to execute a query via them, but this is not desirable, since it puts a dependency on your Designer or Automation node, which means the API endpoint now depends on your other nodes to serve API calls.
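
    If you do end up going via the Designer or Automation node (with the dependency caveat above), a rough sketch of that pattern follows; the host, API key, connection name and query are all placeholders:

    import dataikuapi

    # Connect to the Designer/Automation node from the API endpoint code
    client = dataikuapi.DSSClient("https://your-designer-node:11200", "YOUR_API_KEY")

    # Stream a SQL query through a connection defined on that node
    query = client.sql_query(
        "select rawdata from catalog.bronze.table_name limit 1",
        connection="YOUR_DATABRICKS_CONNECTION",
    )
    rows = list(query.iter_rows())
    query.verify()  # raises if the query did not complete successfully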

  • sugata
    sugata Registered Posts: 4 ✭✭

    Thanks for confirming, @Turribeach.

    I am trying the enrichment approach, but the problem is that the actual data source is in Databricks and gets updated every minute, so I would need to load the delta into the enrichment every minute. Querying the enrichment datasets also takes a very long time.

    The requirement is that the API will send 4 features, based on which I need to look up the database and get additional features. Then, with all the combined features, I need to call a predictive model. What would be the best way to handle this delta table while responding to an API request?

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,024 Neuron

    An API is meant to return a quick, near real-time response. If your response takes longer than that, you should be looking at batch scoring, or at doing a reverse API call when the scoring finishes (i.e. the submitter calls the DSS API only, and you then call that system back on another API endpoint at their end to submit the result when ready). In any case it seems your issue is with the Databricks delta table. Why does it take so long to pull the additional features from Databricks? That's your real issue. You should only be pulling back the data you need for that API call to be scored, not the whole delta table.
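
    For instance, the per-request lookup could be as narrow as this (a sketch only; the table and column names are the ones from your earlier snippet, so adjust as needed), whichever mechanism ends up executing it:

    def build_lookup_query(station_id, scheduled_arrival_local):
        # Only the column needed and a single nearest row per API call,
        # instead of the whole delta table
        return f"""
            select rawdata
            from <databricks catalog>
            where stationId = '{station_id}'
            order by ABS(timestampdiff(minute, startTime, '{scheduled_arrival_local}'))
            limit 1
        """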

  • sugata
    sugata Registered Posts: 4 ✭✭

    One more question, @Turribeach.

    My input data contains a datetime field, and I want to do a lookup in the enrichment dataset on this field, retrieving the nearest record if an exact match is not there. For example, the input has 2024-10-21T18:26:30 but the enrichment only has information for 2024-10-21T18:22:42, which is fine for me. With SQL this can be done easily; how do I handle it with enrichments? Basically, joining the input and the enrichment will not always be a column-to-column join; we need to perform many SQL-like operations in the retrieval and join conditions. How can that be handled?
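
    For illustration only, one possible way to do a nearest-timestamp match once both sides are available as pandas DataFrames inside the endpoint code (this is not the enrichment feature itself, and all column names here are assumptions):

    import pandas as pd

    def nearest_time_join(input_df, lookup_df):
        # merge_asof needs both frames sorted by the timestamp key,
        # and both columns already parsed as datetimes
        input_sorted = input_df.sort_values("SCHEDULED_ARRIVAL_TIME_LOCAL")
        lookup_sorted = lookup_df.sort_values("startTime")
        # direction="nearest" matches each input row to the closest lookup timestamp,
        # e.g. 2024-10-21T18:26:30 would match 2024-10-21T18:22:42;
        # by="stationId" assumes both frames carry that column for an exact match
        return pd.merge_asof(
            input_sorted,
            lookup_sorted,
            left_on="SCHEDULED_ARRIVAL_TIME_LOCAL",
            right_on="startTime",
            by="stationId",
            direction="nearest",
        )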

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,024 Neuron

    It looks like the enrichment step only allows simple criteria to be used, so you will need to execute the query via the Designer or Automation node.
