DSS API Designer (Read dataset from DSS flow in R/Python API endpoint)

anish_anand · ‎06-12-2020

Hi Team,

Hope you are safe and well!

I am exploring the DSS API Designer to create Python and R endpoints.
I am facing a basic problem: I cannot read a dataset from the DSS flow inside the API function using the conventional dkuReadDataset() function.

I understand that the endpoint does not reference my project folder directly, and that APIs run outside the project environment, but how do I extract the path and read the HDFS dataset in a Python/R API endpoint?

Thanks,

Liev · ‎06-13-2020

Hi @anish_anand

First and foremost, you should be really careful about the load you put on an API endpoint, the API node and your network in general. Streaming large datasets might bring your project, browser and network to a halt.

Having said that, there's nothing in principle stopping you from reading the dataset from an endpoint. You can choose among several alternatives:

- SQL endpoint (just SELECT * from table)

- Dataset lookup without filters

- Python function that reads the dataset through the public API, builds a dataframe and returns it.

As you can see, all of these would return the desired dataset through your existing DSS node.

If you need to retrieve them directly (bypassing DSS), then dataset > Settings will show you the connection, folders and files (if in FS) or table name (if in DB).

anish_anand · ‎06-13-2020

Hi @Liev ,

Thanks for your response!

Yes, the approach that you mentioned makes sense. And yes, we will take care not to burden the node by reading large datasets. Are there any guidelines on the maximum permissible data volume?

One additional question:

How do I read an HDFS dataset in my project using a Python API endpoint? I know the API is a separate service altogether, and I might have to define the project's location and some keys. That is what I am trying to identify: which parameters need to be defined before reading an HDFS dataset in a Python API endpoint?

Please find attached the code I am currently using (basic read dataset in Python).

Thanks

Liev · ‎06-14-2020

Since you'll be accessing your data through DSS but from outside DSS, you will need to do so via either:

- a SQL endpoint, as mentioned, if your dataset is in a DB

- the public API, called inside your Python function endpoint.

If you use the Predictive Maintenance sample project as your sandbox, you could do something like the following:

- Define an API service with a Python function API endpoint

- Inside the function, provide the needed credentials (in practice, those of a service account, not a personal one)

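import pandas as pd
import dataikuapi

# DSSHost (URL of the DSS node) and apiKey (API key of a service account)
# are placeholders: define both before creating the client.
client = dataikuapi.DSSClient(DSSHost, apiKey)

def api_py_function(project_key, dataset_name):
    dataset = client.get_project(project_key).get_dataset(dataset_name)
    # Column names come from the dataset schema
    columns = [c['name'] for c in dataset.get_schema()['columns']]
    # iter_rows() streams every row of the dataset through the public API
    data = list(dataset.iter_rows())
    return pd.DataFrame(columns=columns, data=data).to_json(orient='records')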

- In the sample queries section, define parameters as follows:

{
   "project_key": "DKU_PREDICTIVE_MAINTENANCE",
   "dataset_name": "Assets_at_risk"
}

This will return all records in that dataset.
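
Once the service is deployed, a caller could query the endpoint roughly as sketched below. This is only an illustration: fetch_dataset_records is a hypothetical helper, and it assumes the client object behaves like dataikuapi.APINodeClient, i.e. exposes run_function(endpoint_id, **kwargs) and returns a dict whose "response" key holds the function's return value. The host, service id, endpoint id and key in the comments are made-up placeholders.

```python
import json


def fetch_dataset_records(node_client, endpoint_id, project_key, dataset_name):
    """Call a Python function endpoint and decode the JSON string it returns.

    Assumption: node_client exposes run_function(endpoint_id, **kwargs) and
    the result is a dict with the function's return value under "response".
    """
    result = node_client.run_function(
        endpoint_id, project_key=project_key, dataset_name=dataset_name
    )
    # The endpoint function above returns a JSON string (to_json), so decode it.
    return json.loads(result["response"])


# Possible usage (all values are placeholders, not real settings):
#   import dataikuapi
#   node_client = dataikuapi.APINodeClient(
#       "http://apinode-host:12000", "my_service", "my_api_key")
#   records = fetch_dataset_records(
#       node_client, "my_endpoint",
#       "DKU_PREDICTIVE_MAINTENANCE", "Assets_at_risk")
```

Passing the client in as an argument keeps the helper independent of how the connection is created.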

As already mentioned, if the dataset is on the large side, iterating over every record to compile the result is inefficient.
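
One way to limit the load is to cap the number of rows the endpoint returns. The sketch below is not part of the DSS API: rows_to_records and max_rows are hypothetical names, and it uses only the standard library to turn column names plus a row iterator into a JSON list of records, stopping after max_rows rows.

```python
import json
from itertools import islice


def rows_to_records(columns, rows, max_rows=1000):
    """Build a JSON array of records from at most max_rows rows.

    columns: list of column names.
    rows: any iterable yielding one list of values per row.
    """
    # islice stops pulling from the iterator once max_rows rows are consumed,
    # so the remaining rows are never fetched.
    limited = islice(rows, max_rows)
    return json.dumps([dict(zip(columns, row)) for row in limited])
```

Inside the endpoint function, this could be called as rows_to_records(columns, dataset.iter_rows(), 1000) in place of building the full DataFrame, assuming the row values are JSON-serializable.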
