Accessing datasets from API

Carlos-Q
Level 1

Hi,

I'm working on an API node using a custom Python function. For my function to work I need to fetch some datasets (or dataframes). Unfortunately, Dataiku reports that the connection timed out when executing the test queries. I tried to follow the approach given in the documentation, but it's not working:

import dataikuapi
import pandas as pd

def load_dataset(name):
    client = dataikuapi.DSSClient("http://DSS_local_URL:DSS_PORT", "API_secret")
    project = client.get_project("PROJECT_KEY")
    df = get_DataFrame(name, project)
    return len(df)  # just trying

def get_DataFrame(dataset, project):
    di_dataset = project.get_dataset(dataset)  # DSSDataset handle
    columns = getColumnNamesFromSchema(di_dataset.get_schema()['columns'])
    dataGenerator = di_dataset.iter_rows()
    dataFrame = pd.DataFrame(dataGenerator, columns=columns)
    return dataFrame

def getColumnNamesFromSchema(schema):
    colNames = []
    for colData in schema:
        colNames.append(colData['name'])
    return colNames

Thanks in advance!

 

P.S. I'm new to APIs so any help or suggestion will be greatly appreciated

2 Replies
tim-wright
Level 5

@Carlos-Q, how much data is in the table you are trying to query? I assume you have managed to instantiate the client and get a project object. Have you tried loading just a few rows to debug? You can use the code snippet below, modified to take a num_rows argument. Just make sure to update the connection details and replace "PROJECT_KEY" with your actual project key.

def load_dataset(name, num_rows):  # MODIFIED
    client = dataikuapi.DSSClient("http://DSS_local_URL:DSS_PORT", "API_secret")
    project = client.get_project("PROJECT_KEY")
    df = get_DataFrame(name, project, num_rows)  # MODIFIED
    return df

def get_DataFrame(dataset, project, num_rows):  # MODIFIED
    di_dataset = project.get_dataset(dataset)  # DSSDataset handle
    columns = getColumnNamesFromSchema(di_dataset.get_schema()['columns'])
    dataGenerator = di_dataset.iter_rows()

    # ----- MODIFIED ------- #
    data = []  # empty list of rows
    for i, row in enumerate(dataGenerator):  # iterate through the generator
        data.append(row)  # add row to data
        if i + 1 == num_rows:  # stop after num_rows rows
            break
    dataFrame = pd.DataFrame(data, columns=columns)  # create pd.DataFrame
    # ----- MODIFIED ------- #

    return dataFrame

def getColumnNamesFromSchema(schema):
    colNames = []
    for colData in schema:
        colNames.append(colData['name'])
    return colNames

 

Let me know if that helps at all.
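(Side note for later readers: the manual break in the loop above can be written more compactly with `itertools.islice`. A minimal sketch, using a plain generator as a stand-in for `di_dataset.iter_rows()` so it runs without a DSS connection:)

```python
from itertools import islice

import pandas as pd

def head_dataframe(row_generator, columns, num_rows):
    """Build a DataFrame from at most num_rows rows of a generator."""
    return pd.DataFrame(islice(row_generator, num_rows), columns=columns)

# Stand-in for di_dataset.iter_rows(): yields one row per iteration.
rows = ([i, i * 2] for i in range(1000))
df = head_dataframe(rows, columns=["a", "b"], num_rows=5)
# df has 5 rows and columns "a" and "b"
```

With a real dataset you would pass `di_dataset.iter_rows()` as the generator; `islice` stops consuming it after `num_rows` rows, so the full table is never pulled.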

Carlos-Q
Level 1
Author

Hi @tim-wright, thanks for your reply.

I tried your code with small sets (<100 rows) without success. After checking some settings on my computer, I found it to be a connection issue caused by the VPN.

I'm opening another thread about this (and webapps).
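(For anyone hitting the same symptom: a quick way to tell a DSS-side problem from a network/VPN one is to check whether the DSS base URL answers at all before instantiating the client. A minimal sketch, assuming the `requests` package is available; the URL and timeout are placeholders:)

```python
import requests

def dss_reachable(base_url, timeout=3):
    """Return True if the DSS base URL answers at all within the timeout."""
    try:
        requests.get(base_url, timeout=timeout)
        return True
    except requests.exceptions.RequestException:
        # Covers timeouts, refused connections, and DNS failures
        # (e.g. the VPN being down).
        return False
```

If `dss_reachable("http://DSS_local_URL:DSS_PORT")` returns False, the timeout is a connectivity problem, not an API-key or project-key problem.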