Accessing datasets from API

Carlos-Q
Carlos-Q Registered Posts: 2 ✭✭✭✭
edited July 2024 in Setup & Configuration

Hi,

I'm working on an API node using a custom Python function. For my function to run, I need to load some datasets (as dataframes). Unfortunately, Dataiku reports that the connection timed out when executing the test queries. I tried to follow the approach given in the documentation, but it's not working:

import dataikuapi
import pandas as pd

def load_dataset(name):
    client = dataikuapi.DSSClient("http://DSS_local_URL:DSS_PORT", "API_secret")
    project = client.get_project("PROJECT_KEY")
    df = get_DataFrame(name,project)
    return len(df) #just trying

def get_DataFrame(dataset, project):
    di_dataset = project.get_dataset(dataset) #Dataset  
    columns = getColumnNamesFromSchema(di_dataset.get_schema()['columns'])
    dataGenerator = di_dataset.iter_rows()
    dataFrame = pd.DataFrame(dataGenerator, columns = columns)
    return dataFrame

def getColumnNamesFromSchema(schema):
    colNames = []
    for colData in schema:
        colNames.append(colData['name'])
    return colNames

Thanks in advance!

P.S. I'm new to APIs, so any help or suggestions will be greatly appreciated.

Answers

  • tim-wright
    tim-wright Partner, L2 Designer, Snowflake Advanced, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 77 Partner
    edited July 2024

    @Carlos-Q, how much data is in the table you are trying to query? I assume you have managed to instantiate the client and get a project object. Have you tried loading just a few rows to debug? You can use the code snippet below, modified to take a num_rows argument. Just make sure to update the connection details and change "PROJECT_KEY" to reference your actual project.

    import dataikuapi
    import pandas as pd

    def load_dataset(name, num_rows):  #MODIFIED
        client = dataikuapi.DSSClient("http://DSS_local_URL:DSS_PORT", "API_secret")
        project = client.get_project("PROJECT_KEY")
        df = get_DataFrame(name,project, num_rows)  #MODIFIED
        return df #just trying
    
    def get_DataFrame(dataset, project, num_rows):  #MODIFIED
        di_dataset = project.get_dataset(dataset) #Dataset  
        columns = getColumnNamesFromSchema(di_dataset.get_schema()['columns'])
        dataGenerator = di_dataset.iter_rows()
    
        # ----- MODIFIED ------- #
        data = []  # empty list of rows  
        for i, row in enumerate(dataGenerator): # iterate through generator
            data.append(row)  # add row to data
            if i+1==num_rows:  # if you have iterated num_rows times break
                break
        dataFrame = pd.DataFrame(data, columns = columns) # create pd.DataFrame
        # ----- MODIFIED ------- #
    
        return dataFrame
    
    def getColumnNamesFromSchema(schema):
        colNames = []
        for colData in schema:
            colNames.append(colData['name'])
        return colNames
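
The row-limiting loop in the snippet could also be written with itertools.islice. This is only a minimal sketch, assuming any row iterator such as the one returned by iter_rows(); take_rows is a hypothetical helper name, and fake_rows is a stand-in generator so the example runs without a DSS connection.

```python
from itertools import islice


def take_rows(row_iter, num_rows):
    """Return at most num_rows items from row_iter as a list."""
    # islice stops pulling from the iterator once num_rows items are read,
    # so the remainder of the dataset is never streamed.
    return list(islice(row_iter, num_rows))


def fake_rows(total):
    # Hypothetical stand-in for di_dataset.iter_rows(): yields one list per row.
    for i in range(total):
        yield [i, "value_%d" % i]


# Only the first 3 of the 1000 available rows are read.
sample = take_rows(fake_rows(1000), 3)
print(sample)
```

The resulting list can be passed to pd.DataFrame(sample, columns=columns) exactly as in the snippet. One difference to be aware of: with num_rows set to 0 this returns an empty list, whereas the i+1==num_rows check never triggers for 0 and would read every row.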

    Let me know if that helps at all.

  • Carlos-Q
    Carlos-Q Registered Posts: 2 ✭✭✭✭

    Hi @tim-wright, thanks for your reply.

    I tried your code with small sets (<100 rows) without success. After checking some settings on my computer I found it to be a connection issue due to the VPN.

    I'm opening another thread about this (and webapps).
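
For anyone hitting a similar timeout, a quick way to separate a network problem (VPN, firewall) from a code problem might be to check whether the DSS host and port are reachable at all before building the client. The sketch below uses only the Python standard library; can_reach is a hypothetical helper name, and the host and port you pass in are whatever your own DSS instance uses.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        # create_connection resolves the host name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call (placeholder values, replace with your own host and port):
# can_reach("my-dss-host.example.com", 8080)
```

If this returns False from the machine running the code, the dataset code is probably not the cause, and the network path is the first thing to check.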
