Accessing datasets from API
Hi,
I'm working on an API node using a custom Python function. For my function to work, I need to read some datasets (as dataframes). Unfortunately, Dataiku reports that the connection timed out when executing the test queries. I tried to follow the approach given in the documentation, but it's not working:
import dataikuapi
import pandas as pd

def load_dataset(name):
    client = dataikuapi.DSSClient("http://DSS_local_URL:DSS_PORT", "API_secret")
    project = client.get_project("PROJECT_KEY")
    df = get_DataFrame(name, project)
    return len(df)  # just trying

def get_DataFrame(dataset, project):
    di_dataset = project.get_dataset(dataset)  # DSSDataset handle
    columns = getColumnNamesFromSchema(di_dataset.get_schema()['columns'])
    dataGenerator = di_dataset.iter_rows()
    dataFrame = pd.DataFrame(dataGenerator, columns=columns)
    return dataFrame

def getColumnNamesFromSchema(schema):
    colNames = []
    for colData in schema:
        colNames.append(colData['name'])
    return colNames
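For context, this is how the function gets called (the dataset name here is just a placeholder):

# Hypothetical call - "my_dataset" stands in for a real dataset in my project
row_count = load_dataset("my_dataset")
print(row_count)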
Thanks in advance!
P.S. I'm new to APIs, so any help or suggestions will be greatly appreciated.
Answers
tim-wright (Partner)
@Carlos-Q, how much data is in the table you are trying to query? I assume you have managed to instantiate the client and get a project object. Have you tried loading just a few rows to debug? You can use the code snippet below, modified to take a num_rows argument. Just make sure to update the connection details and "PROJECT_KEY" to reference your actual project.

import dataikuapi
import pandas as pd

def load_dataset(name, num_rows):  # MODIFIED
    client = dataikuapi.DSSClient("http://DSS_local_URL:DSS_PORT", "API_secret")
    project = client.get_project("PROJECT_KEY")
    df = get_DataFrame(name, project, num_rows)  # MODIFIED
    return df

def get_DataFrame(dataset, project, num_rows):  # MODIFIED
    di_dataset = project.get_dataset(dataset)  # DSSDataset handle
    columns = getColumnNamesFromSchema(di_dataset.get_schema()['columns'])
    dataGenerator = di_dataset.iter_rows()
    # ----- MODIFIED -------
    data = []  # empty list of rows
    for i, row in enumerate(dataGenerator):  # iterate through the generator
        data.append(row)  # add row to data
        if i + 1 == num_rows:  # stop once num_rows rows have been read
            break
    dataFrame = pd.DataFrame(data, columns=columns)  # create pd.DataFrame
    # ----- MODIFIED -------
    return dataFrame

def getColumnNamesFromSchema(schema):
    colNames = []
    for colData in schema:
        colNames.append(colData['name'])
    return colNames
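For example, a first test could look like this ("my_dataset" is a placeholder for one of your dataset names):

# Hypothetical usage: pull just the first 10 rows to verify connectivity and schema
df = load_dataset("my_dataset", num_rows=10)
print(df.head())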
Let me know if that helps at all.
Hi @tim-wright, thanks for your reply. I tried your code with small sets (<100 rows) without success. After checking some settings on my computer, I found it to be a connection issue caused by the VPN.
I'm opening another thread about this (and webapps).
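In case anyone hits the same timeout: a quick way to rule out the API code is to check whether the DSS host is reachable at all from your machine. This is just a sketch (the URL is a placeholder); any HTTP response, even an error status, means the network path works, while a timeout or connection error points at the VPN/firewall instead:

import requests

DSS_URL = "http://DSS_local_URL:DSS_PORT"  # placeholder: your DSS base URL

try:
    resp = requests.get(DSS_URL, timeout=5)
    print("DSS reachable, HTTP status", resp.status_code)
except requests.exceptions.Timeout:
    print("Request timed out - host unreachable on this network")
except requests.exceptions.ConnectionError as err:
    print("Connection failed (check VPN/firewall/hostname):", err)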