Hi,
I'm working on an API node using a custom Python function. In order for my function to run, I need to load some datasets (as dataframes). Unfortunately, Dataiku tells me the connection timed out when executing the test queries. I tried to follow the approach given in the documentation, but it's not working:
import dataikuapi
import pandas as pd

def load_dataset(name):
    client = dataikuapi.DSSClient("http://DSS_local_URL:DSS_PORT", "API_secret")
    project = client.get_project("PROJECT_KEY")
    df = get_DataFrame(name, project)
    return len(df)  # just trying

def get_DataFrame(dataset, project):
    di_dataset = project.get_dataset(dataset)  # dataset handle
    columns = getColumnNamesFromSchema(di_dataset.get_schema()['columns'])
    dataGenerator = di_dataset.iter_rows()
    dataFrame = pd.DataFrame(dataGenerator, columns=columns)
    return dataFrame

def getColumnNamesFromSchema(schema):
    colNames = []
    for colData in schema:
        colNames.append(colData['name'])
    return colNames
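I'm not sure whether the problem is the dataset read itself or the connection to DSS. Would something like this be the right way to check that the client can reach the instance at all? (The URL, port and API secret are placeholders, same as above.)

import dataikuapi

# Placeholder URL/port and API key, same as in the code above
client = dataikuapi.DSSClient("http://DSS_local_URL:DSS_PORT", "API_secret")
# list_project_keys() is a lightweight call; if this already times out,
# the issue is the connection to DSS rather than the dataset read
print(client.list_project_keys())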
Thanks in advance!
P.S. I'm new to APIs, so any help or suggestions will be greatly appreciated.
@Carlos-Q, how much data is in the table you are trying to query? I assume you have managed to instantiate the client and get a project object. Have you tried loading just a few rows to debug? You can use the code snippet below, modified to take a num_rows argument. Just make sure to update the connection details and replace "PROJECT_KEY" with your actual project key.
import dataikuapi
import pandas as pd

def load_dataset(name, num_rows):  # MODIFIED
    client = dataikuapi.DSSClient("http://DSS_local_URL:DSS_PORT", "API_secret")
    project = client.get_project("PROJECT_KEY")
    df = get_DataFrame(name, project, num_rows)  # MODIFIED
    return df  # MODIFIED: return the DataFrame itself

def get_DataFrame(dataset, project, num_rows):  # MODIFIED
    di_dataset = project.get_dataset(dataset)  # dataset handle
    columns = getColumnNamesFromSchema(di_dataset.get_schema()['columns'])
    dataGenerator = di_dataset.iter_rows()
    # ----- MODIFIED ------- #
    data = []  # empty list of rows
    for i, row in enumerate(dataGenerator):  # iterate through the generator
        data.append(row)  # add row to data
        if i + 1 == num_rows:  # stop once num_rows rows have been collected
            break
    dataFrame = pd.DataFrame(data, columns=columns)  # create pd.DataFrame
    # ----- MODIFIED ------- #
    return dataFrame

def getColumnNamesFromSchema(schema):
    colNames = []
    for colData in schema:
        colNames.append(colData['name'])
    return colNames
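For example (the dataset name here is just a placeholder), pulling 10 rows should come back almost immediately if the connection itself is fine:

# Hypothetical usage - replace "my_dataset" with one of your dataset names
df_sample = load_dataset("my_dataset", 10)
print(df_sample.shape)   # should be (10, number_of_columns)
print(df_sample.head())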
Let me know if that helps at all.
Hi @tim-wright, thanks for your reply.
I tried your code with small sets (<100 rows) without success. After checking some settings on my computer, I found it to be a connection issue caused by the VPN.
I'm opening another thread about this (and webapps).