Join us on July 16th as we explore real-world Reinforcement Learning Learn more

Incorrect results using API to build dataset

Level 3
Incorrect results using API to build dataset

Hello,

I am attempting to create a dataset, a recipe that populates the dataset, and a job that executes the recipe all through the Python REST API (dataikuapi).

I have successfully completed all these steps (see code below) including a successful run of the job to execute the recipe and build the dataset.

However, the data populated into the dataset by API initiated job is incorrect.  There should be 4 columns but I only see one (col_0). There do seem to be the correct number of rows but they are all blank.

When I open the recipe in the UI and click the Run button the dataset is populated correctly. This suggests that the job portion of the script is missing something. 

Any ideas?

Thanks!

Marlan

### Create output dataset
output_dataset_name = 'TMP_TD2NZ_CSV_FILE'
params = {'connection': 'filesystem_folders', 'path': '${projectKey}/' + output_dataset_name}
format_params = {'separator': '\t', 'style': 'unix', 'compress': ''}

output_dataset = project.create_dataset(output_dataset_name, type='Filesystem', params=params, 
                                     formatType='csv', formatParams=format_params)

# Set as managed dataset
ds_def = output_dataset.get_definition()
ds_def['managed'] = True
output_dataset.set_definition(ds_def)

### Create SQL query recipe to populate the CSV file
builder = SingleOutputRecipeCreator('sql_query', "TMP_TD2NZ_BUILD_CSV_FILE", project)
builder = builder.with_input(input_dataset_name)
builder = builder.with_output(output_dataset_name)
recipe = builder.build()        

# Update the recipe to set its script
recipe_def = recipe.get_definition_and_payload()
query = "SELECT * FROM {0};".format(input_table_name)
recipe_def.set_payload(query)
recipe.set_definition_and_payload(recipe_def)

### Run job to build CSV dataset
job_def = {
    'type' : 'NON_RECURSIVE_FORCED_BUILD', 
    'outputs' : [{'id' : output_dataset_name, 'partition' : 'NP'}]
    }
job = project.start_job(job_def)
state = ''
while state != 'DONE' and state != 'FAILED' and state != 'ABORTED':
        time.sleep(1)
        state = job.get_status()['baseStatus']['state']

 

0 Kudos
2 Replies
Dataiker
Dataiker

Hi,

Your code does not set the schema of the output dataset.

When you click run from the recipe screen, DSS computes the schema of the query, and uses this to fill the schema of the output dataset. Note that if you instead ran from the Flow, you would experience the same issue.

This schema computation part is not currently available through the API, a workaround could be to run your query with a limit 1 in a SQLExecutor2 so as to retrieve a dataframe from which you could set the schema of the output dataset.

0 Kudos
Level 3
Author

Ah OK, I have to define the schema. I can work with that. I was planning to query system tables to get the SQL source table column specs anyway. Thanks for your help! 

0 Kudos