Incorrect results using API to build dataset
Hello,
I am attempting to create a dataset, a recipe that populates the dataset, and a job that executes the recipe all through the Python REST API (dataikuapi).
I have successfully completed all these steps (see code below) including a successful run of the job to execute the recipe and build the dataset.
However, the data populated into the dataset by API initiated job is incorrect. There should be 4 columns but I only see one (col_0). There do seem to be the correct number of rows but they are all blank.
When I open the recipe in the UI and click the Run button the dataset is populated correctly. This suggests that the job portion of the script is missing something.
Any ideas?
Thanks!
Marlan
### Create output dataset output_dataset_name = 'TMP_TD2NZ_CSV_FILE' params = {'connection': 'filesystem_folders', 'path': '${projectKey}/' + output_dataset_name} format_params = {'separator': '\t', 'style': 'unix', 'compress': ''} output_dataset = project.create_dataset(output_dataset_name, type='Filesystem', params=params, formatType='csv', formatParams=format_params) # Set as managed dataset ds_def = output_dataset.get_definition() ds_def['managed'] = True output_dataset.set_definition(ds_def) ### Create SQL query recipe to populate the CSV file builder = SingleOutputRecipeCreator('sql_query', "TMP_TD2NZ_BUILD_CSV_FILE", project) builder = builder.with_input(input_dataset_name) builder = builder.with_output(output_dataset_name) recipe = builder.build() # Update the recipe to set its script recipe_def = recipe.get_definition_and_payload() query = "SELECT * FROM {0};".format(input_table_name) recipe_def.set_payload(query) recipe.set_definition_and_payload(recipe_def) ### Run job to build CSV dataset job_def = { 'type' : 'NON_RECURSIVE_FORCED_BUILD', 'outputs' : [{'id' : output_dataset_name, 'partition' : 'NP'}] } job = project.start_job(job_def) state = '' while state != 'DONE' and state != 'FAILED' and state != 'ABORTED': time.sleep(1) state = job.get_status()['baseStatus']['state']
Best Answer
-
Hi,
Your code does not set the schema of the output dataset.
When you click run from the recipe screen, DSS computes the schema of the query, and uses this to fill the schema of the output dataset. Note that if you instead ran from the Flow, you would experience the same issue.
This schema computation part is not currently available through the API, a workaround could be to run your query with a limit 1 in a SQLExecutor2 so as to retrieve a dataframe from which you could set the schema of the output dataset.
Answers
-
Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 320 Neuron
Ah OK, I have to define the schema. I can work with that. I was planning to query system tables to get the SQL source table column specs anyway. Thanks for your help!