Incorrect results using API to build dataset

Marlan
edited July 16 in Using Dataiku

Hello,

I am attempting to create a dataset, a recipe that populates the dataset, and a job that executes the recipe, all through the Python REST API client (dataikuapi).

I have completed all of these steps (see code below), including a successful run of the job that executes the recipe and builds the dataset.

However, the data populated into the dataset by the API-initiated job is incorrect. There should be 4 columns, but I only see one (col_0). There does seem to be the correct number of rows, but they are all blank.

When I open the recipe in the UI and click the Run button, the dataset is populated correctly. This suggests that the job portion of the script is missing something.

Any ideas?

Thanks!

Marlan

### Setup (assumed context: a connected client and project, plus input dataset/table names)
import time
import dataikuapi
from dataikuapi.dss.recipe import SingleOutputRecipeCreator

# client = dataikuapi.DSSClient(host, api_key)
# project = client.get_project(project_key)
# input_dataset_name and input_table_name are defined elsewhere

### Create output dataset
output_dataset_name = 'TMP_TD2NZ_CSV_FILE'
params = {'connection': 'filesystem_folders', 'path': '${projectKey}/' + output_dataset_name}
format_params = {'separator': '\t', 'style': 'unix', 'compress': ''}

output_dataset = project.create_dataset(output_dataset_name, type='Filesystem', params=params,
                                        formatType='csv', formatParams=format_params)

# Set as managed dataset
ds_def = output_dataset.get_definition()
ds_def['managed'] = True
output_dataset.set_definition(ds_def)

### Create SQL query recipe to populate the CSV file
builder = SingleOutputRecipeCreator('sql_query', 'TMP_TD2NZ_BUILD_CSV_FILE', project)
builder = builder.with_input(input_dataset_name)
builder = builder.with_output(output_dataset_name)
recipe = builder.build()

# Update the recipe to set its script (the payload of a sql_query recipe is the SQL text)
recipe_def = recipe.get_definition_and_payload()
query = "SELECT * FROM {0};".format(input_table_name)
recipe_def.set_payload(query)
recipe.set_definition_and_payload(recipe_def)

### Run job to build CSV dataset ('NP' = not partitioned)
job_def = {
    'type': 'NON_RECURSIVE_FORCED_BUILD',
    'outputs': [{'id': output_dataset_name, 'partition': 'NP'}]
}
job = project.start_job(job_def)

# Poll until the job reaches a terminal state
state = ''
while state not in ('DONE', 'FAILED', 'ABORTED'):
    time.sleep(1)
    state = job.get_status()['baseStatus']['state']


Best Answer

  • Clément_Stenac
    Answer ✓

    Hi,

    Your code does not set the schema of the output dataset.

    When you click Run from the recipe screen, DSS computes the schema of the query and uses it to fill in the schema of the output dataset. Note that if you instead ran the job from the Flow, you would experience the same issue.

    This schema computation step is not currently available through the API. A workaround could be to run your query with a LIMIT 1 in a SQLExecutor2, so as to retrieve a dataframe from which you can set the schema of the output dataset.
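
    A minimal sketch of that workaround, assuming the script runs inside DSS (SQLExecutor2 comes from the dataiku package, not dataikuapi), that the connection name my_sql_connection is a placeholder, and that the LIMIT syntax and the dtype-to-DSS-type mapping match your database:

    from dataiku import SQLExecutor2

    # Run the query with LIMIT 1 just to learn the column names and types
    # (adjust the LIMIT syntax if your database uses TOP or SAMPLE instead)
    executor = SQLExecutor2(connection='my_sql_connection')  # placeholder connection name
    probe_df = executor.query_to_df("SELECT * FROM {0} LIMIT 1".format(input_table_name))

    # Map pandas dtype kinds to DSS storage types (simplified; extend as needed)
    def dss_type(dtype):
        if dtype.kind == 'i':
            return 'bigint'
        if dtype.kind == 'f':
            return 'double'
        if dtype.kind == 'b':
            return 'boolean'
        return 'string'

    schema = {'columns': [{'name': col, 'type': dss_type(probe_df[col].dtype)}
                          for col in probe_df.columns]}
    output_dataset.set_schema(schema)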

Answers

  • Marlan

    Ah OK, I have to define the schema. I can work with that. I was planning to query system tables to get the SQL source table column specs anyway. Thanks for your help!
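
    For reference, a minimal sketch of that system-tables approach, assuming an information_schema-style catalog (Teradata and Netezza expose different catalog views, so the query would need adjusting) and a hypothetical type_map from database types to DSS storage types:

    from dataiku import SQLExecutor2

    executor = SQLExecutor2(connection='my_sql_connection')  # placeholder connection name

    # Hypothetical catalog query; real catalog views vary by database
    cols_df = executor.query_to_df("""
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_name = '{0}'
        ORDER BY ordinal_position
    """.format(input_table_name))

    # Hypothetical mapping from database types to DSS storage types
    type_map = {'INTEGER': 'bigint', 'BIGINT': 'bigint', 'VARCHAR': 'string',
                'DATE': 'date', 'DOUBLE PRECISION': 'double', 'NUMERIC': 'double'}

    schema = {'columns': [{'name': row['column_name'],
                           'type': type_map.get(row['data_type'].upper(), 'string')}
                          for _, row in cols_df.iterrows()]}
    output_dataset.set_schema(schema)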
