new dataset write error

cmjurs Registered Posts: 20 ✭✭✭✭

I'm trying to build a piece of code that runs as a recipe. The code builds two datasets, one of which is a database connection, and then runs a SQL query to build out the other. My issue is that when I run the recipe and try to write to the dataset, I keep getting a message that it can't be found. It's obviously there, because I see it in the flow from where the dataset was instantiated. Here is the basic code:

import dataiku
from dataiku import SQLExecutor2

# build database connection
dataset_params = {
    "connection": connection,
    "schema": schema_name,
    "table": table_name
}

dataset_teradata = project.get_dataset(connection_dataset_name)
if not dataset_teradata.exists():
    dataset_teradata = project.create_dataset(
        dataset_name=connection_dataset_name,
        type=type_database_connection,
        params=dataset_params,
        formatParams={}
    )

# create csv dataset shell
path_to_training_full = project_id + '/' + training_dataset_name
params = {'connection': 'filesystem_managed', 'path': path_to_training_full}
format_params = {'separator': '\t', 'style': 'unix', 'compress': ''}

training_full = project.get_dataset(training_dataset_name)
if not training_full.exists():
    training_full = project.create_dataset(training_dataset_name,
                                           type='Filesystem',
                                           params=params,
                                           formatType='csv',
                                           formatParams=format_params)
    ds_def = training_full.get_definition()
    ds_def['managed'] = True
    training_full.set_definition(ds_def)

# get the sql code
sql_file_path = sql_file_name + '.sql'
with open(sql_file_path, 'r') as fd:
    sqlFile = fd.read()

# use the sql code to make a pandas dataframe
executor = SQLExecutor2(dataset=dataiku.Dataset(connection_dataset_name, project_key=project_id, ignore_flow=True))
training_full_df = executor.query_to_df(sqlFile)

# Try to write the pandas dataframe to the csv dataset
dataset_training_full = dataiku.Dataset(training_dataset_name, project_key=project_id, ignore_flow=True)
dataset_training_full.write_with_schema(training_full_df)

The error:

 Oops: an unexpected error occurred
Error in Python process: At line 145: <class 'Exception'>: Dataset TEST_PLUGIN_DEV_CJ.training_full cannot be used : declare it as input or output of your recipe 

The funny thing is that if I run the recipe twice in a row... it works (and if I run it in a notebook, it works the first time)

Does anyone have experience trying to make datasets with code? Eventually I want this to be a plugin, so that's why I was trying to run it as a recipe. Maybe there is a better option?!

Thanks

CJ


Operating system used: Ubuntu

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,088 Neuron

    I am guessing that if you are creating datasets on the fly, you might need to update the recipe via the API to include the new input/output. Have a look at the dataset API methods. This check is only enforced for recipes, which is why it works in a notebook.
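
As a rough illustration of the suggestion above, here is a minimal sketch of declaring a freshly created dataset as an output of a recipe through the public API. It assumes `project` is a `dataikuapi` project handle (for example from `dataiku.api_client().get_project(project_id)`). The helper name `ensure_recipe_output`, the recipe name, and the `"main"` role are hypothetical; plugin recipes define their own role names. The thread does not confirm whether a change made while the recipe is already running applies to that same run, so treat this as something to try, not a verified fix.

```python
def ensure_recipe_output(project, recipe_name, dataset_name, role="main"):
    """Declare dataset_name as an output of recipe_name if it is not one yet.

    `project` is expected to behave like a dataikuapi DSSProject.
    `role` is an assumption: "main" is common for code recipes, but a
    plugin recipe uses the role names from its own descriptor.
    Returns True if the recipe settings were changed, False otherwise.
    """
    settings = project.get_recipe(recipe_name).get_settings()

    # Nothing to do if the dataset is already declared as an output.
    if dataset_name in settings.get_flat_output_refs():
        return False

    settings.add_output(role, dataset_name)
    settings.save()
    return True


# Hypothetical usage inside DSS (names are placeholders):
#   import dataiku
#   project = dataiku.api_client().get_project(project_id)
#   ensure_recipe_output(project, "my_recipe_name", training_dataset_name)
```

Once the dataset is declared on the recipe, it may also be possible to drop `ignore_flow=True` when opening it for writing, though that too would need checking in your instance.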
