Python recipe to create a new Dataiku dataset

rs0105 Dataiku DSS Core Designer, Registered Posts: 5 ✭✭✭

I would like to create a Dataiku dataset from Python recipe code, without creating it manually in the recipe. I am able to do this from a notebook in Dataiku, but it fails in a recipe with the following error:

Dataset ABC cannot be used : declare it as input or output of your recipe

I am using the following code to create the Dataiku dataset from a notebook:

import dataiku

# Connect to the DSS API and get the current project
client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
project_variables = dataiku.get_custom_variables()

# Create a tab-separated CSV dataset on the 'xyz' filesystem connection
csv_dataset_name = 'ABC'
params = {'connection': 'xyz', 'path': project_variables['projectKey'] + '/' + csv_dataset_name}
format_params = {'separator': '\t', 'style': 'unix', 'compress': ''}
csv_dataset = project.create_dataset(csv_dataset_name, type='Filesystem', params=params, formatType='csv', formatParams=format_params)

# Mark the dataset as managed and save the modified definition back to DSS
ds_def = csv_dataset.get_definition()
ds_def['managed'] = True
csv_dataset.set_definition(ds_def)

# Handle used to write into the new dataset
output_file = dataiku.Dataset(csv_dataset_name)


  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker
    edited July 17


    Unfortunately, what you are trying to achieve is not possible. A recipe cannot "modify its own Flow".

    In order to guarantee consistency and isolation of jobs, each job runs on a consistent snapshot of the Flow. Adding datasets through the API does not add them to the snapshot, so the recipe remains unaware that the dataset exists, and hence can't write into it.

    What you can do instead is use a "Python code" step in a scenario. Scenarios do not run on a snapshot and hence, Python steps can create datasets and write into them.
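    A minimal sketch of what such a "Python code" scenario step might look like, reusing the create_dataset call from the question. The helper names build_dataset_settings and create_and_fill are hypothetical, the connection 'xyz' and the tab-separated format are taken from the question, and df is assumed to be a pandas DataFrame prepared elsewhere in the step.

```python
def build_dataset_settings(project_key, dataset_name, connection="xyz"):
    """Return (params, format_params) mirroring the question's create_dataset call.

    Hypothetical helper: the 'xyz' connection and the tab-separated CSV
    format come from the question and should be adapted to your instance.
    """
    params = {"connection": connection, "path": project_key + "/" + dataset_name}
    format_params = {"separator": "\t", "style": "unix", "compress": ""}
    return params, format_params


def create_and_fill(dataset_name, df):
    """Create a filesystem CSV dataset and write df (a pandas DataFrame) into it.

    Meant to be called from a scenario "Python code" step, not from a recipe.
    """
    # Imported here so the helper above stays usable outside DSS
    import dataiku

    client = dataiku.api_client()
    project_key = dataiku.default_project_key()
    project = client.get_project(project_key)

    params, format_params = build_dataset_settings(project_key, dataset_name)
    project.create_dataset(dataset_name, type="Filesystem", params=params,
                           formatType="csv", formatParams=format_params)

    # Write the rows and let DSS derive the schema from the DataFrame
    dataiku.Dataset(dataset_name).write_with_schema(df)


# Inside the scenario step, assuming df already exists:
# create_and_fill("ABC", df)
```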

    Alternatively, you could have a scenario with:

    * First, a Python step that creates the dataset
    * Then a build step that runs the recipe

    Please note that in this latter case, you will need to use:

    dataset = dataiku.Dataset("my_new_dataset", ignore_flow=True)
    # ignore_flow=True indicates that you accept writing to a dataset that is not a declared output of the recipe. It is only needed in recipes.
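
    On the recipe side, the write could then look roughly like the sketch below. The function name write_outside_flow is hypothetical, df is assumed to be a pandas DataFrame produced by the recipe, and the dataset is assumed to have been created beforehand by the scenario's Python step.

```python
def write_outside_flow(dataset_name, df):
    """Write df into a dataset that is not declared as an output of the recipe.

    Assumes the dataset already exists, for example because an earlier
    scenario step created it.
    """
    # Imported inside the function so the sketch can be read outside DSS
    import dataiku

    # ignore_flow=True skips the "declare it as input or output" check
    out = dataiku.Dataset(dataset_name, ignore_flow=True)
    out.write_with_schema(df)


# In the recipe body, assuming df already exists:
# write_outside_flow("my_new_dataset", df)
```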
  • rs0105
    rs0105 Dataiku DSS Core Designer, Registered Posts: 5 ✭✭✭

    Thanks for the reply.

    I would also like to know how to create a dataset whose name follows the format "ABCYYYYMMDDHHMMSS".

    Thanks in advance.
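
    For illustration, a name in the "ABCYYYYMMDDHHMMSS" format could be built with the standard datetime module, as in this minimal sketch. The helper timestamped_name is hypothetical; the resulting string would then be passed as the dataset name to project.create_dataset in a scenario Python step.

```python
from datetime import datetime


def timestamped_name(prefix, now=None):
    """Return prefix followed by a YYYYMMDDHHMMSS timestamp.

    now defaults to the current local time; pass a datetime explicitly
    to get a reproducible name.
    """
    if now is None:
        now = datetime.now()
    return prefix + now.strftime("%Y%m%d%H%M%S")


# Fixed timestamp so the result is predictable
print(timestamped_name("ABC", datetime(2024, 7, 17, 9, 5, 3)))  # ABC20240717090503
```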
