
Create a Dataset with Python Code

Level 4

The goal is to query the Redshift database for table names and return them as a dropdown for users in a plugin.

Problem: Where to store the interim table from SQL?

db_tables = dataiku.Dataset('db_tables')
SQLExecutor2.exec_recipe_fragment(db_tables, query)

TypeError: 'NoneType' object is not subscriptable

It seems the Dataset object cannot be created this way. Is there a workaround, aside from using an empty dataset as an input to the plugin?

4 Replies

Here's an example of how to create a dataset programmatically, in this case a text dataset (I can share code for SQL if interested; it's similar, but of course not exactly the same). I extracted this from a larger process, so I may not have captured every needed piece, but it should be a place to start. In particular, you'll need to populate the variable passed to the set_schema method; you can call get_schema on an existing dataset to see the format.


import dataiku

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
project_variables = dataiku.get_custom_variables()
csv_dataset_name = 'NEW_DATASET_NAME'

# Create the dataset if it doesn't already exist
try:
    csv_dataset = project.get_dataset(csv_dataset_name)  # doesn't generate an error if the dataset doesn't exist
    csv_dataset.get_definition()  # raises if the dataset does not actually exist
    # If the dataset exists, clear its data
    csv_dataset.clear()
except Exception:
    # Create the dataset (assuming the exception was that the dataset does not exist)
    params = {'connection': 'filesystem_folders',
              'path': project_variables['projectKey'] + '/' + csv_dataset_name}
    format_params = {'separator': '\t', 'style': 'unix', 'compress': ''}

    csv_dataset = project.create_dataset(csv_dataset_name, type='Filesystem', params=params,
                                         formatType='csv', formatParams=format_params)

    # Set the dataset to managed, and save the modified definition back
    ds_def = csv_dataset.get_definition()
    ds_def['managed'] = True
    csv_dataset.set_definition(ds_def)

# Set the schema; csv_dku_schema_columns must be populated first (see note above)
csv_dataset.set_schema({'columns': csv_dku_schema_columns})

# If you want to delete it later...
csv_dataset.clear()  # removes the folder and file
Level 2

I think this is a useful example of how to create datasets dynamically with Python code.

However, I see no method for writing data from a Pandas DataFrame to the created Dataiku dataset.

I checked the dataikuapi reference, but could not find any applicable method.

It would be great if the example above could be extended to explain how to do this.

The example in the documentation shows the following code:

from os import listdir
import pandas

project = client.get_project('TEST_PROJECT')
folder_path = 'path/to/folder/'
for file in listdir(folder_path):
    if not file.endswith('.csv'):
        continue
    dataset = project.create_dataset(file[:-4]  # dot is not allowed in dataset names
        , params={
            'connection': 'filesystem_root'
            ,'path': folder_path + file
        }, formatType='csv'
        , formatParams={
            'separator': ','
            ,'style': 'excel'  # excel-style quoting
            ,'parseHeaderRow': True
        })
    df = pandas.read_csv(folder_path + file)
    dataset.set_schema({'columns': [{'name': column, 'type': 'string'} for column in df.columns]})

But unfortunately, the example doesn't actually show how to write the Pandas df.

 Thanks in advance!


Hi @berndito,

Check the documentation on writing dataframes: see the write_dataframe method in the Dataset class.


Level 1

The error is self-explanatory: you are trying to index None, which you cannot do because a 'NoneType' object is not subscriptable.

In general, the error means you attempted to index an object that doesn't support indexing. You may have noticed that methods like sort(), which only modify the list in place, print no return value: they return the default None. This is a design principle for all mutable data structures in Python.
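A short illustration of how this arises with an in-place method:

```python
numbers = [3, 1, 2]
result = numbers.sort()  # sorts in place and returns None
print(numbers)           # [1, 2, 3]
print(result)            # None

try:
    result[0]            # indexing None
except TypeError as e:
    print(e)             # 'NoneType' object is not subscriptable
```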

