
Create a Dataset with Python Code


The goal is to query the Redshift DB for table names and return them as a dropdown for users in a plugin.

Problem: where should the interim table from the SQL query be stored?

import dataiku
from dataiku import SQLExecutor2

query = "SELECT * FROM PG_TABLES"
db_tables = dataiku.Dataset('db_tables')  # 'db_tables' does not exist yet in the project
SQLExecutor2.exec_recipe_fragment(db_tables, query)

TypeError: 'NoneType' object is not subscriptable

It seems the Dataset object cannot be created this way. Is there a workaround, other than using an empty dataset as an input to the plugin?
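For the dropdown itself, here is a minimal sketch of shaping table names into a list of value/label choices, assuming the names have already been fetched. The helper name build_choices is hypothetical, and the SQLExecutor2 lines in the comments assume a connection called "my_redshift", which is not from this thread.

```python
def build_choices(table_names):
    """Turn an iterable of table names into value/label dicts for a dropdown.

    Inside DSS the names might come from something like (assumption, with a
    hypothetical connection name):
        from dataiku import SQLExecutor2
        df = SQLExecutor2(connection="my_redshift").query_to_df(
            "SELECT tablename FROM pg_tables")
        table_names = df["tablename"].tolist()
    """
    seen = set()
    choices = []
    for name in table_names:
        if name is None:
            continue  # skip NULLs coming back from the query
        name = str(name).strip()
        if not name or name in seen:
            continue  # skip blanks and duplicates
        seen.add(name)
        choices.append({"value": name, "label": name})
    # Sort case-insensitively so the dropdown is easy to scan
    return sorted(choices, key=lambda c: c["label"].lower())
```

Reading the query result into a DataFrame this way would avoid needing an interim dataset at all, since nothing is written back to the Flow.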

1 Reply

Here's an example of how to create a dataset programmatically, in this case a text (CSV) dataset. I can share code for SQL if you're interested; it's similar but not exactly the same. I extracted this from a larger process, so I may not have included all the needed pieces, but it should be a place to start. In particular, you'll need to populate the variable passed to the set_schema method (csv_dku_schema_columns below). You can call get_schema on an existing dataset to see the format.

Marlan

import dataiku
import dataikuapi

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
project_variables = dataiku.get_custom_variables()
csv_dataset_name = 'NEW_DATASET_NAME'

# Create dataset if it doesn't already exist
try:
	# If dataset exists, clear it
	csv_dataset = project.get_dataset(csv_dataset_name) # returns a handle; doesn't generate error if dataset doesn't exist
	csv_dataset.clear() # this is the call that fails when the dataset is missing
except Exception:
	# Create dataset (assuming exception was that dataset does not exist)
	params = {'connection': 'filesystem_folders', 'path': project_variables['projectKey']  + '/' + csv_dataset_name}
	format_params = {'separator': '\t', 'style': 'unix', 'compress': ''}

	csv_dataset = project.create_dataset(csv_dataset_name, type='Filesystem', params=params,
										 formatType='csv', formatParams=format_params)

	# Set dataset to managed
	ds_def = csv_dataset.get_definition()
	ds_def['managed'] = True
	csv_dataset.set_definition(ds_def)

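# Set schema (csv_dku_schema_columns is not defined in this snippet; populate it first
# with a list of column dicts such as {'name': 'col1', 'type': 'string'})
csv_dataset.set_schema({'columns': csv_dku_schema_columns})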

# If you want to delete it later...
csv_dataset.clear() # removes folder and file
csv_dataset.delete()
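
The snippet above leaves csv_dku_schema_columns for you to fill in. As a rough sketch, assuming the format is a list of dicts with 'name' and 'type' keys (check get_schema on one of your own datasets to confirm), it could be built from (name, type) pairs like this. The helper build_schema_columns and the example column names are hypothetical.

```python
def build_schema_columns(column_types):
    """Build a schema column list from (name, type) pairs.

    Assumption: each column is a dict with 'name' and 'type' keys, as seen
    in the output of get_schema on an existing dataset.
    """
    columns = []
    seen = set()
    for name, col_type in column_types:
        if not name:
            raise ValueError("column name must not be empty")
        if name in seen:
            raise ValueError("duplicate column name: " + name)
        seen.add(name)
        columns.append({"name": name, "type": col_type})
    return columns


# Example column names and types, for illustration only
csv_dku_schema_columns = build_schema_columns([
    ("schemaname", "string"),
    ("tablename", "string"),
])
```

The resulting list can then be passed as the 'columns' value in the set_schema call.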