Create N output datasets dynamically

info-rchitect Registered Posts: 169 ✭✭✭✭✭✭


I have a dataset that I want to partition into N datasets, where N will change over time. N is greater than 30, so I don't want to declare each output dataset manually in my Python recipe. Creating the N dataframes that would feed each dataset is easy enough in Python. Can I create the output datasets dynamically, without declaring each one by hand?


Operating system used: Windows 10

Best Answers

  • June
    June Dataiku DSS Core Designer, Registered Posts: 19 ✭✭✭✭
    Answer ✓

    This can be done using Python in a scenario.

    Here is some sample code that dynamically creates and names tables, then writes them as Dataiku datasets.

    """From the Superstore toy data, create separate datasets for each city."""
    import dataiku
    from dataiku import api_client

    # Instantiate the client
    client = api_client()
    proj = client.get_default_project()

    # Manage where the output data will be stored
    MY_DB_CNXN = 'My_Database_Connection'  # Name of a Dataiku database connection to write to, OR
    local_filesystem = 'filesystem_managed'  # this can be used to write to the local filesystem
    write_output_to = local_filesystem  # By default we will use the local filesystem

    # Read recipe inputs
    Superstore = dataiku.Dataset("Superstore")
    store_df = Superstore.get_dataframe()

    # Pre-process text: keep letters and spaces only, upper-case, spaces to underscores
    store_df['City'] = store_df['City'].fillna('').astype(str).str.replace(r'[^A-Za-z ]', '', regex=True)
    store_df['City'] = store_df['City'].str.upper().str.replace(' ', '_', regex=False)
    # Fall back to 'OTHER' for missing or empty names so every row maps to a valid dataset name
    store_df['City'] = store_df['City'].replace('', 'OTHER')
    cities = list(store_df.City.unique())

    # Take a small sample of unique cities so we only make 5 new datasets for this demo
    sample = cities[:5]

    for tbl_name in sample:
        df = store_df[store_df['City'] == tbl_name]

        # Get or create the dataset associated with the table name
        if any(x['name'] == tbl_name for x in proj.list_datasets()):
            dataset = proj.get_dataset(tbl_name)
        else:
            builder = proj.new_managed_dataset(tbl_name)
            builder.with_store_into(write_output_to)
            dataset = builder.create()

        # Write output
        output_ds = dataiku.Dataset(tbl_name)
        output_ds.write_with_schema(df)

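For readers who want to reuse the two ideas in the script above (splitting one dataframe by a cleaned-up column value, and getting or creating a managed dataset by name), here is a minimal sketch that factors them into functions. The names `clean_name`, `split_by_column` and `ensure_managed_dataset` are hypothetical helpers, not part of the Dataiku API. The `project` argument is assumed to behave like the object returned by `client.get_default_project()` in the answer, and the sketch relies only on the calls that answer already uses.

```python
import re

import pandas as pd


def clean_name(value):
    """Turn an arbitrary value into an upper-case, underscore-separated name."""
    if pd.isna(value):
        return "OTHER"
    # Keep letters and spaces only, as in the answer's pre-processing step.
    text = re.sub(r"[^A-Za-z ]", "", str(value)).strip()
    # Fall back to "OTHER" when nothing usable is left.
    return text.upper().replace(" ", "_") or "OTHER"


def split_by_column(df, column):
    """Return {cleaned name: sub-DataFrame}, one entry per distinct cleaned value."""
    keys = df[column].map(clean_name)
    return {name: part for name, part in df.groupby(keys)}


def ensure_managed_dataset(project, name, connection):
    """Get the dataset called `name`, creating it as a managed dataset if missing.

    Assumption: `project` exposes list_datasets(), get_dataset() and
    new_managed_dataset(), as the project object in the answer does.
    """
    existing = {item["name"] for item in project.list_datasets()}
    if name in existing:
        return project.get_dataset(name)
    builder = project.new_managed_dataset(name)
    builder.with_store_into(connection)
    return builder.create()
```

Inside DSS, one possible way to combine them is to loop over `split_by_column(store_df, 'City').items()`, call `ensure_managed_dataset(proj, name, write_output_to)` for each name, and then write each part with `dataiku.Dataset(name).write_with_schema(part)`, exactly as the loop in the answer does.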
  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,702 Neuron
    Answer ✓

    Could you please use a code block (the </> icon in the toolbar) to post your code snippet? It has lost all of its indentation, so it won't execute properly in Python.

