Python code to create a new Dataiku dataset
I would like to massively create Dataiku datasets using the Python interpreter, without creating them manually in the recipe.
Note: The following command works only if I have created a Dataiku dataset called "myoutputdataset" in my recipe. But my problem is to create a new Dataiku Dataset without creating it beforehand in my recipe, and to save my pandas dataframe in it:
output_ds = dataiku.Dataset("myoutputdataset")
output_ds.write_with_schema(my_dataframe)
Answers
-
Hi,
"myoutputdataset" and "my_dataframe" are just placeholders that need to be changed with your own names / code.
For instance, the following (complete) recipe has an output DSS dataset called "results" which is filled by a Pandas dataframe called "o":
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd
# Recipe inputs
titanic = dataiku.Dataset("titanic")
df = titanic.get_dataframe()
# Some Python code
# ...
o = df.sort_values('PassengerId')
# Recipe outputs
output = dataiku.Dataset("results")
output.write_with_schema(o)
Hope this helps.
-
I think you misunderstood my question. I know that "myoutputdataset" and "my_dataframe" are just placeholders. In your code
output = dataiku.Dataset("results")
what is "results". I suppose its a dataiku database, So you have already have a dataiku database named "results". Thats why you are able to write into it. My Question is how do you create the "results" database in dataiku using python code -
The "results" Dataset is not created by the Python code, but when you create your Recipe first:
-
The output dataset of a recipe is created in the recipe creation modal.
In case you really want to massively create datasets, there is a Python API to administer DSS that you can use:
http://doc.dataiku.com/dss/latest/api/public/index.html
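For instance, a rough sketch of mass-creating datasets with that API could look like the following (the host URL, API key, project key and dataset names are placeholders to adapt to your instance, and the connection and format parameters depend on where you want the data stored):
import dataikuapi

# Placeholders: replace with your DSS URL, a personal API key and your project key
client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
project = client.get_project("MYPROJECT")

# Create several filesystem-backed CSV datasets in one loop
for name in ["results_1", "results_2", "results_3"]:
    project.create_dataset(name, "Filesystem",
                           params={"connection": "filesystem_managed", "path": "MYPROJECT/" + name},
                           formatType="csv",
                           formatParams={"separator": "\t", "style": "unix", "compress": ""})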
Note that this API is NOT intended to be used to create the output dataset of a single recipe.
-
So how do I create massive datasets like "results" without mentioning them in the recipe?
-
Yes @kenjil, I would like to create massive datasets.
-
This has nothing to do with the size of the dataset but with the number of datasets you want to create. There is no point using that API to create a single dataset, whatever its size.
-
I want to create a large number of datasets. Is there a method to do this? Please note that I have a COMMUNITY EDITION license for DSS.
-
I'm sorry. The admin API is not available in DSS Free Edition.
-
So there is no other way to create a large number of datasets with DSS Free Edition?
-
Note: if these datasets are linked to existing tables in a SQL connection, you can mass-create datasets for these tables in the connection settings UI, in the DSS administration.
-
What if my data files are CSV files? Is there a way to convert a large number of CSV files into a large number of Dataiku datasets? @kenjil
-
You can write some Python code that reads your CSV files from a static path and then writes them individually into DSS.
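A rough sketch of that idea, assuming a DSS dataset already exists for each file and that the dataset names match the CSV file names (the directory path is a placeholder):
import os
import glob
import pandas as pd
import dataiku

csv_dir = "/path/to/csv/files"  # placeholder: the static path holding your CSV files
for csv_path in glob.glob(os.path.join(csv_dir, "*.csv")):
    name = os.path.splitext(os.path.basename(csv_path))[0]  # dataset named after the file
    df = pd.read_csv(csv_path)
    dataiku.Dataset(name).write_with_schema(df)  # write the dataframe and its schema into DSS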
-
But this is exactly the question, I think. How can one create the "results" dataset using only the Python code inside the recipe? Or, in other words: how can a Python recipe add outputs to itself?
-
@N_JAYANTH
Any luck in finding a solution?
-
Exception: None: b'dataset does not exist: EGMED.s22'
I know that error happens because there is no "s22" placeholder, but my question is: can I create that placeholder automatically?
-
import dataiku
import pandas as pd, numpy as np

# EXPOSE CLIENT AND CURRENT PROJECT IN ORDER TO CREATE NEW DATASETS
client = dataiku.api_client()
project = client.get_default_project()

# CREATE NEW DATASET -- RETURN DATAFRAME OF CREATED DATASET
def createDataset(datasetName, schema_columns=None, data=None, ignoreFlow=True):
    new_Builder = project.new_managed_dataset(datasetName)
    new_Builder.with_store_into("filesystem_folders")
    new_Dataset = new_Builder.create(overwrite=True)  # WILL OVERWRITE AN EXISTING DATASET OF THE SAME NAME
    new_Dataset_settings = new_Dataset.get_settings()
    new_Dataset_settings.set_csv_format()
    columnCount = 2
    if schema_columns is None:
        new_Dataset_settings.add_raw_schema_column({'name': 'Default Int', 'type': 'int'})
        new_Dataset_settings.add_raw_schema_column({'name': 'Default String', 'type': 'string'})
    else:
        columnCount = len(schema_columns)
        for column in schema_columns:
            new_Dataset_settings.add_raw_schema_column(column)
    new_Dataset_settings.save()
    new_Dataset = dataiku.Dataset(datasetName)
    try:
        if data is not None:
            writer = new_Dataset.get_writer()
            for row in data:
                rowCellCount = len(row)
                rowToAdd = []
                iterativeLimit = 0
                if columnCount > rowCellCount:
                    iterativeLimit = rowCellCount
                else:
                    iterativeLimit = columnCount
                for i in range(0, iterativeLimit):
                    rowToAdd.append(row[i])
                writer.write_row_array(rowToAdd)
        else:
            writer = new_Dataset.get_writer()
            writer.write_row_array((0, "_"))
    except:
        try:
            writer.close()
        except:
            pass
    try:
        writer.close()
    except:
        pass
    if ignoreFlow:
        outputDataset = dataiku.Dataset(datasetName, ignore_flow=True)  # for use in flow
        return outputDataset.get_dataframe()
    else:
        outputDataset = dataiku.Dataset(datasetName)  # Notebook testing
        return outputDataset.get_dataframe()

myData = [
    [1, "blah", "aaaaaaaaaaaaa"],
    [2, "blah blah"],
    [3, "blah blah blah"]
]
myColumns = [
    {'name': 'Integers Here', 'type': 'int'},
    {'name': 'super special column', 'type': 'string'}
]
createDataset("A_Great_Name", myColumns, myData, False)