I would like to create Dataiku datasets in bulk using the Python interpreter, without creating them manually in the recipe.
Note: the following commands work only if I have already created a Dataiku dataset called "myoutputdataset" as an output of my recipe. My problem is how to create a new Dataiku dataset without declaring it in the recipe beforehand, and then save my pandas dataframe into it:
output_ds = dataiku.Dataset("myoutputdataset")
output_ds.write_with_schema(my_dataframe)
Hi,
"myoutputdataset" and "my_dataframe" are just placeholders that need to be changed with your own names / code.
For instance, the following (complete) recipe has a output DSS dataset called "results" which is filled by a Pandas dataframe called "o":
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd
# Recipe inputs
titanic = dataiku.Dataset("titanic")
df = titanic.get_dataframe()
# Some Python code
# ...
o = df.sort_values('PassengerId')  # DataFrame.sort was removed in pandas 0.20; use sort_values
# Recipe outputs
output = dataiku.Dataset("results")
output.write_with_schema(o)
Hope this helps.
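A side note for readers on a recent pandas: `DataFrame.sort` no longer exists (it was removed in pandas 0.20), and `sort_values` is the replacement. A minimal sketch with a toy frame standing in for the Titanic dataset (the data here is made up for illustration):

```python
import pandas as pd

# Toy frame standing in for the Titanic dataset (hypothetical data)
df = pd.DataFrame({"PassengerId": [3, 1, 2], "Name": ["c", "a", "b"]})

# sort_values replaces the removed DataFrame.sort
o = df.sort_values("PassengerId")
print(o["PassengerId"].tolist())  # -> [1, 2, 3]
```

The rest of the recipe (reading the input dataset and writing the output) is unchanged.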
@N_JAYANTH Any luck in finding a solution?
Exception: None: b'dataset does not exist: EGMED.s22'
I know that error happens because there is no "s22" dataset yet, but my question is: can I create that placeholder automatically?
The "results" Dataset is not created by the Python code, but when you create your Recipe first:
import dataiku
import pandas as pd, numpy as np
# EXPOSE CLIENT AND CURRENT PROJECT IN ORDER TO CREATE NEW DATASETS
client = dataiku.api_client()
project = client.get_default_project()
# CREATE NEW DATASET -- RETURN DATAFRAME OF CREATED DATASET
def createDataset(datasetName, schema_columns=None, data=None, ignoreFlow=True):
    new_Builder = project.new_managed_dataset(datasetName)
    new_Builder.with_store_into("filesystem_folders")
    new_Dataset = new_Builder.create(overwrite=True)  # WILL OVERWRITE AN EXISTING DATASET OF THE SAME NAME
    new_Dataset_settings = new_Dataset.get_settings()
    new_Dataset_settings.set_csv_format()
    columnCount = 2
    if schema_columns is None:
        new_Dataset_settings.add_raw_schema_column({'name': 'Default Int', 'type': 'int'})
        new_Dataset_settings.add_raw_schema_column({'name': 'Default String', 'type': 'string'})
    else:
        columnCount = len(schema_columns)
        for column in schema_columns:
            new_Dataset_settings.add_raw_schema_column(column)
    new_Dataset_settings.save()
    new_Dataset = dataiku.Dataset(datasetName)
    writer = None
    try:
        writer = new_Dataset.get_writer()
        if data is not None:
            for row in data:
                # CLAMP EACH ROW TO THE SCHEMA WIDTH (EXTRA CELLS ARE DROPPED)
                iterativeLimit = min(columnCount, len(row))
                writer.write_row_array(row[:iterativeLimit])
        else:
            writer.write_row_array((0, "_"))
    finally:
        if writer is not None:
            writer.close()
    if ignoreFlow:
        outputDataset = dataiku.Dataset(datasetName, ignore_flow=True)  # for use in flow
    else:
        outputDataset = dataiku.Dataset(datasetName)  # Notebook testing
    return outputDataset.get_dataframe()
myData = [
    [1, "blah", "aaaaaaaaaaaaa"],
    [2, "blah blah"],
    [3, "blah blah blah"]
]
myColumns = [
    {'name': 'Integers Here', 'type': 'int'},
    {'name': 'super special column', 'type': 'string'}
]
createDataset("A_Great_Name", myColumns, myData, False)
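Note that the row-handling loop in createDataset simply clamps each row to the schema width: rows with more cells than the schema has columns are truncated, and shorter rows are written as-is. That logic can be checked outside DSS with plain Python (clamp_row is a hypothetical helper mirroring the loop above, not part of the dataiku API):

```python
def clamp_row(row, column_count):
    """Mirror of the loop in createDataset: keep at most column_count cells."""
    limit = min(column_count, len(row))
    return [row[i] for i in range(limit)]

# Two schema columns, as in myColumns above
print(clamp_row([1, "blah", "aaaaaaaaaaaaa"], 2))  # -> [1, 'blah']
print(clamp_row([2, "blah blah"], 2))              # -> [2, 'blah blah']
```

So with the sample data above, the third cell of the first row never reaches the dataset.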