Community Conundrum 28: News Engagement is live! Read More

Create a Dataset from an excel file in Python recipe

Level 1
Create a Dataset from an excel file in Python recipe

Hi,



I am having trouble when creating a Dataset from an excel file. I am using the Rest API client in a Python recipe. I've suceeded in doing so from a csv file with this code :




client = dataiku.api_client()
project = client.get_project('******')
folder_path = '/Users/dmer/Desktop/*****/'
for file in os.listdir(folder_path):
if not file.endswith('.csv'):
continue
dataset = project.create_dataset(file[:-4] # dot is not allowed in dataset names
,'Filesystem'
, params={
'connection': 'filesystem_root'
,'path': folder_path + file
}, formatType='csv'
, formatParams={
'separator': ','
,'style': 'excel' # excel-style quoting
,'parseHeaderRow': True
})
df = pd.read_csv(folder_path + file)
dataset.set_schema({'columns': [{'name': column, 'type':'string'} for column in df.columns]})


When i change it for excel format :




client = dataiku.api_client()
project = client.get_project('*****')
folder_path = '/Users/dmer/Desktop/*****/'
for file in os.listdir(folder_path):
if not file.endswith('.xls'):
continue
dataset = project.create_dataset(file[:-4] # dot is not allowed in dataset names
,'Filesystem'
, params={
'connection': 'filesystem_root'
,'path': folder_path + file
}, formatType='xls'
, formatParams={
'separator': ','
,'style': 'excel' # excel-style quoting
,'parseHeaderRow': True
})
df = pd.read_excel(folder_path + file)
dataset.set_schema({'columns': [{'name': column, 'type':'string'} for column in df.columns]})


I get this exception :




DataikuException: com.dataiku.common.server.DKUControllerBase$MalformedRequestException: Could not parse a SerializedDataset from request body


Could you please help me ?



Thanx

0 Kudos
5 Replies
Dataiker
Dataiker

Hi,



The best and simple way to populate a dataset is to:



1. create an empty "managed dataset" from the flow (one time visual operation),



2. add it as output to your recipe, and then



3. write to this output using one of the dataiku method such as .write_with_schema()



See https://doc.dataiku.com/dss/latest/code_recipes/python.html for more details.



Another option if you have many files would be to add all the files in a Dataiku Folder and then use the "files in folder" functionality. (https://doc.dataiku.com/dss/latest/connecting/files-in-folder.html)



Cheers,



Alex

0 Kudos
Level 1
Author
Thank you, I kinda thought there was an "editable dataset" feature but I had no clue where to find it.
So if I got it right, the create_dataset method in the API can only work with csv files ? I couldn't find any detailed documentation on its parameters.
0 Kudos
Dataiker
Dataiker
In order to get the right parameters to create a certain type of dataset, I advise to:
1. create the dataset type you want from the flow
2. use the dataiku API to get the dataset JSON definition for that given type
3. use part of that JSON to create new datasets using the dataiku API

Please find an example below for a dataset on the filesystem_managed connection:

dataset = project.create_dataset(
dataset_name
,'Filesystem'
, params={
u'connection': u'filesystem_managed',
u'filesSelectionRules': {u'excludeRules': [],
u'explicitFiles': [],
u'includeRules': [],
u'mode': u'ALL'},
u'path': path_dataset
},
formatType='csv',
formatParams={
'separator': ','
,'style': 'excel' # excel-style quoting
,'parseHeaderRow': True
}
)
0 Kudos
Level 1
Author
Hey,
I did what you suggested but I can't write_with_schema on these editable datasets : they are read-only. How can I overcome this ?
0 Kudos
Dataiker
Dataiker
Editable datasets currently cannot be modified using the dataiku API. I have logged this in our backlog. Right now, I would advise using the Sync recipe to sync the editable dataset to a "regular" dataset. And write_with_schema on the regular dataset.
0 Kudos
Labels (3)
A banner prompting to get Dataiku DSS