Create a Dataset from an excel file in Python recipe
Hi,
I am having trouble when creating a Dataset from an excel file. I am using the Rest API client in a Python recipe. I've suceeded in doing so from a csv file with this code :
client = dataiku.api_client()
project = client.get_project('******')
folder_path = '/Users/dmer/Desktop/*****/'
for file in os.listdir(folder_path):
if not file.endswith('.csv'):
continue
dataset = project.create_dataset(file[:-4] # dot is not allowed in dataset names
,'Filesystem'
, params={
'connection': 'filesystem_root'
,'path': folder_path + file
}, formatType='csv'
, formatParams={
'separator': ','
,'style': 'excel' # excel-style quoting
,'parseHeaderRow': True
})
df = pd.read_csv(folder_path + file)
dataset.set_schema({'columns': [{'name': column, 'type':'string'} for column in df.columns]})
When i change it for excel format :
client = dataiku.api_client()
project = client.get_project('*****')
folder_path = '/Users/dmer/Desktop/*****/'
for file in os.listdir(folder_path):
if not file.endswith('.xls'):
continue
dataset = project.create_dataset(file[:-4] # dot is not allowed in dataset names
,'Filesystem'
, params={
'connection': 'filesystem_root'
,'path': folder_path + file
}, formatType='xls'
, formatParams={
'separator': ','
,'style': 'excel' # excel-style quoting
,'parseHeaderRow': True
})
df = pd.read_excel(folder_path + file)
dataset.set_schema({'columns': [{'name': column, 'type':'string'} for column in df.columns]})
I get this exception :
DataikuException: com.dataiku.common.server.DKUControllerBase$MalformedRequestException: Could not parse a SerializedDataset from request body
Could you please help me ?
Thanx
Answers
-
Hi,
The best and simple way to populate a dataset is to:
1. create an empty "managed dataset" from the flow (one time visual operation),
2. add it as output to your recipe, and then
3. write to this output using one of the dataiku method such as .write_with_schema()
See https://doc.dataiku.com/dss/latest/code_recipes/python.html for more details.
Another option if you have many files would be to add all the files in a Dataiku Folder and then use the "files in folder" functionality. (https://doc.dataiku.com/dss/latest/connecting/files-in-folder.html)
Cheers,
Alex
-
Thank you, I kinda thought there was an "editable dataset" feature but I had no clue where to find it.
So if I got it right, the create_dataset method in the API can only work with csv files ? I couldn't find any detailed documentation on its parameters. -
In order to get the right parameters to create a certain type of dataset, I advise to:
1. create the dataset type you want from the flow
2. use the dataiku API to get the dataset JSON definition for that given type
3. use part of that JSON to create new datasets using the dataiku API
Please find an example below for a dataset on the filesystem_managed connection:
dataset = project.create_dataset(
dataset_name
,'Filesystem'
, params={
u'connection': u'filesystem_managed',
u'filesSelectionRules': {u'excludeRules': [],
u'explicitFiles': [],
u'includeRules': [],
u'mode': u'ALL'},
u'path': path_dataset
},
formatType='csv',
formatParams={
'separator': ','
,'style': 'excel' # excel-style quoting
,'parseHeaderRow': True
}
) -
Hey,
I did what you suggested but I can't write_with_schema on these editable datasets : they are read-only. How can I overcome this ? -
Editable datasets currently cannot be modified using the dataiku API. I have logged this in our backlog. Right now, I would advise using the Sync recipe to sync the editable dataset to a "regular" dataset. And write_with_schema on the regular dataset.