Create a Dataset from an excel file in Python recipe

dmer · June 2019

Hi,

I am having trouble when creating a Dataset from an excel file. I am using the Rest API client in a Python recipe. I've suceeded in doing so from a csv file with this code :


client = dataiku.api_client()
project = client.get_project('******')
folder_path = '/Users/dmer/Desktop/*****/'
for file in os.listdir(folder_path):
    if not file.endswith('.csv'):
        continue
    dataset = project.create_dataset(file[:-4]  # dot is not allowed in dataset names
        ,'Filesystem'
        , params={
            'connection': 'filesystem_root'
            ,'path': folder_path + file
        }, formatType='csv'
        , formatParams={
            'separator': ','
            ,'style': 'excel'  # excel-style quoting
            ,'parseHeaderRow': True
        })
    df = pd.read_csv(folder_path + file)
    dataset.set_schema({'columns': [{'name': column, 'type':'string'} for column in df.columns]})

When i change it for excel format :


client = dataiku.api_client()
project = client.get_project('*****')
folder_path = '/Users/dmer/Desktop/*****/'
for file in os.listdir(folder_path):
    if not file.endswith('.xls'):
        continue
    dataset = project.create_dataset(file[:-4]  # dot is not allowed in dataset names
        ,'Filesystem'
        , params={
            'connection': 'filesystem_root'
            ,'path': folder_path + file
        }, formatType='xls'
        , formatParams={
            'separator': ','
            ,'style': 'excel'  # excel-style quoting
            ,'parseHeaderRow': True
        })
    df = pd.read_excel(folder_path + file)
    dataset.set_schema({'columns': [{'name': column, 'type':'string'} for column in df.columns]})

I get this exception :


DataikuException: com.dataiku.common.server.DKUControllerBase$MalformedRequestException: Could not parse a SerializedDataset from request body

Could you please help me ?

Thanx

Alex_Combessie · June 2019

Hi,

The best and simple way to populate a dataset is to:

1. create an empty "managed dataset" from the flow (one time visual operation),

2. add it as output to your recipe, and then

3. write to this output using one of the dataiku method such as .write_with_schema()

See https://doc.dataiku.com/dss/latest/code_recipes/python.html for more details.

Another option if you have many files would be to add all the files in a Dataiku Folder and then use the "files in folder" functionality. (https://doc.dataiku.com/dss/latest/connecting/files-in-folder.html)

Cheers,

Alex

dmer · June 2019

Thank you, I kinda thought there was an "editable dataset" feature but I had no clue where to find it.
So if I got it right, the create_dataset method in the API can only work with csv files ? I couldn't find any detailed documentation on its parameters.

Alex_Combessie · June 2019

In order to get the right parameters to create a certain type of dataset, I advise to:
1. create the dataset type you want from the flow
2. use the dataiku API to get the dataset JSON definition for that given type
3. use part of that JSON to create new datasets using the dataiku API

Please find an example below for a dataset on the filesystem_managed connection:

dataset = project.create_dataset(
dataset_name
,'Filesystem'
, params={
u'connection': u'filesystem_managed',
u'filesSelectionRules': {u'excludeRules': [],
u'explicitFiles': [],
u'includeRules': [],
u'mode': u'ALL'},
u'path': path_dataset
},
formatType='csv',
formatParams={
'separator': ','
,'style': 'excel' # excel-style quoting
,'parseHeaderRow': True
}
)

dmer · June 2019

Hey,
I did what you suggested but I can't write_with_schema on these editable datasets : they are read-only. How can I overcome this ?

Alex_Combessie · June 2019

Editable datasets currently cannot be modified using the dataiku API. I have logged this in our backlog. Right now, I would advise using the Sync recipe to sync the editable dataset to a "regular" dataset. And write_with_schema on the regular dataset.

Create a Dataset from an excel file in Python recipe

Answers

Categories

Setup Info

Tags