Create a Dataset from an Excel file in a Python recipe

dmer

Hi,

I am having trouble creating a Dataset from an Excel file. I am using the REST API client in a Python recipe. I've succeeded in doing so from a CSV file with this code:


import dataiku
import os
import pandas as pd

client = dataiku.api_client()
project = client.get_project('******')
folder_path = '/Users/dmer/Desktop/*****/'
for file in os.listdir(folder_path):
    if not file.endswith('.csv'):
        continue
    dataset = project.create_dataset(
        file[:-4],  # dot is not allowed in dataset names
        'Filesystem',
        params={
            'connection': 'filesystem_root',
            'path': folder_path + file
        },
        formatType='csv',
        formatParams={
            'separator': ',',
            'style': 'excel',  # excel-style quoting
            'parseHeaderRow': True
        })
    df = pd.read_csv(folder_path + file)
    dataset.set_schema({'columns': [{'name': column, 'type': 'string'} for column in df.columns]})

When I change it to the Excel format:


client = dataiku.api_client()
project = client.get_project('*****')
folder_path = '/Users/dmer/Desktop/*****/'
for file in os.listdir(folder_path):
    if not file.endswith('.xls'):
        continue
    dataset = project.create_dataset(
        file[:-4],  # dot is not allowed in dataset names
        'Filesystem',
        params={
            'connection': 'filesystem_root',
            'path': folder_path + file
        },
        formatType='xls',
        formatParams={
            'separator': ',',
            'style': 'excel',  # excel-style quoting
            'parseHeaderRow': True
        })
    df = pd.read_excel(folder_path + file)
    dataset.set_schema({'columns': [{'name': column, 'type': 'string'} for column in df.columns]})

I get this exception:


DataikuException: com.dataiku.common.server.DKUControllerBase$MalformedRequestException: Could not parse a SerializedDataset from request body

Could you please help me?

Thanks

Answers

  • Alex_Combessie

    Hi,

    The best and simplest way to populate a dataset is to:

    1. create an empty "managed dataset" from the flow (a one-time visual operation),

    2. add it as an output to your recipe, and then

    3. write to this output using one of the dataiku methods, such as .write_with_schema() (sketched below).

    See https://doc.dataiku.com/dss/latest/code_recipes/python.html for more details.
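
    As a minimal sketch of steps 2 and 3 (assuming the managed output dataset is named "my_output" and the file path is illustrative):

    import dataiku
    import pandas as pd

    # read the source file with pandas (path is hypothetical)
    df = pd.read_excel('/Users/dmer/Desktop/myfile.xls')

    # "my_output" must already exist in the flow as an output of this recipe
    output = dataiku.Dataset("my_output")
    output.write_with_schema(df)  # writes both the schema and the rows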

    Another option, if you have many files, would be to add all the files to a Dataiku folder and then use the "files in folder" functionality (https://doc.dataiku.com/dss/latest/connecting/files-in-folder.html). A rough sketch follows.
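
    The sketch assumes a managed folder named "excel_files" already exists and is an output of the recipe; names and paths are illustrative:

    import os
    import dataiku

    folder = dataiku.Folder("excel_files")
    local_dir = '/Users/dmer/Desktop/myfiles/'

    for file in os.listdir(local_dir):
        if not file.endswith('.xls'):
            continue
        # copy each local Excel file into the managed folder under the same name
        folder.upload_file(file, os.path.join(local_dir, file))

    A "files in folder" dataset can then be created on top of that folder from the flow.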

    Cheers,

    Alex

  • dmer
    Thank you, I kind of thought there was an "editable dataset" feature, but I had no clue where to find it.
    So if I got it right, can the create_dataset method in the API only work with CSV files? I couldn't find any detailed documentation on its parameters.
  • Alex_Combessie
    In order to get the right parameters to create a certain type of dataset, I advise you to:
    1. create the dataset type you want from the flow,
    2. use the dataiku API to get the dataset's JSON definition for that given type (sketched just below), and
    3. use part of that JSON to create new datasets using the dataiku API.
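
    As a minimal sketch of step 2 (assuming an Excel dataset named "my_excel_dataset" was already created visually in the flow; the project key is hypothetical):

    import json
    import dataiku

    client = dataiku.api_client()
    project = client.get_project('MYPROJECT')

    existing = project.get_dataset("my_excel_dataset")
    definition = existing.get_definition()

    # inspect params, formatType and formatParams, then reuse them in create_dataset()
    print(json.dumps(definition, indent=2))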

    Please find an example below for a dataset on the filesystem_managed connection:

    dataset = project.create_dataset(
        dataset_name,
        'Filesystem',
        params={
            u'connection': u'filesystem_managed',
            u'filesSelectionRules': {
                u'excludeRules': [],
                u'explicitFiles': [],
                u'includeRules': [],
                u'mode': u'ALL'
            },
            u'path': path_dataset
        },
        formatType='csv',
        formatParams={
            'separator': ',',
            'style': 'excel',  # excel-style quoting
            'parseHeaderRow': True
        }
    )
  • dmer
    Hey,
    I did what you suggested, but I can't write_with_schema on these editable datasets: they are read-only. How can I overcome this?
  • Alex_Combessie
    Editable datasets currently cannot be modified using the dataiku API; I have logged this in our backlog. Right now, I would advise using a Sync recipe to sync the editable dataset to a "regular" dataset, and then calling write_with_schema on the regular dataset.