Uploading multiple sheets/files at once

Options
Romana
Romana Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 15 ✭✭✭✭
Hi,

I am trying to upload multiple excel files with multiple sheets at the same time into one dataset. Is it possible to add a special column indicating the original dataset (excel file + sheet)?

Many thanks
Tagged:

Answers

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    edited July 18
    Options

    Hi,

    If columns are different across sheets, you will need to upload the dataset multiple times for each set of sheets with the same columns. Another option would be to put the file in a managed folder and use several files-in-folder datasets. Then you can use a stack recipes to remap the columns as necessary.

    Alternatively, you could read the file with python from a DSS managed folder as such:


    import dataiku
    import pandas as pd, numpy as np
    import os

    folder_path = dataiku.Folder("MYFOLDER").get_path()
    excel_path = os.path.join(folder_path, "MYFILE.xlsx")
    sheet_list = ["SHEET1", "SHEET2"]


    df_dict = {
    k: pd.read_excel(excel_path, sheet_name = k, engine = "xlrd")
    for k in sheet_list
    }


    for k, df in df_dict.items():
    df["origin_sheet_name"] = k


    df_stacked = pd.concat([v for k,v in df_dict.items()], axis = 0, ignore_index = True, sort = False)

    Hope it helps,

    Alex

  • Romana
    Romana Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 15 ✭✭✭✭
    Options
    Hi Alex,
    many thanks for your answer, however the code does not seem to work.
    Even though I specified the list of sheets for option "sheet_name=", it always reads only the first excel sheet. Even when I specifically select one particular sheet (other than first) it gives me the same result - first sheet.
    Do you know what could be the issue here?

    Romana
  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Options
    Hi, I would advise to test with various options of pandas read_excel() for instance the engine.
  • AshleyW
    AshleyW Dataiker, Alpha Tester, Dataiku DSS Core Designer, Registered, Product Ideas Manager Posts: 161 Dataiker
    Options

    Hi,

    Updating this thread to let you know that Dataiku know lets you use the sheet name of an Excel as a column in the dataset. When you configure the format of an uploaded Excel, there is a checkbox 'Add the sheet name as an output column' which will do just that. The steps are outlined in this tutorial.

    I hope it helps!

    Cheers,

    Ashley

Setup Info
    Tags
      Help me…