Read xlsx. File from managed folder using Python

ShGH
ShGH Registered Posts: 1

Hi team,

I'm using a Python recipe in Dataiku to read a specific .xlsx file from a managed folder, but I'm encountering an error when trying to load the file into a DataFrame.

Here’s a simplified version of my code:

folder = dataiku.Folder("FOLDER_ID")
file_list = folder.list_paths_in_partition()
last_month_str = "YYYYMM"  # Format looks like this
pattern = re.compile(r"^/FILENAME (\d{6}).*.xlsx$", re.IGNORECASE)

for file_path in file_list:
  match = pattern.match(file_path)
  if match and match.group(1) == last_month_str:
    with folder.get_download_stream(file_path) as stream:
      file_content = stream.read()
      df = pd.read_excel(io.BytesIO(file_content), header=2)

Error:

Error in Python process: At line 29: <class 'ImportError'>: Missing optional dependency 'xlrd'. Install xlrd >= 2.0.1 for xls Excel support Use pip or conda to install xlrd.

Thanks in advance

Shirin

Operating system used: Windows

Operating system used: Windows

Answers

  • Yasmine_T
    Yasmine_T Registered Posts: 71 ✭✭✭✭

    Hi:)

    I hope that you are doing well. Thank you very much for reaching out.

    In your code we can see the following

    df = pd.read_excel(io.BytesIO(file_content), header=2)
    

    If you don't specify an engine to be used above, the xldr is used by default (see: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html)

    Note that xlrd removed support for anything other than .xls files from version 2.0 (docs), hence you will need to use xlrd <1.2.0 in your code to be able to read xlsx files or you will need to specify openpyxl engine to be used.

    So the error you are seeing is :

    "Error in Python process: At line 29: <class 'ImportError'>: Missing optional dependency 'xlrd'. Install xlrd >= 2.0.1 for xls Excel support Use pip or conda to install xlrd."

    From a first look and without more background on your installation this error is pretty straight forward you are simply missing a dependency since in your code you are not specifying any engine, xlrd is used (hence why it refers to xls excel in the error message as well)

    Here i would recommend you to specify the engine to use to be openpyxl and install it in your code environment. You can do so following previous guidance:

    Best,

    Yasmine

Setup Info
    Tags
      Help me…