Read xlsx. File from managed folder using Python

Hi team,
I'm using a Python recipe in Dataiku to read a specific .xlsx
file from a managed folder, but I'm encountering an error when trying to load the file into a DataFrame.
Here’s a simplified version of my code:
folder = dataiku.Folder("FOLDER_ID") file_list = folder.list_paths_in_partition() last_month_str = "YYYYMM" # Format looks like this pattern = re.compile(r"^/FILENAME (\d{6}).*.xlsx$", re.IGNORECASE) for file_path in file_list: match = pattern.match(file_path) if match and match.group(1) == last_month_str: with folder.get_download_stream(file_path) as stream: file_content = stream.read() df = pd.read_excel(io.BytesIO(file_content), header=2)
Error:
Error in Python process: At line 29: <class 'ImportError'>: Missing optional dependency 'xlrd'. Install xlrd >= 2.0.1 for xls Excel support Use pip or conda to install xlrd.
Thanks in advance
Shirin
Operating system used: Windows
Operating system used: Windows
Answers
-
Hi:)
I hope that you are doing well. Thank you very much for reaching out.
In your code we can see the following
df = pd.read_excel(io.BytesIO(file_content), header=2)
If you don't specify an engine to be used above, the xldr is used by default (see: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html)
Note that xlrd removed support for anything other than .xls files from version 2.0 (docs), hence you will need to use xlrd <1.2.0 in your code to be able to read xlsx files or you will need to specify openpyxl engine to be used.
So the error you are seeing is :
"Error in Python process: At line 29: <class 'ImportError'>: Missing optional dependency 'xlrd'. Install xlrd >= 2.0.1 for xls Excel support Use pip or conda to install xlrd."
From a first look and without more background on your installation this error is pretty straight forward you are simply missing a dependency since in your code you are not specifying any engine, xlrd is used (hence why it refers to xls excel in the error message as well)
Here i would recommend you to specify the engine to use to be openpyxl and install it in your code environment. You can do so following previous guidance:
Best,
Yasmine