Manipulate Excel File contained in Sharepoint Managed Folder in a Python Recipe using openpyxl

Thai_Duong
Level 2
Manipulate Excel File contained in Sharepoint Managed Folder in a Python Recipe using openpyxl

Hi Everyone ! ๐Ÿฆ

I actually try to exctract a Excel file from a Managed Folder and to manupulate it with openpyxl to modify it and then upload it again. The special thing with this Managed Folder is that it's shared with SharePoint. So when i try to call  the Excel file in my notebook i can't call any path. 

in fact I'd like to use the .load_Workbook() method from openpyxl to manipulate it, but it needs a path .. 

I tried several things with the download_stream() method of the Dataiku.Folder() Class  but i can't transform the stream back in Excel. 

Does anyone have an idea how i could perform this ? 
Thanks ! ๐Ÿค 

0 Kudos
4 Replies
ZachM
Dataiker

Hi @Thai_Duong,

The load_workbook() function also accepts a file object, but it requires the file object to support the seek method. Managed folders don't directly support seeking, but we can work around it by loading the file into memory first.

The following example opens a spreadsheet from a managed folder, then saves it back to the folder:

import io
import shutil

import dataiku
import openpyxl

folder = dataiku.Folder("YOUR_FOLDER")

# Open an Excel file from the managed folder
with folder.get_download_stream("SPREADSHEET.xlsx") as file:
    with io.BytesIO() as seekable:
        shutil.copyfileobj(file, seekable)
        wb = openpyxl.load_workbook(seekable)

# Save the workbook back to the managed folder
with io.BytesIO() as seekable:
    wb.save(seekable)
    seekable.seek(0)
    folder.upload_stream("NEW_SPREADSHEET.xlsx", seekable)

 

Thanks,

Zach

Thai_Duong
Level 2
Author

Hey Zac, 
Thanks for your reply !
Just tried your code and i have the Following error 

 

---------------------------------------------------------------------------
BadZipFile                                Traceback (most recent call last)
<ipython-input-16-48f6c83b7a08> in <module>
     10     with io.BytesIO() as seekable:
     11         shutil.copyfileobj(file, seekable)
---> 12         writer = pd.ExcelWriter(seekable, engine='openpyxl', mode='a')
     13         writer.book = openpyxl.load_workbook(seekable)

/applis/19499-datah/DSS_DATA/code-envs/python/JeanPatarot_env/lib/python3.6/site-packages/pandas/io/excel/_openpyxl.py in __init__(self, path, engine, mode, **engine_kwargs)
     23             from openpyxl import load_workbook
     24 
---> 25             book = load_workbook(self.path)
     26             self.book = book
     27         else:

/applis/19499-datah/DSS_DATA/code-envs/python/JeanPatarot_env/lib/python3.6/site-packages/openpyxl/reader/excel.py in load_workbook(filename, read_only, keep_vba, data_only, keep_links)
    314     """
    315     reader = ExcelReader(filename, read_only, keep_vba,
--> 316                         data_only, keep_links)
    317     reader.read()
    318     return reader.wb

/applis/19499-datah/DSS_DATA/code-envs/python/JeanPatarot_env/lib/python3.6/site-packages/openpyxl/reader/excel.py in __init__(self, fn, read_only, keep_vba, data_only, keep_links)
    122     def __init__(self,  fn, read_only=False, keep_vba=KEEP_VBA,
    123                   data_only=False, keep_links=True):
--> 124         self.archive = _validate_archive(fn)
    125         self.valid_files = self.archive.namelist()
    126         self.read_only = read_only

/applis/19499-datah/DSS_DATA/code-envs/python/JeanPatarot_env/lib/python3.6/site-packages/openpyxl/reader/excel.py in _validate_archive(filename)
     94             raise InvalidFileException(msg)
     95 
---> 96     archive = ZipFile(filename, 'r')
     97     return archive
     98 

/usr/lib64/python3.6/zipfile.py in __init__(self, file, mode, compression, allowZip64)
   1129         try:
   1130             if mode == 'r':
-> 1131                 self._RealGetContents()
   1132             elif mode in ('w', 'x'):
   1133                 # set the modified flag so central directory gets written

/usr/lib64/python3.6/zipfile.py in _RealGetContents(self)
   1196             raise BadZipFile("File is not a zip file")
   1197         if not endrec:
-> 1198             raise BadZipFile("File is not a zip file")
   1199         if self.debug > 1:
   1200             print(endrec)

BadZipFile: File is not a zip file

 

Here i tried with the pandas Excel Writer, but even just with the load_workbook method, i had the same error. 

0 Kudos
ZachM
Dataiker

Hi @Thai_Duong,

openpyxl raises this error if it doesn't recognize the file format. Are you sure that your Excel file is a XLSX/XLSM/XLTX/XLTM file?

Note that openpyxl doesn't support the older XLS format.

0 Kudos
Thai_Duong
Level 2
Author

Hi @ZachM !
Yes i'm sure, I really don't know where the error come from, i'll ask my Dataiku admins. Maybe it's related to the SharePoint folder.

0 Kudos

Setup info

?
Tags (1)
A banner prompting to get Dataiku