Manipulate Excel File contained in Sharepoint Managed Folder in a Python Recipe using openpyxl

Thai_Duong
Thai_Duong Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 3

Hi Everyone !

I actually try to exctract a Excel file from a Managed Folder and to manupulate it with openpyxl to modify it and then upload it again. The special thing with this Managed Folder is that it's shared with SharePoint. So when i try to call the Excel file in my notebook i can't call any path.

in fact I'd like to use the .load_Workbook() method from openpyxl to manipulate it, but it needs a path ..

I tried several things with the download_stream() method of the Dataiku.Folder() Class but i can't transform the stream back in Excel.

Does anyone have an idea how i could perform this ?
Thanks !

Answers

  • Zach
    Zach Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 153 Dataiker
    edited July 17

    Hi @Thai_Duong
    ,

    The load_workbook() function also accepts a file object, but it requires the file object to support the seek method. Managed folders don't directly support seeking, but we can work around it by loading the file into memory first.

    The following example opens a spreadsheet from a managed folder, then saves it back to the folder:

    import io
    import shutil
    
    import dataiku
    import openpyxl
    
    folder = dataiku.Folder("YOUR_FOLDER")
    
    # Open an Excel file from the managed folder
    with folder.get_download_stream("SPREADSHEET.xlsx") as file:
        with io.BytesIO() as seekable:
            shutil.copyfileobj(file, seekable)
            wb = openpyxl.load_workbook(seekable)
    
    # Save the workbook back to the managed folder
    with io.BytesIO() as seekable:
        wb.save(seekable)
        seekable.seek(0)
        folder.upload_stream("NEW_SPREADSHEET.xlsx", seekable)

    Thanks,

    Zach

  • Thai_Duong
    Thai_Duong Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 3
    edited July 17

    Hey Zac,
    Thanks for your reply !
    Just tried your code and i have the Following error

    ---------------------------------------------------------------------------
    BadZipFile                                Traceback (most recent call last)
    <ipython-input-16-48f6c83b7a08> in <module>
         10     with io.BytesIO() as seekable:
         11         shutil.copyfileobj(file, seekable)
    ---> 12         writer = pd.ExcelWriter(seekable, engine='openpyxl', mode='a')
         13         writer.book = openpyxl.load_workbook(seekable)
    
    /applis/19499-datah/DSS_DATA/code-envs/python/JeanPatarot_env/lib/python3.6/site-packages/pandas/io/excel/_openpyxl.py in __init__(self, path, engine, mode, **engine_kwargs)
         23             from openpyxl import load_workbook
         24 
    ---> 25             book = load_workbook(self.path)
         26             self.book = book
         27         else:
    
    /applis/19499-datah/DSS_DATA/code-envs/python/JeanPatarot_env/lib/python3.6/site-packages/openpyxl/reader/excel.py in load_workbook(filename, read_only, keep_vba, data_only, keep_links)
        314     """
        315     reader = ExcelReader(filename, read_only, keep_vba,
    --> 316                         data_only, keep_links)
        317     reader.read()
        318     return reader.wb
    
    /applis/19499-datah/DSS_DATA/code-envs/python/JeanPatarot_env/lib/python3.6/site-packages/openpyxl/reader/excel.py in __init__(self, fn, read_only, keep_vba, data_only, keep_links)
        122     def __init__(self,  fn, read_only=False, keep_vba=KEEP_VBA,
        123                   data_only=False, keep_links=True):
    --> 124         self.archive = _validate_archive(fn)
        125         self.valid_files = self.archive.namelist()
        126         self.read_only = read_only
    
    /applis/19499-datah/DSS_DATA/code-envs/python/JeanPatarot_env/lib/python3.6/site-packages/openpyxl/reader/excel.py in _validate_archive(filename)
         94             raise InvalidFileException(msg)
         95 
    ---> 96     archive = ZipFile(filename, 'r')
         97     return archive
         98 
    
    /usr/lib64/python3.6/zipfile.py in __init__(self, file, mode, compression, allowZip64)
       1129         try:
       1130             if mode == 'r':
    -> 1131                 self._RealGetContents()
       1132             elif mode in ('w', 'x'):
       1133                 # set the modified flag so central directory gets written
    
    /usr/lib64/python3.6/zipfile.py in _RealGetContents(self)
       1196             raise BadZipFile("File is not a zip file")
       1197         if not endrec:
    -> 1198             raise BadZipFile("File is not a zip file")
       1199         if self.debug > 1:
       1200             print(endrec)
    
    BadZipFile: File is not a zip file

    Here i tried with the pandas Excel Writer, but even just with the load_workbook method, i had the same error.

  • Zach
    Zach Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 153 Dataiker

    Hi @Thai_Duong
    ,

    openpyxl raises this error if it doesn't recognize the file format. Are you sure that your Excel file is a XLSX/XLSM/XLTX/XLTM file?

    Note that openpyxl doesn't support the older XLS format.

  • Thai_Duong
    Thai_Duong Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 3

    Hi @ZachM
    !
    Yes i'm sure, I really don't know where the error come from, i'll ask my Dataiku admins. Maybe it's related to the SharePoint folder.

Setup Info
    Tags
      Help me…