How to unzip files in a remote Managed Folder (as Azure File Share) using python?

Luis Paulo Registered Posts: 6 ✭✭✭✭

I have zip files in a remote Managed Folder (as Azure File Share) and I need to unzip these files using Dataiku (Python) in this same folder to proceed with the flow.

However, I was not able to do it by following the examples shown in the API documentation (Managed folders — Dataiku DSS 12 documentation).

Is it possible to do it using Dataiku?

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
    edited July 17

    Hi @Luis_PB,
    Can you share your code snippet/error?

    How large is each file?

    Will it fit into memory, or would you need to stage the files somewhere like a temporary directory first, then unzip and re-upload them?

    https://developer.dataiku.com/latest/concepts-and-examples/managed-folders.html#load-a-model-from-a-remote-managed-folder

    Here is one example:

    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    import zipfile
    
    # Read recipe inputs
    input_folder = dataiku.Folder("UsKtZnWF")
    
    # Replace the input/output folder IDs with your own; here the same folder is used for both
    output_folder = dataiku.Folder("UsKtZnWF", ignore_flow=True) # Replace with the ID of the output folder
    
    # Specify the name of the zip file you want to unarchive
    zip_file_name = "archive.zip"
    
    # Check if the specified zip file exists in the input folder
    if zip_file_name in input_folder.list_paths_in_partition():
        with input_folder.get_download_stream(zip_file_name) as file_stream:
            with zipfile.ZipFile(file_stream) as zip_file:
                for file_name in zip_file.namelist():
                    with zip_file.open(file_name) as extracted_file:
                        output_folder.upload_stream(file_name, extracted_file)
        print(f"Unzipping {zip_file_name} and re-uploading files complete.")
    else:
        print(f"{zip_file_name} not found in the input folder.")
    

  • Luis Paulo
    Luis Paulo Registered Posts: 6 ✭✭✭✭
    edited July 17

    Hi @AlexT,

    Thank you for the answer!

    Before posting here, I tried to follow the instructions that you shared (Managed folders - Dataiku Developer Guide). However, I got this error, both with that example and with yours:

    zip error.PNG

    As far as I was able to track the error, this piece of code returns a "urllib3.response.HTTPResponse" object, which "zipfile.ZipFile()" does not accept:

    with input_folder.get_download_stream(zip_file_name) as file_stream:

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker

    zipfile.ZipFile needs a seekable file object, so it doesn't work directly on the download stream. You need to buffer the content first, using io.BytesIO or a temporary file.
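
The seekability requirement can be reproduced without Dataiku. The following is a minimal sketch, not code from the Dataiku API: `NonSeekableStream` is a hypothetical stand-in for the HTTP response returned by `get_download_stream`, and `buffer_in_memory` and `buffer_on_disk` are illustrative helper names. The first suits archives that fit in memory; the second stages the data in a temporary file for larger ones.

```python
import io
import shutil
import tempfile
import zipfile


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for a download stream that can only be read forward."""

    def __init__(self, data):
        self._source = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        return self._source.readinto(b)


def buffer_in_memory(stream):
    """Read the whole stream into a seekable in-memory buffer."""
    return io.BytesIO(stream.read())


def buffer_on_disk(stream):
    """Copy the stream into a seekable temporary file (deleted when closed)."""
    tmp = tempfile.TemporaryFile()
    shutil.copyfileobj(stream, tmp)
    tmp.seek(0)  # rewind so zipfile can read from the start
    return tmp
```

Either return value can be passed straight to `zipfile.ZipFile(...)`. The caller is responsible for closing the temporary file returned by `buffer_on_disk`.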

  • Luis Paulo
    Luis Paulo Registered Posts: 6 ✭✭✭✭

    Hi @AlexT,

    Thanks for the tip. I solved the error by adding this line of code:

    import io

    with input_folder.get_download_stream(zip_file_name) as file_stream:
        # Buffer the archive in memory so zipfile gets a seekable object
        f = io.BytesIO(file_stream.read())
    with zipfile.ZipFile(f) as zip_file:

    Thanks again for the help!
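
Putting the pieces of this thread together, the full unzip-and-re-upload step might look like the sketch below. It assumes the two arguments behave like `dataiku.Folder` objects, using only the `get_download_stream` and `upload_stream` calls that appear above. The function name `unzip_to_folder` is illustrative, and skipping directory entries is an addition that the original example did not include.

```python
import io
import zipfile


def unzip_to_folder(input_folder, output_folder, zip_file_name):
    """Extract every file of a zip stored in input_folder into output_folder.

    Both folders are assumed to provide get_download_stream(path) and
    upload_stream(path, file_obj), as dataiku.Folder does in this thread.
    Returns the list of member names that were uploaded.
    """
    # Read the whole archive into memory: zipfile needs a seekable object.
    with input_folder.get_download_stream(zip_file_name) as file_stream:
        buffer = io.BytesIO(file_stream.read())

    uploaded = []
    with zipfile.ZipFile(buffer) as zip_file:
        for info in zip_file.infolist():
            if info.is_dir():
                continue  # directory entries carry no data to upload
            with zip_file.open(info) as extracted_file:
                output_folder.upload_stream(info.filename, extracted_file)
            uploaded.append(info.filename)
    return uploaded
```

In a recipe this would be called with the folder objects from the earlier example, for instance `unzip_to_folder(input_folder, output_folder, "archive.zip")`. It may also be worth checking whether `list_paths_in_partition()` returns paths with a leading `/` in your DSS version, since an existence check against a bare file name would then need normalizing. This approach holds the whole archive in memory, so very large files may call for a temporary file instead.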
