How to unzip files in a remote Managed Folder (as Azure File Share) using python?

Luis_PB
Level 2

I have zip files in a remote Managed Folder (as Azure File Share) and I need to unzip these files using Dataiku (Python) in this same folder to proceed with the flow. 

However, I was not able to do it by following the examples shown in the API documentation (Managed folders - Dataiku DSS 12 documentation).

 

Is it possible to do it using Dataiku?

4 Replies
AlexT
Dataiker

Hi @Luis_PB ,
Can you share your code snippet/error?

How large is each file?

Will it fit into memory, or would you need to stage the files somewhere first, such as a temporary directory, and then unzip and re-upload them?

https://developer.dataiku.com/latest/concepts-and-examples/managed-folders.html#load-a-model-from-a-...

Here is one example:

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import zipfile

# Read recipe inputs
input_folder = dataiku.Folder("UsKtZnWF")

# Replace the input/output folder IDs with your own; in this example the same folder is used for both
output_folder = dataiku.Folder("UsKtZnWF", ignore_flow=True) # Replace with the ID of the output folder

# Specify the name of the zip file you want to unarchive
zip_file_name = "archive.zip"

# Check if the specified zip file exists in the input folder
if zip_file_name in input_folder.list_paths_in_partition():
    with input_folder.get_download_stream(zip_file_name) as file_stream:
        with zipfile.ZipFile(file_stream) as zip_file:
            for file_name in zip_file.namelist():
                with zip_file.open(file_name) as extracted_file:
                    output_folder.upload_stream(file_name, extracted_file)
    print(f"Unzipping {zip_file_name} and re-uploading files complete.")
else:
    print(f"{zip_file_name} not found in the input folder.")

 

Luis_PB
Level 2
Author

Hi @AlexT

Thank you for the answer!

Before posting here, I had tried to follow the instructions that you shared (Managed folders - Dataiku Developer Guide). However, I got the error below, and I get it with your example too:

zip error.PNG

As far as I was able to trace the error, this piece of code returns a "urllib3.response.HTTPResponse" object, which "zipfile.ZipFile()" does not accept.

 

with input_folder.get_download_stream(zip_file_name) as file_stream:

 

 

 

AlexT
Dataiker

zipfile.ZipFile needs a seekable file object, so it doesn't work directly on the download stream. You need to copy the data into an io.BytesIO buffer or a temporary file first.
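
To illustrate the point outside of Dataiku, here is a minimal sketch of both workarounds. `NonSeekableStream` is a hypothetical stand-in for the download stream (it is not a Dataiku or urllib3 class), and the helper names are made up for this example. The in-memory variant suits archives that fit in RAM; the temporary-file variant is probably the safer choice for large ones.

```python
import contextlib
import io
import shutil
import tempfile
import zipfile


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for a download stream that can only be read forward."""

    def __init__(self, data):
        self._buffer = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        return self._buffer.readinto(b)


def make_zip_bytes(files):
    """Build a small zip archive in memory from a {name: text} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def open_zip_in_memory(stream):
    """Read the whole stream into a BytesIO buffer, which is seekable."""
    return zipfile.ZipFile(io.BytesIO(stream.read()))


@contextlib.contextmanager
def open_zip_via_tempfile(stream):
    """Copy the stream to a temporary file on disk, then open it as a zip."""
    with tempfile.TemporaryFile() as tmp:
        shutil.copyfileobj(stream, tmp)
        tmp.seek(0)
        with zipfile.ZipFile(tmp) as archive:
            yield archive


payload = make_zip_bytes({"a.txt": "hello", "b.txt": "world"})

# Opening the forward-only stream directly fails because zipfile must seek
# to the end of the file to find the central directory.
try:
    zipfile.ZipFile(NonSeekableStream(payload))
    direct_open_failed = False
except (zipfile.BadZipFile, OSError):
    direct_open_failed = True

with open_zip_in_memory(NonSeekableStream(payload)) as archive:
    names_in_memory = archive.namelist()

with open_zip_via_tempfile(NonSeekableStream(payload)) as archive:
    names_via_tempfile = archive.namelist()
```

The exact exception raised by the direct attempt may differ depending on the stream type and Python version, which is why the sketch catches more than one.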

Luis_PB
Level 2
Author

Hi @AlexT

Thanks for the tip. I solved the error by adding this line of code, which reads the stream into an in-memory buffer:

with input_folder.get_download_stream(zip_file_name) as file_stream:
    f = io.BytesIO(file_stream.read())
    with zipfile.ZipFile(f) as zip_file:
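
Combined with the loop from the earlier example, the whole fix might look like the sketch below. The function name `unzip_in_folder` and the `InMemoryFolder` class are hypothetical; the latter only imitates the two folder methods used in this thread (`get_download_stream` and `upload_stream`) so that the logic can be run without a Dataiku instance. It assumes the archive fits in memory.

```python
import io
import zipfile


def unzip_in_folder(input_folder, output_folder, zip_file_name):
    """Extract every file in zip_file_name from input_folder into output_folder."""
    # Buffer the whole archive in memory so that zipfile can seek within it.
    with input_folder.get_download_stream(zip_file_name) as file_stream:
        buffer = io.BytesIO(file_stream.read())

    extracted = []
    with zipfile.ZipFile(buffer) as zip_file:
        for info in zip_file.infolist():
            if info.is_dir():
                continue  # directory entries carry no data to upload
            with zip_file.open(info) as extracted_file:
                output_folder.upload_stream(info.filename, extracted_file)
            extracted.append(info.filename)
    return extracted


class InMemoryFolder:
    """Hypothetical stand-in for a managed folder, used only to run the sketch."""

    def __init__(self):
        self.files = {}

    def get_download_stream(self, path):
        return io.BytesIO(self.files[path])

    def upload_stream(self, path, f):
        self.files[path] = f.read()


# Build a small archive and store it in the stand-in folder.
folder = InMemoryFolder()
archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w") as archive:
    archive.writestr("data/report.csv", "id,value\n1,42\n")
    archive.writestr("notes.txt", "hello")
folder.files["archive.zip"] = archive_bytes.getvalue()

# The same folder serves as input and output, as in the question.
extracted_names = unzip_in_folder(folder, folder, "archive.zip")
```

In a recipe, the two folder arguments would be the `dataiku.Folder` objects from the earlier example instead of the stand-in.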

 

Thanks again for the help!

