How to unzip files in a remote Managed Folder (as Azure File Share) using python?
I have zip files in a remote Managed Folder (as Azure File Share) and I need to unzip these files using Dataiku (Python) in this same folder to proceed with the flow.
However, I was not able to do it by following the examples shown in the API documentation (the "Managed folders" page of the Dataiku DSS 12 documentation).
Is it possible to do this using Dataiku?
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi @Luis_PB,
Can you share your code snippet and the error? How large is each file?
Will it fit into memory, or would you need to stage the files somewhere like a temp directory first, then unzip and re-upload them?
https://developer.dataiku.com/latest/concepts-and-examples/managed-folders.html#load-a-model-from-a-remote-managed-folder
Here is one example:
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import zipfile

# Read recipe inputs
input_folder = dataiku.Folder("UsKtZnWF")  # Replace the input/output folder IDs; in this example I am using the same folder for both
output_folder = dataiku.Folder("UsKtZnWF", ignore_flow=True)  # Replace with the ID of the output folder

# Specify the name of the zip file you want to unarchive
zip_file_name = "archive.zip"

# Check if the specified zip file exists in the input folder
if zip_file_name in input_folder.list_paths_in_partition():
    with input_folder.get_download_stream(zip_file_name) as file_stream:
        with zipfile.ZipFile(file_stream) as zip_file:
            for file_name in zip_file.namelist():
                with zip_file.open(file_name) as extracted_file:
                    output_folder.upload_stream(file_name, extracted_file)
    print(f"Unzipping {zip_file_name} and re-uploading files complete.")
else:
    print(f"{zip_file_name} not found in the input folder.")
-
Hi @AlexT,
Thank you for the answer!
Before posting here, I tried to follow the instructions that you shared (Managed folders - Dataiku Developer Guide). However, I got this error, and I get it with your example too:
As far as I was able to trace the error, this line of code returns a "urllib3.response.HTTPResponse" object, which "zipfile.ZipFile()" does not expect:
with input_folder.get_download_stream(zip_file_name) as file_stream:
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
ZipFile needs a seekable object, so it doesn't work directly on the download stream. You need to use BytesIO or a temp file.
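To illustrate the point, here is a minimal sketch of copying a read-only stream into something seekable before handing it to ZipFile. The helper names to_seekable and zip_member_names and the use_disk flag are made up for this example and are not part of the Dataiku API; the sketch only assumes the stream has a read() method, as the download stream does.

```python
import io
import shutil
import tempfile
import zipfile


def to_seekable(stream, use_disk=False):
    """Copy a read-only stream into a seekable file object.

    use_disk=False keeps the whole archive in memory (io.BytesIO).
    use_disk=True stages it in an anonymous temp file instead, which
    is the safer choice when the archive may not fit into memory.
    """
    buffer = tempfile.TemporaryFile() if use_disk else io.BytesIO()
    shutil.copyfileobj(stream, buffer)  # copies in chunks, not all at once
    buffer.seek(0)  # rewind so ZipFile reads from the start
    return buffer


def zip_member_names(stream, use_disk=False):
    """Return the names of the entries in a zip read from a plain stream."""
    with to_seekable(stream, use_disk) as buffer:
        with zipfile.ZipFile(buffer) as archive:
            return archive.namelist()
```

In a recipe, the stream would be the object returned by get_download_stream(); for small archives the in-memory variant is usually enough.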
-
Hi @AlexT,
Thanks for the tip. I solved the error by adding this line of code (the second line below, which also needs "import io"):
with input_folder.get_download_stream(zip_file_name) as file_stream:
    f = io.BytesIO(file_stream.read())
    with zipfile.ZipFile(f) as zip_file:
Thanks again for the help!
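For anyone landing on this thread later, the pieces might be assembled roughly as follows. This is only a sketch: unzip_in_folder is a hypothetical helper name, it assumes the archive fits into memory, and it relies only on the folder methods already used in this thread (get_download_stream and upload_stream). Skipping directory entries is an addition of this sketch, not something from the posts above.

```python
import io
import zipfile


def unzip_in_folder(input_folder, output_folder, zip_path):
    """Extract every file in zip_path and upload it to output_folder.

    Both folders only need get_download_stream() and upload_stream().
    Returns the list of uploaded paths.
    """
    # The download stream is not seekable, so buffer it first.
    with input_folder.get_download_stream(zip_path) as stream:
        buffer = io.BytesIO(stream.read())  # whole archive held in memory

    uploaded = []
    with zipfile.ZipFile(buffer) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries have no content to upload
            with archive.open(info) as member:
                output_folder.upload_stream(info.filename, member)
            uploaded.append(info.filename)
    return uploaded


# Possible use inside a DSS Python recipe (folder ID from the example above):
#   import dataiku
#   source = dataiku.Folder("UsKtZnWF")
#   target = dataiku.Folder("UsKtZnWF", ignore_flow=True)
#   unzip_in_folder(source, target, "archive.zip")
```

Passing the folders in as arguments keeps the function easy to try outside DSS with stand-in objects.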