I have zip files in a remote Managed Folder (an Azure File Share), and I need to unzip them with Dataiku (Python) into this same folder so the flow can continue.
However, I was not able to do it by following the examples shown in the API documentation (Managed folders — Dataiku DSS 12 documentation).
Is it possible to do this using Dataiku?
Hi @Luis_PB ,
Can you share your code snippet/error?
How large is each file?
Will it fit into memory, or would you need to stage the files somewhere like a temp dir first, then unzip and re-upload?
https://developer.dataiku.com/latest/concepts-and-examples/managed-folders.html#load-a-model-from-a-...
Here is one example:
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import zipfile

# Read recipe inputs
input_folder = dataiku.Folder("UsKtZnWF")
# Replace the input/output folder IDs; in this example I am using the same folder for both
output_folder = dataiku.Folder("UsKtZnWF", ignore_flow=True)  # Replace with the ID of the output folder

# Specify the name of the zip file you want to unarchive
zip_file_name = "archive.zip"

# Check if the specified zip file exists in the input folder
if zip_file_name in input_folder.list_paths_in_partition():
    with input_folder.get_download_stream(zip_file_name) as file_stream:
        with zipfile.ZipFile(file_stream) as zip_file:
            for file_name in zip_file.namelist():
                with zip_file.open(file_name) as extracted_file:
                    output_folder.upload_stream(file_name, extracted_file)
    print(f"Unzipping {zip_file_name} and re-uploading files complete.")
else:
    print(f"{zip_file_name} not found in the input folder.")
Hi @AlexT,
Thank you for the answer!
Before posting here, I had tried to follow the instructions that you shared (Managed folders - Dataiku Developer Guide). However, I got this error, and I get it with your example too:
As far as I was able to trace the error, this line of code returns an "urllib3.response.HTTPResponse", which is not what "zipfile.ZipFile()" expects:
with input_folder.get_download_stream(zip_file_name) as file_stream:
zipfile.ZipFile needs a seekable file object, so it doesn't work directly on the download stream. You need to read the stream into a BytesIO buffer or stage it in a temp file first.
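For archives that may not fit in memory, the temp file route could look roughly like the sketch below. It is only an illustration: `iter_zip_members` is a hypothetical helper name, not part of the Dataiku API, and it assumes nothing about the stream beyond a working `read()` method.

```python
import shutil
import tempfile
import zipfile


def iter_zip_members(stream):
    """Yield (name, file_object) for each file in a zip read from a non-seekable stream.

    The stream is first copied chunk by chunk into a temporary file on disk,
    because zipfile.ZipFile has to seek to the end of the archive to find
    its central directory.
    """
    with tempfile.TemporaryFile() as tmp:
        shutil.copyfileobj(stream, tmp)  # chunked copy, does not load everything at once
        tmp.seek(0)
        with zipfile.ZipFile(tmp) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue  # directory entries carry no data
                with archive.open(info) as member:
                    # member is only valid until the next iteration step
                    yield info.filename, member
```

In a recipe, each yielded `member` could then be passed to `output_folder.upload_stream(name, member)` inside a `with input_folder.get_download_stream(zip_file_name) as stream:` block. The trade-off against BytesIO is local disk usage instead of RAM.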
Hi @AlexT,
Thanks for the tip. I solved the error by adding this line of code:
with input_folder.get_download_stream(zip_file_name) as file_stream:
    f = io.BytesIO(file_stream.read())
    with zipfile.ZipFile(f) as zip_file:
Thanks again for the help!
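For anyone landing here later, the pieces from this thread could be put together along these lines. This is a sketch, not an official recipe: `unzip_in_folder` is a hypothetical helper name, it uses only the folder methods already shown above (`list_paths_in_partition`, `get_download_stream`, `upload_stream`), and it assumes the archive fits in memory. Stripping leading slashes before comparing paths is a defensive assumption, in case the folder lists paths as "/archive.zip".

```python
import io
import zipfile


def unzip_in_folder(input_folder, output_folder, zip_file_name):
    """Extract zip_file_name from input_folder and upload its files to output_folder.

    Returns the list of uploaded member names.
    """
    # Assumption: listed paths may carry a leading slash, so compare without it.
    known_paths = {p.lstrip("/") for p in input_folder.list_paths_in_partition()}
    if zip_file_name.lstrip("/") not in known_paths:
        raise FileNotFoundError(f"{zip_file_name} not found in the input folder.")

    # The download stream is not seekable, so read it fully into memory first.
    with input_folder.get_download_stream(zip_file_name) as file_stream:
        buffer = io.BytesIO(file_stream.read())

    uploaded = []
    with zipfile.ZipFile(buffer) as zip_file:
        for info in zip_file.infolist():
            if info.is_dir():
                continue  # skip directory entries
            with zip_file.open(info) as extracted_file:
                output_folder.upload_stream(info.filename, extracted_file)
            uploaded.append(info.filename)
    return uploaded
```

In the recipe from this thread it would be called as `unzip_in_folder(input_folder, output_folder, "archive.zip")`, with both folder objects created through `dataiku.Folder(...)` as in the example above.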