How do I read a parquet file from a Managed Folder using the Python API?
import dataiku
import io
import pandas as pd

# Open sample file
input_managed_folder_id = "xxxxxx"
input_folder = dataiku.Folder(input_managed_folder_id)
input_file_name = 'lab_samples.parquet'
file_stream = input_folder.get_download_stream(input_file_name)
file_bytes = io.BytesIO(file_stream.read())
lab_samples = pd.read_parquet(file_bytes, engine='fastparquet')
lab_samples.head()
Answers
I figured it out after 5 hours of research... The original parquet file I had created was corrupted because I had not written it correctly. Below is how you can create a parquet file in a Managed Folder (S3 or Azure Blob container).
# Create file
output_file_name = 'lab_samples.parquet'
output_managed_folder_id = "xxxxx" # Managed Folder
output_folder = dataiku.Folder(output_managed_folder_id)
f = io.BytesIO()
df.to_parquet(f)
f.seek(0)
content = f.read()
output_folder.upload_stream(output_file_name, content)

See the last example on the page:
https://pandas.pydata.org/pandas-docs/version/1.1/reference/api/pandas.DataFrame.to_parquet.html
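As a side note, if I'm reading the Folder API correctly, upload_stream can also take a file-like object directly, so the intermediate content = f.read() step can be skipped. This is only a sketch under that assumption, reusing the same placeholder folder id and assuming your DataFrame is already named df:

import dataiku
import io

output_managed_folder_id = "xxxxx"  # placeholder Managed Folder id
output_folder = dataiku.Folder(output_managed_folder_id)

f = io.BytesIO()
df.to_parquet(f)   # serialize the DataFrame to parquet in memory
f.seek(0)          # rewind the buffer before handing it to upload_stream
output_folder.upload_stream('lab_samples.parquet', f)  # pass the file-like object directly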
This is how you read it:
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import io

# Open sample file
input_managed_folder_id = "xxxxx"
input_folder = dataiku.Folder(input_managed_folder_id)
input_file_name = 'lab_samples.parquet'
file_stream = input_folder.get_download_stream(input_file_name)
file_bytes = io.BytesIO(file_stream.read())
lab_samples = pd.read_parquet(file_bytes, engine='pyarrow')
lab_samples.head()
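If the read still fails, it helps to first confirm the file actually landed in the folder. A quick check I use (assuming list_paths_in_partition is available on your Dataiku version and returns paths with a leading slash):

# List the contents of the managed folder to confirm the parquet file is there
paths = input_folder.list_paths_in_partition()
print(paths)  # e.g. ['/lab_samples.parquet']
assert '/' + input_file_name in paths, "parquet file not found in the managed folder"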