How do I read a parquet file from Managed Folder using Python API?

jsjaramillo
Level 1
How do I read a parquet file from Managed Folder using Python API?

import dataiku
import pandas as pd
import io

# Open sample file
input_managed_folder_id = "xxxxxx"
input_folder = dataiku.Folder(input_managed_folder_id)
input_file_name = 'lab_samples.parquet'
file_stream = input_folder.get_download_stream(input_file_name)
file_bytes = io.BytesIO(file_stream.read())
lab_samples = pd.read_parquet(file_bytes, engine='fastparquet')

lab_samples.head()

jsjaramillo
Level 1
Author

I figured it out after 5 hours of research... The parquet file I had originally created was corrupted, because I did not use the correct code to create it. Below is how you can create a parquet file in a Managed Folder (S3 or Azure Blob container).


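import dataiku
import io

output_file_name = 'lab_samples.parquet'

# Create file
output_managed_folder_id = "xxxxx" # Managed Folder
output_folder = dataiku.Folder(output_managed_folder_id)

# Serialize the DataFrame (df) to parquet in an in-memory buffer
f = io.BytesIO()
df.to_parquet(f)

# Rewind the buffer so the upload starts from the first byte;
# upload_stream takes a file-like object
f.seek(0)
output_folder.upload_stream(output_file_name, f)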

See the last example on the page:

https://pandas.pydata.org/pandas-docs/version/1.1/reference/api/pandas.DataFrame.to_parquet.html
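
As a side note on the f.seek(0) step in the code above: after a write, the cursor of an in-memory buffer sits at the end, so reading without rewinding returns nothing. The sketch below illustrates this with the standard library only; the plain buf.write(...) call and its payload are stand-ins for df.to_parquet(f), not real parquet data.

```python
import io

buf = io.BytesIO()
# Stand-in for df.to_parquet(buf): any write leaves the cursor at the end
buf.write(b"PAR1 example payload PAR1")

at_end = buf.read()      # cursor is at the end, so this reads zero bytes
buf.seek(0)              # rewind to the first byte
after_seek = buf.read()  # now the full content is returned
whole = buf.getvalue()   # full content, regardless of cursor position

print(len(at_end), len(after_seek), len(whole))
```

Uploading the result of a read() done without the rewind would produce an empty file, which is why the seek (or getvalue()) matters before handing the data to the folder.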
This is how you read it:

import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import io

# Open sample file
input_managed_folder_id = "xxxxx"
input_folder = dataiku.Folder(input_managed_folder_id)
input_file_name = 'lab_samples.parquet'
# Read the whole file into memory; the with-block closes the stream afterwards
with input_folder.get_download_stream(input_file_name) as file_stream:
    file_bytes = io.BytesIO(file_stream.read())

lab_samples = pd.read_parquet(file_bytes, engine='pyarrow')

lab_samples.head()
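
Since the root cause in this thread was a corrupted file, a cheap sanity check before parsing may save some debugging time. An unencrypted Parquet file starts and ends with the 4-byte marker PAR1. The helper below is a hypothetical sketch (the name looks_like_parquet is not part of Dataiku or pandas); it only checks those markers and does not validate the file contents.

```python
PARQUET_MAGIC = b"PAR1"

def looks_like_parquet(data: bytes) -> bool:
    """Return True if data starts and ends with the Parquet magic bytes.

    Hypothetical helper: 12 bytes is the smallest possible size
    (leading magic + 4-byte footer length + trailing magic).
    """
    return (
        len(data) >= 12
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )

# Made-up byte strings, used only to exercise the check
fake_ok = PARQUET_MAGIC + b"\x00" * 8 + PARQUET_MAGIC
fake_bad = b"this is not a parquet file"
print(looks_like_parquet(fake_ok), looks_like_parquet(fake_bad))
```

In the reading code above, the check could be run on the bytes returned by the download stream before they are passed to pd.read_parquet, so that a bad upload fails with a clear message rather than a parser error.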
