import dataiku
import pandas as pd
import io

# Open sample file from the managed folder
input_managed_folder_id = "xxxxxx"
input_folder = dataiku.Folder(input_managed_folder_id)
input_file_name = 'lab_samples.parquet'
file_stream = input_folder.get_download_stream(input_file_name)
file_bytes = io.BytesIO(file_stream.read())
lab_samples = pd.read_parquet(file_bytes, engine='fastparquet')
lab_samples.head()
I figured it out after 5 hours of research... The original parquet file I had created was corrupted, because I had not written it correctly. Below is how you can create a parquet file in a managed folder (S3 or Azure Blob container).
# Create the parquet file in memory, then upload it to the managed folder
output_managed_folder_id = "xxxxx" # Managed Folder
output_folder = dataiku.Folder(output_managed_folder_id)
output_file_name = 'lab_samples.parquet'
f = io.BytesIO()
df.to_parquet(f)
f.seek(0)
content = f.read()
output_folder.upload_stream(output_file_name, content)
See the last example on the page:
https://pandas.pydata.org/pandas-docs/version/1.1/reference/api/pandas.DataFrame.to_parquet.html
This is how you read it:
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import io
# Open sample file
input_managed_folder_id = "xxxxx"
input_folder = dataiku.Folder(input_managed_folder_id)
input_file_name = 'lab_samples.parquet'
file_stream = input_folder.get_download_stream(input_file_name)
file_bytes = io.BytesIO(file_stream.read())
lab_samples = pd.read_parquet(file_bytes, engine='pyarrow')
lab_samples.head()