Upload Parquet file to Dataiku Managed Folder using Python API
Hi,
We've been trying to upload a Parquet file to a Dataiku managed folder, but we keep hitting a UnicodeDecodeError. Uploading CSV files works as expected; only Parquet files fail. My use case requires Parquet files. Is there a way to upload them?
Below is the screenshot of the error.
Best,
Sagar
Best Answer
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi,
Could you confirm how you are trying to copy the Parquet file to the managed folder, perhaps by sharing a code snippet?
Using get_download_stream and upload_stream should work if the input file is already in Parquet format.
Here is a simple example of copying a parquet file.
# -*- coding: utf-8 -*-
import dataiku

# Read recipe inputs (folder IDs are specific to the project)
input_folder = dataiku.Folder("uWgyw8kG")
output_folder = dataiku.Folder("Ni8VoNvi")

filename = "userdata1.parquet"

# Stream the file's raw bytes from the input folder to the output folder
with input_folder.get_download_stream(filename) as parquet_file:
    output_folder.upload_stream("userdata1_copied.parquet", parquet_file)
If you need to convert a dataset to Parquet, you can use something like the following. Note that you will need to add pyarrow to your code env.
import io

import dataiku

# Read recipe inputs
orders = dataiku.Dataset("orders")
orders_df = orders.get_dataframe()

# Define managed folder output
output_folder = dataiku.Folder("uWgyw8kG")
output_filename = "orders.parquet"

# Convert to Parquet in memory
f = io.BytesIO()
orders_df.to_parquet(f)

# Write output
output_folder.upload_data(output_filename, f.getvalue())
Let me know if this helps; if not, please share the code that is generating the error.
Answers
-
Hi,
Can you attach the Python code used to do the upload? In particular, how is the source Parquet file opened or fetched, and which upload method is used?
-
Thanks @AlexT and @fchataigner2 for the response.
@AlexT, your approach worked for me. Since my file was already in Parquet format, I used upload_stream to upload it to the managed folder. Thanks.
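For anyone landing here with the same error: the code that raised the UnicodeDecodeError was never posted, so the cause is not confirmed. One common way to get it, though, is reading a Parquet file in text mode, which tries to decode binary content as UTF-8. The sketch below is an illustration under that assumption. It runs without DSS, and the payload is a made-up stand-in (the "PAR1" magic bytes around a few non-UTF-8 bytes), not a real Parquet file.

```python
import os
import tempfile

# Stand-in payload (an assumption for illustration, not a real Parquet file):
# the "PAR1" magic bytes around bytes that are invalid UTF-8.
payload = b"PAR1" + bytes([0x15, 0x00, 0xFF, 0xFE]) + b"PAR1"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.parquet")
    with open(path, "wb") as f:
        f.write(payload)

    # Text mode decodes the bytes as UTF-8 and fails on binary content.
    text_mode_error = None
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode returns the bytes untouched.
    with open(path, "rb") as f:
        binary_data = f.read()

print(type(text_mode_error).__name__)
print(binary_data == payload)
```

Inside DSS, the same idea would mean opening a local file with open(path, "rb") and passing that handle to upload_stream, so the bytes are never decoded as text.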