Upload Parquet file to Dataiku Managed Folder using Python API

Solved!
sagar_dubey
Level 1
Hi,

We've been trying to upload a Parquet file to a Dataiku managed folder, but we are getting a UnicodeDecodeError. Uploading CSV files works as expected, but Parquet files fail. My use case requires Parquet files. Is there any way to upload them?

Below is the screenshot of the error.

Best,

Sagar

image.png

1 Solution
AlexT
Dataiker

Hi,

Could you confirm how you are trying to copy the Parquet file to the managed folder? Sharing a code snippet would help.

Using get_download_stream and upload_stream should work if the input file is already in Parquet format.

Here is a simple example of copying a Parquet file from one managed folder to another.

# -*- coding: utf-8 -*-
import dataiku

# Managed folders used as recipe input and output (referenced by folder ID)
input_folder = dataiku.Folder("uWgyw8kG")
output_folder = dataiku.Folder("Ni8VoNvi")
filename = "userdata1.parquet"

# Stream the raw bytes from the input folder straight into the output folder;
# the with block closes the download stream once the upload is done
with input_folder.get_download_stream(filename) as parquet_stream:
    output_folder.upload_stream("userdata1_copied.parquet", parquet_stream)
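
A common cause of a UnicodeDecodeError with Parquet (an assumption here, since the failing code was not shared) is opening the file in text mode, which works for CSV but fails on binary content. The rough sketch below shows the difference using only the standard library. `FakeFolder` is a hypothetical in-memory stand-in for dataiku.Folder so the snippet runs outside DSS, and `SAMPLE_BYTES` is illustrative data, not a real Parquet file.

```python
import os
import tempfile

# Illustrative bytes only: Parquet files start and end with the magic b"PAR1",
# and the 0xFF byte in between can never appear in valid UTF-8.
SAMPLE_BYTES = b"PAR1" + bytes([0x15, 0x00, 0xFF, 0xFE]) + b"PAR1"


def can_read_as_text(path):
    """Return True if the file decodes as UTF-8 text, False otherwise."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            handle.read()
        return True
    except UnicodeDecodeError:
        return False


def upload_binary(folder, local_path, target_name):
    """Upload a local file through a binary stream.

    `folder` is assumed to expose upload_stream(path, stream),
    as dataiku.Folder does.
    """
    with open(local_path, "rb") as stream:
        folder.upload_stream(target_name, stream)


class FakeFolder:
    """Hypothetical in-memory stand-in for dataiku.Folder (demo only)."""

    def __init__(self):
        self.files = {}

    def upload_stream(self, path, stream):
        # Store the raw bytes exactly as they were read from the stream
        self.files[path] = stream.read()


def demo():
    """Write the sample bytes to disk, upload them, and report both results."""
    with tempfile.TemporaryDirectory() as tmp:
        local_path = os.path.join(tmp, "sample.parquet")
        with open(local_path, "wb") as handle:
            handle.write(SAMPLE_BYTES)
        folder = FakeFolder()
        upload_binary(folder, local_path, "sample.parquet")
        return can_read_as_text(local_path), folder.files["sample.parquet"]
```

In a real recipe you would pass a dataiku.Folder instead of `FakeFolder`. The point is only that the stream handed to upload_stream is opened with "rb", so no decoding is attempted.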



 

If you need to convert a dataset to Parquet, you can use something like the following. Note that you will need to add pyarrow to your code env.

import io

import dataiku

# Read recipe inputs
orders = dataiku.Dataset("orders")
orders_df = orders.get_dataframe()

# Define managed folder output
output_folder = dataiku.Folder("uWgyw8kG")
output_filename = "orders.parquet"

# Convert to Parquet in memory (requires pyarrow in the code env)
f = io.BytesIO()
orders_df.to_parquet(f)

# Write the Parquet bytes to the managed folder
output_folder.upload_data(output_filename, f.getvalue())
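
Before calling upload_data, a cheap sanity check on the buffer can catch cases where something other than Parquet bytes ended up in it. The sketch below relies on the fact that an unencrypted Parquet file starts and ends with the 4-byte magic `PAR1`. The helper name `looks_like_parquet` is hypothetical, and this is a quick heuristic, not a validation of the file's contents.

```python
PARQUET_MAGIC = b"PAR1"

# Smallest plausible layout: leading magic (4 bytes), footer length (4 bytes),
# trailing magic (4 bytes). Real files are larger because of the footer metadata.
MIN_PARQUET_SIZE = 12


def looks_like_parquet(data):
    """Return True if the bytes start and end with the Parquet magic number."""
    return (
        len(data) >= MIN_PARQUET_SIZE
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )
```

With the example above, this could be used as `looks_like_parquet(f.getvalue())` just before the upload.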

 

Let me know if this helps, or please share the code that is generating the error.

3 Replies
fchataigner2
Dataiker

Hi,

Can you attach the Python code used to do the upload? In particular, how the source Parquet file is opened or fetched, and which upload method is used.

sagar_dubey
Level 1
Author

Thanks @AlexT and @fchataigner2 for the response.

@AlexT your approach worked for me. Since my file was already in Parquet format, I used upload_stream to upload it to the managed folder.

Thanks.
