Upload Parquet file to Dataiku Managed Folder using Python API

sagar_dubey
sagar_dubey Partner, Registered Posts: 17 Partner

Hi,

We have been trying to upload a Parquet file to a Dataiku managed folder, but the upload fails with a UnicodeDecodeError. Uploading CSV files works as expected, but Parquet files do not. My use case requires Parquet files. Is there a way to upload them?

Below is the screenshot of the error.

Best,

Sagar

image.png

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    edited July 17 Answer ✓

    Hi,

    Could you confirm how you are trying to copy the Parquet file to the managed folder, ideally by sharing a code snippet?

    Using get_download_stream and upload_stream should work if the input file is already in Parquet format, because the file is passed along as a stream of raw bytes and never decoded as text.
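
One plausible cause of a UnicodeDecodeError here (an assumption, since the failing code was not shared) is that the Parquet file is being read as text somewhere along the way. Parquet is a binary format, so decoding its bytes as UTF-8 can fail, while a CSV file decodes fine. The sketch below reproduces that difference with the standard library only; `fake_parquet`, `decode_as_text`, and `copy_as_bytes` are hypothetical names for illustration and are not part of the Dataiku API.

```python
import io
import shutil

# Hypothetical stand-in for Parquet content: the "PAR1" marker around a
# payload that contains bytes which are not valid UTF-8.
fake_parquet = b"PAR1" + bytes([0xFF, 0xFE, 0x00, 0x15]) + b"PAR1"


def decode_as_text(data: bytes) -> str:
    """Mimic a text-mode read, which decodes the bytes as UTF-8."""
    return data.decode("utf-8")


def copy_as_bytes(source, target) -> None:
    """Copy a binary stream chunk by chunk without decoding it."""
    shutil.copyfileobj(source, target)


# Treating the binary content as text fails...
try:
    decode_as_text(fake_parquet)
    text_read_failed = False
except UnicodeDecodeError:
    text_read_failed = True

# ...while a byte-for-byte stream copy leaves the content intact.
copied = io.BytesIO()
copy_as_bytes(io.BytesIO(fake_parquet), copied)
```

If the local file is opened with `open(path, "rb")` rather than in text mode, the same byte-for-byte behavior applies.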

    Here is a simple example of copying a parquet file.

    # -*- coding: utf-8 -*-
    import dataiku
    
    # Handles on the input and output managed folders (IDs are specific to this project)
    input_folder = dataiku.Folder("uWgyw8kG")
    output_folder = dataiku.Folder("Ni8VoNvi")
    filename = "userdata1.parquet"
    
    # Stream the file as raw bytes from the input folder into the output folder;
    # the context manager closes the download stream once the copy is done
    with input_folder.get_download_stream(filename) as parquet_file:
        output_folder.upload_stream("userdata1_copied.parquet", parquet_file)

    If you need to convert a dataset to Parquet instead, you can use something like the following. Note that this requires adding pyarrow to your code env.

    import io
    
    import dataiku
    
    # Read the recipe input as a pandas DataFrame
    orders = dataiku.Dataset("orders")
    orders_df = orders.get_dataframe()
    
    # Define the managed folder output
    output_folder = dataiku.Folder("uWgyw8kG")
    output_filename = "orders.parquet"
    
    # Serialize the DataFrame to Parquet in an in-memory buffer (requires pyarrow)
    f = io.BytesIO()
    orders_df.to_parquet(f)
    
    # Write the Parquet bytes to the managed folder
    output_folder.upload_data(output_filename, f.getvalue())
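
Before uploading, it can help to confirm that the buffer really holds Parquet bytes. A Parquet file starts and ends with the 4-byte marker `PAR1`, so a cheap sanity check is possible without extra libraries. This is a minimal sketch: `looks_like_parquet` is a hypothetical helper, it only inspects the markers rather than validating the file, and it does not cover files written with an encrypted footer.

```python
PARQUET_MAGIC = b"PAR1"

# Smallest possible layout: leading marker, 4-byte footer length, trailing marker.
MIN_PARQUET_SIZE = 12


def looks_like_parquet(data: bytes) -> bool:
    """Return True if the bytes start and end with the Parquet marker."""
    return (
        len(data) >= MIN_PARQUET_SIZE
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )


# Illustrative inputs only; real content would come from f.getvalue().
csv_bytes = b"id,name\n1,alice\n"
marked_bytes = PARQUET_MAGIC + b"\x00" * 8 + PARQUET_MAGIC

csv_result = looks_like_parquet(csv_bytes)
marked_result = looks_like_parquet(marked_bytes)
```

In the recipe above, the check could be run as `looks_like_parquet(f.getvalue())` just before the call to `upload_data`.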
    

    Let me know if this helps. If not, please share the code that is generating the error.
