How do I read a parquet file from Managed Folder using Python API?

jsjaramillo Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 2 ✭✭✭

import io

import dataiku
import pandas as pd

# Open sample file
input_managed_folder_id = "xxxxxx"
input_folder = dataiku.Folder(input_managed_folder_id)
input_file_name = 'lab_samples.parquet'
file_stream = input_folder.get_download_stream(input_file_name)
file_bytes = io.BytesIO(file_stream.read())
lab_samples = pd.read_parquet(file_bytes, engine='fastparquet')

lab_samples.head()

Answers

  • jsjaramillo Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 2 ✭✭✭

    I figured it out after 5 hours of research: the original Parquet file I had created was corrupted, because I did not use the correct code to write it. Below is how you can create a Parquet file in a managed folder (S3 or Azure Blob container).


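    import io

    import dataiku

    output_file_name = 'lab_samples.parquet'

    # Create file (df is the pandas DataFrame you want to save)
    output_managed_folder_id = "xxxxx" # Managed Folder
    output_folder = dataiku.Folder(output_managed_folder_id)
    f = io.BytesIO()
    df.to_parquet(f)  # serialize the DataFrame into the in-memory buffer
    f.seek(0)  # rewind so the upload starts at the first byte
    output_folder.upload_stream(output_file_name, f)  # upload_stream expects a file-like object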

    See the last example on the page:

    https://pandas.pydata.org/pandas-docs/version/1.1/reference/api/pandas.DataFrame.to_parquet.html

    This is how you read it:

    import io

    import dataiku
    import pandas as pd

    # Open sample file
    input_managed_folder_id = "xxxxx"
    input_folder = dataiku.Folder(input_managed_folder_id)
    input_file_name = 'lab_samples.parquet'
    file_stream = input_folder.get_download_stream(input_file_name)
    file_bytes = io.BytesIO(file_stream.read())

    lab_samples = pd.read_parquet(file_bytes, engine='pyarrow')

    lab_samples.head()
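
Since the root cause in this thread was a corrupted file, a quick sanity check on the raw bytes before calling pd.read_parquet may save some debugging time. The sketch below is not part of the original answer: `looks_like_parquet` is a hypothetical helper name, and it only relies on the fact that a plain (unencrypted) Parquet file starts and ends with the 4-byte marker `PAR1`. Passing the check does not prove the file is valid; failing it strongly suggests the file was not written as Parquet.

```python
PARQUET_MAGIC = b"PAR1"  # marker at both ends of a plain (unencrypted) Parquet file


def looks_like_parquet(data: bytes) -> bool:
    """Hypothetical helper: cheap check that bytes could be a Parquet file.

    Only the leading and trailing magic bytes are inspected; the footer
    metadata and the data pages are not parsed.
    """
    # Smallest conceivable layout: header magic + 4-byte footer length + footer magic
    return (
        len(data) >= 12
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )


# Synthetic byte strings for illustration only (not real Parquet content)
print(looks_like_parquet(b"PAR1" + b"\x00" * 16 + b"PAR1"))  # True
print(looks_like_parquet(b"col_a,col_b\n1,2\n"))              # False
```

In the read example above, the check would be applied to the bytes returned by `file_stream.read()` before they are wrapped in `io.BytesIO` and passed to pd.read_parquet.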
