Exporting partitioned dataset to CSV files

Rémi
Rémi Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 2 ✭✭✭

Hello!

I'm trying to export a partitioned dataset (date range of months) to multiple CSV files.

However, I would like them all to end up in the same directory with different file names, instead of in subfolders named after the partitions.

In other words, I would like the folders in the following image to be files.

temp.png

The current pattern I use in the output folder settings is: %Y-%M/.* - which is why I get the above result.

However, when I try a pattern such as "%Y-%M.*" to get the result I want, DSS throws the following error when I run the export recipe.

temp.png

Thank you very much in advance for your answer(s)!

Rémi

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    edited July 17 Answer ✓

    Hi @Rémi,

    This is not currently possible with visual recipes, hence the error "Partitioning scheme is not representable as folders". You will need a Python recipe that sets the file name for each partition you export.

    Here is a sample Python recipe that writes one file per partition of a dataset, keeping only the partitions that match a filter (here, those starting with 2022):

    # -*- coding: utf-8 -*-
    import dataiku

    # Recipe input and output: replace the placeholders with your own ids
    output_folder_id = "replace_folder_id"
    dataset_name = "replace_dataset_name"

    output_folder = dataiku.Folder(output_folder_id)

    # List every partition of the input dataset
    dataset = dataiku.Dataset(dataset_name, ignore_flow=True)
    partitions = dataset.list_partitions(raise_if_empty=True)

    for partition in partitions:
        # Filter here, e.g. keep only the 2022 partitions
        if partition.startswith('2022'):
            # Use a fresh Dataset handle so that only this partition is read
            partition_dataset = dataiku.Dataset(dataset_name, ignore_flow=True)
            partition_dataset.add_read_partitions(partition)
            partition_df = partition_dataset.get_dataframe()
            # Upload the partition as a CSV file named after it, at the root of the managed folder
            output_folder.upload_stream(partition + ".csv", partition_df.to_csv(index=False).encode("utf-8"))

    This creates files with the full partition name:
    Screenshot 2022-06-03 at 12.20.30.png
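
The filtering and file-naming steps of the recipe above can also be tried outside DSS. The following is only a minimal sketch: `select_partitions` and `partition_to_filename` are hypothetical helper names, not part of the dataiku API, and the handling of "|" assumes that multi-dimension partition identifiers join their dimension values with that character.

```python
def select_partitions(partitions, prefix):
    """Keep only the partition identifiers that start with the given prefix."""
    return [p for p in partitions if p.startswith(prefix)]


def partition_to_filename(partition_id, extension=".csv"):
    """Build a flat file name (no subfolder) from a partition identifier."""
    # Assumption: "|" separates the dimensions of a multi-dimension partition.
    # Both "|" and "/" are replaced so the name stays a single path component.
    safe_name = partition_id.replace("|", "_").replace("/", "_")
    return safe_name + extension


# Example month partitions, as a list of identifiers like the one the recipe loops over
partitions = ["2021-12", "2022-01", "2022-02"]
file_names = [partition_to_filename(p) for p in select_partitions(partitions, "2022")]
print(file_names)
```

In the recipe, the result of `partition_to_filename(partition)` could be passed as the target path in place of `partition + ".csv"` if the partition identifiers contain characters that should not appear in a file name.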

    Let me know if that helps!

Answers

  • Rémi
    Rémi Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 2 ✭✭✭

    Thank you very much for your answer Alex!

    Your Python code is very similar to what I ended up doing to work around the problem.

    Have a great day & weekend!

    Rémi
