Exporting partitioned dataset to CSV files

Rémi · ‎06-02-2022

Hello!

I'm trying to export a partitioned dataset (date range of months) to multiple CSV files.

However, I would like them all to end up in the same directory, with different file names, instead of ending up in subfolders named after the partitions.

In other words, I would like each partition folder to be a single file instead.

The current pattern I use in the output folder settings is: %Y-%M/.* - which is why I get the above result.

However, when I try to use something like "%Y-%M.*" instead to get the result I would like, DSS throws an error ("Partitioning scheme is not representable as folders") when running the export recipe.

Thank you very much in advance for your answer(s)!

Rémi

AlexT · ‎06-03-2022

Hi @Rémi,

This is not currently possible with visual recipes, hence the error "Partitioning scheme is not representable as folders". You will have to use a Python recipe to define the file name for each partition you export.

Here is a sample Python recipe that writes one file per partition, keeping only the partitions that match a filter (here, those starting with 2022):

# -*- coding: utf-8 -*-
import dataiku

# Replace these placeholders with your own folder id and dataset name
output_folder_id = "replace_folder_id"
dataset_name = "replace_dataset_name"

# Recipe output: the managed folder that will receive the CSV files
output_folder = dataiku.Folder(output_folder_id)

# List the partitions of the input dataset (no need to load the whole dataset)
dataset = dataiku.Dataset(dataset_name, ignore_flow=True)
partitions = dataset.list_partitions(raise_if_empty=True)

for partition in partitions:
    # Filter here, e.g. keep only the 2022 partitions
    if partition.startswith('2022'):
        # Use a fresh Dataset handle so that only this partition is read
        partition_dataset = dataiku.Dataset(dataset_name, ignore_flow=True)
        partition_dataset.add_read_partitions(partition)
        partition_df = partition_dataset.get_dataframe()
        # Upload the CSV to the managed folder, named after the partition
        csv_bytes = partition_df.to_csv(index=False).encode("utf-8")
        output_folder.upload_data(partition + ".csv", csv_bytes)

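This creates one file per partition in the managed folder, each named after the full partition identifier (for example, 2022-01.csv for a monthly partition).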
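If you need a month range rather than a whole year, or your partition identifiers contain characters that are awkward in file names, the filter and the file name can be computed by small helpers. The following is only a sketch in plain Python: the helper names in_month_range and partition_to_filename are hypothetical (not part of the dataiku API), and it assumes month partition identifiers of the form YYYY-MM and a "|" separator between dimensions for multi-dimension partitions.

```python
import re


def in_month_range(partition, first, last):
    """Return True if a 'YYYY-MM' partition id lies within [first, last] (inclusive)."""
    # Zero-padded 'YYYY-MM' strings sort chronologically,
    # so a plain string comparison is enough here.
    return first <= partition <= last


def partition_to_filename(partition, extension=".csv"):
    """Build a flat file name from a partition id (hypothetical helper)."""
    # Replace anything other than letters, digits, '.', '_' and '-'
    # (for instance a '|' between dimensions) with an underscore.
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", partition)
    return safe + extension


# Example partition ids, standing in for dataset.list_partitions()
partitions = ["2021-12", "2022-01", "2022-02", "2022-07"]

selected = [p for p in partitions if in_month_range(p, "2022-01", "2022-06")]
filenames = [partition_to_filename(p) for p in selected]
print(filenames)
```

In the recipe, in_month_range would take the place of the startswith('2022') test, and partition_to_filename(partition) would replace partition + ".csv" in the upload call.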

Let me know if that helps!

Rémi · ‎06-03-2022

Thank you very much for your answer Alex!

Your Python code is very similar to what I ended up doing to bypass the problem 😉

Have a great day & weekend!

Rémi
