Exporting partitioned dataset to CSV files

Solved!
Rémi
Level 1

Hello!

 

I'm trying to export a partitioned dataset (date range of months) to multiple CSV files.

However, I would like them all to end up in the same directory, with different file names, instead of ending up in subfolders named after the partitions.

 

In other words, I would want the folders in the following image to be files.

 
 

[Image: temp.png, screenshot of the output folder with one subfolder per partition]

 

 

The current pattern I use in the output folder settings is: %Y-%M/.* - which is why I get the above result.

However, when I try a pattern such as "%Y-%M.*" to get the result I want, DSS throws the following error when running the export recipe:

[Image: temp.png, screenshot of the error message]

 

Thank you very much in advance for your answer(s)!

Rémi

AlexT
Dataiker

Hi @Rémi,

This is not currently possible with visual recipes, hence the error "Partitioning scheme is not representable as folders". You will need a Python recipe to set the name of the file exported for each partition.

Here is a sample Python recipe that writes one file per partition of a dataset, keeping only the partitions that match a filter (e.g. 2022):

# -*- coding: utf-8 -*-
import dataiku

# Recipe inputs and outputs: replace the placeholders with your own values
output_folder_id = "replace_folder_id"
dataset_name = "replace_dataset_name"

output_folder = dataiku.Folder(output_folder_id)

# List the partitions of the input dataset (no need to load the full dataset)
partitions = dataiku.Dataset(dataset_name, ignore_flow=True).list_partitions(raise_if_empty=True)

for partition in partitions:
    # Filter here, e.g. keep only the 2022 partitions
    if partition.startswith('2022'):
        # Use a fresh Dataset handle so that only this partition is read
        dataset = dataiku.Dataset(dataset_name, ignore_flow=True)
        dataset.add_read_partitions(partition)
        dataset_partition_df = dataset.get_dataframe()
        # Upload the partition as a CSV file to the managed folder
        output_folder.upload_stream(partition + ".csv", dataset_partition_df.to_csv(index=False).encode("utf-8"))

 

This creates files with the full partition name:
Screenshot 2022-06-03 at 12.20.30.png
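
If you need file names that differ from the raw partition identifier, the filtering and naming steps can be pulled out into small helpers. This is only a sketch in plain Python: `select_partitions` and `partition_to_filename` are hypothetical names, not part of the dataiku API, and replacing "|" and "/" is an assumption about which characters might show up in your partition identifiers.

```python
def select_partitions(partitions, prefix):
    """Keep only the partition identifiers that start with the given prefix."""
    return [p for p in partitions if p.startswith(prefix)]


def partition_to_filename(partition, extension=".csv"):
    """Build a flat file name from a partition identifier.

    Assumption: characters such as "|" or "/" may appear in identifiers
    and are replaced by "_" so the result is a single file name, not a path.
    """
    safe = partition.replace("|", "_").replace("/", "_")
    return safe + extension


# Example with made-up partition identifiers
partitions = ["2021-12", "2022-01", "2022-02"]
filenames = [partition_to_filename(p) for p in select_partitions(partitions, "2022")]
print(filenames)
```

In the recipe, `partition_to_filename(partition)` would then take the place of `partition + ".csv"` in the upload call.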

Let me know if that helps!


Rémi
Level 1
Author

Thank you very much for your answer Alex!

 

Your Python code is very similar to what I ended up doing to bypass the problem 😉

 

Have a great day & weekend!

 

Rémi

 

 
