Exporting partitioned dataset to CSV files
Hello!
I'm trying to export a partitioned dataset (date range of months) to multiple CSV files.
However, I would like them all to end up in the same directory, with different file names, instead of ending up in sub folders, named according to the partitions.
In other words, I would like what are currently folders (one per partition) to be files instead.
The pattern I currently use in the output folder settings is %Y-%M/.* which is why I get one sub-folder per partition.
However, when I try to use something like "%Y-%M.*" instead to get the result I want, DSS throws an error when I run the export recipe.
Thank you very much in advance for your answer(s)!
Rémi
Best Answer
Alexandru (Dataiker)
Hi @Rémi,
This is not currently possible with visual recipes, hence the error "Partitioning scheme is not representable as folders". You will have to use a Python recipe to define the name of the file that each partition is exported to.
Here is a sample Python recipe that writes one file per partition of a dataset, keeping only the partitions that match a filter (e.g. those starting with 2022):
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
output_folder_id = "replace_folder_id"
dataset_name = "replace_dataset_name"
dataset = dataiku.Dataset(dataset_name, ignore_flow=True)
df = dataset.get_dataframe()

# Write recipe outputs
output_folder = dataiku.Folder(output_folder_id)
folder_exports_info = output_folder.get_info()
partitions = dataset.list_partitions(raise_if_empty=True)
dataset_partitions_df = {}

for partition in partitions:
    # filter here e.g want 2022
    if partition.startswith('2022'):
        dataset = dataiku.Dataset(dataset_name, ignore_flow=True)
        dataset.add_read_partitions(partition)
        dataset_partition_df = dataset.get_dataframe()
        dataset_partitions_df[partition] = dataset_partition_df
        # upload the csv to managed folder:
        output_folder.upload_stream(partition + ".csv", dataset_partition_df.to_csv(index=False).encode("utf-8"))
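This creates one file per matching partition in the managed folder, named after the full partition identifier (for example, 2022-01.csv for a monthly partition).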
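If it helps to see the naming logic on its own, here is a minimal sketch that runs outside DSS. It is only an illustration under stated assumptions: plain Python dicts stand in for the partitions, a local directory stands in for the managed folder, and `flat_file_name`, `rows_to_csv_bytes` and `export_partitions` are hypothetical helpers, not part of the dataiku API. The replacement of "|" assumes that multi-dimension partition identifiers use it as a separator.

```python
import csv
import io
import os
import tempfile


def flat_file_name(partition_id, extension=".csv"):
    """Map a partition identifier to one flat file name (hypothetical helper)."""
    # Replace characters that would create sub-folders or are awkward in names.
    # Assumption: multi-dimension partition ids are separated by "|".
    safe = partition_id.replace("/", "-").replace("|", "_")
    return safe + extension


def rows_to_csv_bytes(rows, header):
    """Serialize a header and rows to UTF-8 encoded CSV bytes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def export_partitions(partitions, header, out_dir, prefix=""):
    """Write one CSV per partition whose id starts with `prefix`.

    All files land directly in `out_dir`; no sub-folders are created.
    Returns the list of file names written, in sorted order.
    """
    written = []
    for partition_id in sorted(partitions):
        if not partition_id.startswith(prefix):
            continue  # skip partitions outside the requested filter
        name = flat_file_name(partition_id)
        with open(os.path.join(out_dir, name), "wb") as handle:
            handle.write(rows_to_csv_bytes(partitions[partition_id], header))
        written.append(name)
    return written


if __name__ == "__main__":
    # Made-up sample data: partition id -> list of rows.
    sample = {
        "2021-12": [[1, 10]],
        "2022-01": [[2, 20]],
        "2022-02": [[3, 30]],
    }
    with tempfile.TemporaryDirectory() as out_dir:
        print(export_partitions(sample, ["id", "amount"], out_dir, prefix="2022"))
```

In an actual recipe, the bytes produced for each partition would be uploaded to the managed folder rather than written to a local path, as in the recipe above.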
Let me know if that helps!
Answers
Rémi
Thank you very much for your answer Alex!
Your Python code is very similar to what I ended up doing to work around the problem.
Have a great day & weekend!
Rémi