
Partitioning not working on split recipe


Hi Community,


I am using a Split recipe in my Flow. The input to the split recipe is a partitioned dataset.

Input dataset to Split recipe

  • Comes from an upstream task in the Flow and is stored on the server's filesystem (filesystem_managed)
  • Partitioned on one column that has ten discrete values
  • Dataset -> Settings -> Partitioning -> List Partitions correctly lists ten partitions along with file size

Output dataset from Split recipe

  • Also stored on the server's filesystem (filesystem_managed)
  • Partitions are not available: Dataset -> Settings -> Partitioning -> List Partitions shows only one partition, "Any", instead of the expected ten.


Settings in the Split Recipe

The Split recipe splits records based on the values of two flag columns. If train_flag is True, the record is sent to one output dataset; if test_flag is True, the record is sent to the other output dataset. All other records are dropped. In the recipe's Input/Output settings, both outputs are shown as "Partitioned by" the desired column, which is the same column I want the output datasets to be partitioned on.
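To make the intended behaviour concrete, here is a minimal plain-Python sketch of the routing I expect. It does not use any Dataiku API; the function names and the `region` partition column are hypothetical placeholders, and it assumes the train rule is checked first when both flags are True.

```python
from collections import defaultdict

# Hypothetical name standing in for the real partitioning column.
PARTITION_COLUMN = "region"


def split_records(records):
    """Route each record to the train or test output; drop everything else.

    Assumption: the train rule is evaluated first, so a record with both
    flags set to True goes to the train output only.
    """
    train, test = [], []
    for row in records:
        if row.get("train_flag") is True:
            train.append(row)
        elif row.get("test_flag") is True:
            test.append(row)
        # Records matching neither flag are dropped.
    return train, test


def group_by_partition(records, column=PARTITION_COLUMN):
    """Group records by the value of the partitioning column."""
    partitions = defaultdict(list)
    for row in records:
        partitions[row[column]].append(row)
    return dict(partitions)


# Tiny illustrative input, not real data.
rows = [
    {"region": "A", "train_flag": True, "test_flag": False},
    {"region": "A", "train_flag": False, "test_flag": True},
    {"region": "B", "train_flag": True, "test_flag": False},
    {"region": "B", "train_flag": False, "test_flag": False},
]

train, test = split_records(rows)
train_partitions = group_by_partition(train)
test_partitions = group_by_partition(test)
```

In other words, I expect each output dataset to keep one partition per value of the partitioning column that survives the split, rather than a single "Any" partition.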


We need both output datasets of the Split recipe to be partitioned on the same column as the input dataset, because they will later be used to train split models. What is the issue, and why are my output datasets not partitioned?

Operating system used: Red Hat Enterprise Linux
