Partitioning not working on split recipe

pratikgujral-sf
Level 2

Hi Community,

 

I am using a Split recipe in my Flow. The input to the split recipe is a partitioned dataset.

Input dataset to Split recipe

  • comes from an upstream task in the Flow and is stored on Server's Filesystem (filesystem_managed)
  • Partitioned on one column that has ten discrete values
  • Dataset -> Settings -> Partitioning -> List Partitions correctly lists ten partitions along with file size

Output dataset from Split recipe

  • Also stored on Server's Filesystem (filesystem_managed)
  • Partitions are not available. Dataset -> Settings -> Partitioning -> List Partitions shows only one partition, "Any", instead of the expected ten.

 

Settings in the Split Recipe

The Split recipe splits records based on the values of two flag columns. If the value in the train_flag column is True, the record is sent to one output dataset; if the value in the test_flag column is True, the record is sent to the other output dataset. All other records are dropped. Furthermore, in the Input/Output settings, I see "Partitioned by" followed by the desired column, which is the column on which I want the output datasets to be partitioned.
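
In plain Python, the rule I described amounts to roughly the following. This is only a sketch for illustration, not what the recipe actually runs: the function name `split_records`, the `region` partition column, and the sample rows are made up, and I am assuming that a row with both flags set goes to the train output (first matching rule wins).

```python
def split_records(records):
    """Route each record to the train or test output, or drop it.

    Assumption: if both flags are True, the train rule matches first.
    """
    train, test = [], []
    for row in records:
        if row.get("train_flag") is True:
            train.append(row)
        elif row.get("test_flag") is True:
            test.append(row)
        # Rows with neither flag set are dropped.
    return train, test


# Hypothetical sample rows; "region" stands in for the partition column.
rows = [
    {"id": 1, "region": "A", "train_flag": True, "test_flag": False},
    {"id": 2, "region": "A", "train_flag": False, "test_flag": True},
    {"id": 3, "region": "B", "train_flag": False, "test_flag": False},
    {"id": 4, "region": "B", "train_flag": True, "test_flag": False},
]
train_rows, test_rows = split_records(rows)
```

Note that the partition column stays in every output row; the split itself never looks at it.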

 

We need the two output datasets of the split recipe to be partitioned as well, on the same column as the input dataset. This is later to be used to train split models. What is the issue, and why are my output datasets not partitioned?


Operating system used: Red Hat Enterprise Linux

1 Reply
AlexT
Dataiker

Hi @pratikgujral-sf,

From what you describe, it sounds like you may not have selected the correct partition dependencies in your Split recipe. Can you check that "Equals" is selected in the "Input / Output" tab of the Split recipe?
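
As a rough mental model of what an "Equals" dependency means (this is a plain-Python sketch, not Dataiku's implementation; `split_rows`, `build_outputs`, and the sample data are hypothetical names used only for illustration): each output partition is built from the input partition with the same identifier, so the outputs end up with the same set of partitions as the input.

```python
def split_rows(rows):
    """Apply the flag-based split to the rows of a single partition."""
    train = [r for r in rows if r["train_flag"]]
    test = [r for r in rows if not r["train_flag"] and r["test_flag"]]
    return train, test


def build_outputs(input_partitions):
    """Build output partition X only from input partition X ("Equals")."""
    train_out, test_out = {}, {}
    for partition_id, rows in input_partitions.items():
        train_out[partition_id], test_out[partition_id] = split_rows(rows)
    return train_out, test_out


# Hypothetical input: partition identifier -> rows in that partition.
input_partitions = {
    "A": [
        {"id": 1, "train_flag": True, "test_flag": False},
        {"id": 2, "train_flag": False, "test_flag": True},
    ],
    "B": [
        {"id": 3, "train_flag": True, "test_flag": False},
    ],
}
train_out, test_out = build_outputs(input_partitions)
```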

Note that if a filesystem dataset is not partitioned, you need to partition it first using a Sync recipe with the redispatch option:
https://knowledge.dataiku.com/latest/mlops-o16n/partitioning/concept-redispatch.html#:~:text=The%20R....
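
Conceptually, redispatching reads the rows of an unpartitioned dataset and writes each row to the partition named by the value in its partition column. A minimal plain-Python sketch of that idea follows; it is not Dataiku code, and `redispatch`, the `region` column, and the sample rows are assumptions for illustration only.

```python
from collections import defaultdict


def redispatch(rows, partition_column):
    """Group rows into partitions keyed by the value of partition_column."""
    partitions = defaultdict(list)
    for row in rows:
        # The partition identifier is the row's value in the chosen column.
        partitions[str(row[partition_column])].append(row)
    return dict(partitions)


# Hypothetical unpartitioned rows.
rows = [
    {"id": 1, "region": "A"},
    {"id": 2, "region": "B"},
    {"id": 3, "region": "A"},
]
partitions = redispatch(rows, "region")
```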

If this doesn't help and this is still an issue for you, I would suggest opening a support ticket with the job diagnostics so we can review further.

Thanks
