How to stack partitioned datasets with inconsistent partitions?
Jean
Hi,
Let's say we have 2 datasets, partitioned by letters :
Dataset 1 :
- partition A
- partition B
Dataset 2 :
- partition B
- partition C
I would like to get a "summed" dataset, where existing partitions are stacked.
Dataset 3 :
- partition A (from 1)
- partition B (from 1 + 2)
- partition C (from 2)
Simply stacking those two datasets fails for partitions A and C, since one of the sources is empty for each of them.
Can I do that without scenario variables?
Answers
Hi,
You can make this work with a Python recipe that uses the dataiku API to create the missing partitions on the two input datasets (1 and 2): you just need to define an empty file for each missing partition. The way you have done it, with "append instead of overwrite", is OK, but it is less robust than a Python recipe, which controls the set of input partitions. Imagine someone uses "append", discovers a bug, fixes it, and runs the flow again: the output dataset would then contain both the buggy and the correct data.
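To illustrate the "controls the set of input partitions" part, here is a minimal sketch in plain Python. It is not the exact recipe and it does not create empty files: it only shows how to compute, for each output partition, which inputs really have it, and to read from those only. All names (`plan_stack`, `stack_partition`, `load_toy_rows`, `dataset_1`, `dataset_2`) are hypothetical. The sketch does not import `dataiku`; in a real recipe, the partition lists could presumably come from `dataiku.Dataset(name).list_partitions()`, and the rows from reading each dataset restricted to one partition.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def plan_stack(partitions_by_dataset: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each partition in the union to the datasets that actually have it."""
    plan: Dict[str, List[str]] = {}
    for dataset_name, partitions in partitions_by_dataset.items():
        for partition in partitions:
            plan.setdefault(partition, []).append(dataset_name)
    # Sort by partition id so the build order is deterministic.
    return {partition: plan[partition] for partition in sorted(plan)}


def stack_partition(
    partition: str,
    sources: List[str],
    load_rows: Callable[[str, str], List[Row]],
) -> List[Row]:
    """Concatenate the rows of one partition, reading only from existing sources."""
    rows: List[Row] = []
    for dataset_name in sources:
        rows.extend(load_rows(dataset_name, partition))
    return rows


# Toy stand-in for the two partitioned datasets (hypothetical data).
TOY_ROWS: Dict[tuple, List[Row]] = {
    ("dataset_1", "A"): [{"value": "a1"}],
    ("dataset_1", "B"): [{"value": "b1"}],
    ("dataset_2", "B"): [{"value": "b2"}],
    ("dataset_2", "C"): [{"value": "c2"}],
}


def load_toy_rows(dataset_name: str, partition: str) -> List[Row]:
    # In a real recipe this would read one partition of a dataset instead.
    return TOY_ROWS[(dataset_name, partition)]


available = {"dataset_1": ["A", "B"], "dataset_2": ["B", "C"]}
plan = plan_stack(available)
stacked = {p: stack_partition(p, srcs, load_toy_rows) for p, srcs in plan.items()}

for partition, rows in stacked.items():
    print(partition, plan[partition], rows)
```

Because each output partition is rebuilt in full from its existing sources and then overwritten, re-running after a bug fix replaces the old rows instead of adding to them, which is the robustness issue with "append".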
Cheers,
Alex