Joining partitioned and non-partitioned datasets
Hi - I have been trying to run the example tutorial for file-based partitioning in the Dataiku Academy.
However, unlike the tutorial, I do not want to specify any values as target partitions in the 'Run Recipe Options'. I want to join the datasets based on all partitions that exist in the 'transactions_copy' dataset, so that the resulting 'transactions_joined' dataset has the same partitioning as 'transactions_copy'. I thought I could achieve this by choosing 'all available' as the partition dependency function, but DSS automatically fills the target_identifier parameter with the date on which the join recipe runs, so all the data is stored as one partition in the 'transactions_joined' dataset (only one activity is run in the job).
How can I work around this so that the output of the join recipe has the same partitioning scheme as 'transactions_copy'?
Operating system used: linux.x86_64
Answers
Miguel Angel Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 118 Dataiker
Hi,
In the exercise, changing the recipe's partition dependency to 'all available' stops the partitioning from propagating to 'transactions_joined'. The output therefore becomes a non-partitioned dataset, and building it is a single-activity job. You can go to the dataset settings and explicitly specify the partitioning setup.
Using 'all available' gathers all the partitions from the input dataset for each partition of the output dataset. This way it does not ask for partition identifiers. However, this means that all partitions of the input will be used to build each individual partition of the output.
On the other hand, the 'equals' partition dependency that the exercise uses propagates the partitioning to 'transactions_joined'. Moreover, this dependency establishes a one-to-one relation between input and output partitions, i.e. partition '2017-01-01' of the input is used to build partition '2017-01-01' of the output, and so forth.
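To make the difference between the two dependency functions concrete, here is a minimal sketch in plain Python. It does not call any DSS API; the function names and the three sample partitions are made up for illustration, and it only models which input partitions would be read for a given output partition.

```python
# Illustrative model only: these helpers are hypothetical, not part of DSS.

def equals_dependency(target_partition, input_partitions):
    """'equals': an output partition reads only the input partition with the same identifier."""
    return [p for p in input_partitions if p == target_partition]


def all_available_dependency(target_partition, input_partitions):
    """'all available': every output partition reads every input partition."""
    return list(input_partitions)


# Sample day partitions (assumed values, in the style of the exercise)
input_partitions = ["2017-01-01", "2017-01-02", "2017-01-03"]

equals_map = {t: equals_dependency(t, input_partitions) for t in input_partitions}
all_available_map = {t: all_available_dependency(t, input_partitions) for t in input_partitions}

for target in input_partitions:
    print(target, "equals ->", equals_map[target], "| all available ->", all_available_map[target])
```

With 'equals', each output partition depends on exactly one input partition, which is why the output can keep the same partitioning as the input.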
With this configuration DSS expects a partition identifier. To avoid naming individual partitions, you can enter a range. For example, in the exercise you can select all the partitions with: 2017-01-01/2018-04-30
More information about partition dependencies and identifiers can be found in the documentation:
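As a rough illustration of what such a range stands for, the sketch below expands it into individual day identifiers in plain Python. It assumes day-level partitioning and that both ends of the range are included; `expand_day_range` is a hypothetical helper written for this post, not a DSS function.

```python
from datetime import date, timedelta


def expand_day_range(spec):
    """Expand 'YYYY-MM-DD/YYYY-MM-DD' into a list of daily identifiers (both ends included)."""
    start_text, end_text = (part.strip() for part in spec.split("/"))
    start = date.fromisoformat(start_text)
    end = date.fromisoformat(end_text)
    if end < start:
        raise ValueError("range end is before range start")
    number_of_days = (end - start).days + 1
    return [(start + timedelta(days=i)).isoformat() for i in range(number_of_days)]


partitions = expand_day_range("2017-01-01/2018-04-30")
print(len(partitions), partitions[0], partitions[-1])
```

Under the 'equals' dependency, each of these identifiers would correspond to one output partition built from the input partition of the same day.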
https://doc.dataiku.com/dss/latest/partitions/dependencies.html
https://doc.dataiku.com/dss/latest/partitions/identifiers.html