NOTE: this post deals with files-based partitioning
Partitioning can be intimidating for beginners because of its numerous options and the surprising behaviours that result (see this thread for an example). Since the sync recipe is the entry point for partitioning datasets, it is important that users feel comfortable using it. However, we have identified several aspects of the sync recipe that, in our opinion, could be improved to avoid confusion.
1st improvement : remove the partition configuration placeholder next to the gear wheel when partitioning a dataset with the "redispatch" option
When partitioning a dataset using a sync recipe with the "redispatch" option activated, the partition placeholder above the "RUN" button (which can be configured using the gear wheel) is compulsory (the sync recipe cannot be run otherwise), yet its content is irrelevant!
This issue is pointed out in this hands-on tutorial note: "The Recipe run option requires you to define a partition to build. The value you specify does not impact the output, it is simply a “dummy” value. The Partition Redispatch feature computes all possible partitions within the data, building all partitions at once, regardless of the value you specify."
Nonetheless, removing the partition placeholder when the "redispatch" option is activated would avoid any confusion.
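To illustrate why the value in the placeholder has no effect, here is a toy model of redispatching in plain Python. It is only a sketch of the behaviour described in the tutorial note, not Dataiku's implementation: the `redispatch` function, its parameters and the sample rows are all hypothetical.

```python
from collections import defaultdict


def redispatch(rows, partition_column, requested_partition=None):
    """Toy model of redispatch: group every row by its partition value.

    `requested_partition` stands in for the compulsory placeholder value.
    It is deliberately ignored, as the tutorial note describes.
    """
    partitions = defaultdict(list)
    for row in rows:
        # The partition identifier comes from the data itself.
        partitions[str(row[partition_column])].append(row)
    return dict(partitions)


# Hypothetical sample data.
rows = [
    {"year": 2020, "value": 1},
    {"year": 2021, "value": 2},
    {"year": 2020, "value": 3},
]

# Whatever value is typed in the placeholder, the result is the same.
assert redispatch(rows, "year", "dummy") == redispatch(rows, "year", "2020")
```

Since the requested value never enters the computation, asking the user for it only adds a step that can be misread as meaningful.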
2nd improvement : prevent the sync recipe from running and display an explicit error in the following cases :
3rd improvement : make Dataiku insensitive to quotes when listing partitions in the partition configuration placeholder (concerns all kinds of recipes, not just the sync recipe)
At least for us, it made sense to list partitions by surrounding them with quotes as we thought we were manipulating strings (especially if the partitions are numbers stored in a string format).
However, Dataiku fails to recover the partitions when quotes are used; we have to remove them to make the recipe work.
It would be nice if Dataiku made this transparent (i.e. made the recipe work regardless of whether quotes are present).
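As an illustration of the kind of normalisation we have in mind, here is a minimal Python sketch. It assumes the placeholder holds a simple comma-separated list of partition identifiers; `normalize_partition_list` is a hypothetical helper, not an existing Dataiku function.

```python
def normalize_partition_list(spec):
    """Split a comma-separated partition list and drop optional quotes.

    Hypothetical helper: it assumes a plain comma-separated list and
    does not handle ranges or multi-dimension partition identifiers.
    """
    partitions = []
    for raw in spec.split(","):
        item = raw.strip()
        # Remove one pair of matching single or double quotes, if present.
        if len(item) >= 2 and item[0] == item[-1] and item[0] in ("'", '"'):
            item = item[1:-1].strip()
        if item:
            partitions.append(item)
    return partitions


# Quoted and unquoted inputs yield the same identifiers.
assert normalize_partition_list('"2020","2021"') == ["2020", "2021"]
assert normalize_partition_list("2020,2021") == ["2020", "2021"]
```

With a step like this applied before the partitions are resolved, users who think of partitions as strings and users who type them bare would get the same result.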
4th improvement : place the dependency relationship between the input and output, not underneath the input dataset
Placing the dependency relationship under the input dataset gives the impression that the partitions will be built using the partitions of the input dataset.
However, the partitions can be built based on the existing output partitions (as in the above screenshot). So the user can start with a false assumption about how the partitions will be built (at least that is how my colleagues and I understood things at first).
5th improvement : improve the TEST button under the dependency relationship
The text displayed by the TEST button is not always very informative. See for example the following screenshot for the « EQUALS » relationship :
It would be nice to provide more information, with potential caveats to keep in mind (for example, in the above example, the user may not be aware that a partition present in the input dataset but absent from the output dataset will not be built when this recipe is run with a recursive build).
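To make the caveat concrete, here is a small Python sketch of the kind of preview the TEST button could display for the « EQUALS » relationship. It only models the behaviour described in this post (partitions are driven by the existing output partitions); `preview_equals_build` and the partition values are hypothetical, and this is not how Dataiku computes dependencies internally.

```python
def preview_equals_build(input_partitions, output_partitions):
    """Toy preview for an EQUALS dependency driven by output partitions."""
    inp, out = set(input_partitions), set(output_partitions)
    return {
        # Existing output partitions that have a matching input partition.
        "buildable": sorted(inp & out),
        # Input partitions with no output counterpart: the caveat above.
        "in_input_only_not_built": sorted(inp - out),
        # Output partitions whose matching input partition is missing.
        "in_output_only": sorted(out - inp),
    }


# Hypothetical partition lists.
preview = preview_equals_build(
    input_partitions=["2020", "2021", "2022"],
    output_partitions=["2020", "2021"],
)
assert preview["in_input_only_not_built"] == ["2022"]
```

Seeing partition "2022" listed as not built would warn the user before the recipe is run, rather than after.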
Maybe we could even imagine that Dataiku would list the partitions that would be built by :
This feature would greatly help users understand how the dependency mechanism works and save them time by avoiding later surprises.