Improve the sync recipe when working with partitions

tanguy · ‎10-18-2022

NOTE: this post deals with file-based partitioning

Partitioning can be an intimidating field for beginners because of the numerous options and the surprising behaviours that result (see this thread for an example). The sync recipe being the entry point for partitioning datasets, it is important for users to feel comfortable when using it. However, we have identified several aspects of the sync recipe that, in our opinion, could be improved to avoid confusion.

1st improvement : remove the partition configuration placeholder next to the gear wheel when partitioning a dataset with the "redispatch" option

When partitioning a dataset using a sync recipe with the "redispatch" option activated, the partition placeholder above the "RUN" button (which can be configured using the gear wheel) is compulsory (the sync recipe cannot be run otherwise), but its content is irrelevant!

This issue is pointed out in this hands-on tutorial note: "The Recipe run option requires you to define a partition to build. The value you specify does not impact the output, it is simply a “dummy” value. The Partition Redispatch feature computes all possible partitions within the data, building all partitions at once, regardless of the value you specify."

Nonetheless, removing the partition placeholder when the "redispatch" option is activated would avoid any confusion.
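To illustrate why the requested value is irrelevant, here is a minimal toy model of redispatch in plain Python. It is only a sketch of the behaviour described in the tutorial note, not Dataiku's implementation: the function name `redispatch`, its parameters and the sample rows are all hypothetical.

```python
from collections import defaultdict


def redispatch(rows, partition_column, requested_partition=None):
    """Toy model of redispatch: group rows by the value of partition_column.

    requested_partition is accepted (the recipe demands one) but is
    deliberately ignored, mirroring the "dummy value" described above.
    """
    partitions = defaultdict(list)
    for row in rows:
        # Every distinct value found in the data becomes a partition.
        partitions[str(row[partition_column])].append(row)
    return dict(partitions)


# Hypothetical sample data.
rows = [
    {"country": "FR", "amount": 10},
    {"country": "US", "amount": 7},
    {"country": "FR", "amount": 3},
]

result_a = redispatch(rows, "country", requested_partition="FR")
result_b = redispatch(rows, "country", requested_partition="dummy")

print(sorted(result_a))      # ['FR', 'US']
print(result_a == result_b)  # True
```

Both calls build the same partitions, which is why asking the user for a value in this situation is confusing.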

2nd improvement : prevent the sync recipe from running and display an explicit error in the following cases :

- When partitioning a non-partitioned dataset with a partitioning pattern that relies on dimension(s) that do not exist in the input dataset --> this is a dead end: the job will fail to build the dataset.
- When using a sync recipe on an already partitioned dataset : prevent the user from activating the « redispatch » option --> this is also a dead end: Dataiku will not build the dataset (but this is a silent error: the job will not fail!)
  - Note : we have actually done this with a large partitioned dataset and crashed our Dataiku instance. More precisely, we accidentally used the sync recipe with the « ALL AVAILABLE » dependency relationship and the « redispatch » option on. The job ran for hours, raising a warning for each input row, and the server’s disk got saturated by the logs.
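The kind of pre-run check we have in mind could look like the sketch below. It is plain Python and does not use any Dataiku API: the function `check_sync_settings`, its parameters and the error messages are hypothetical and only encode the two cases listed above.

```python
def check_sync_settings(input_columns, input_is_partitioned,
                        partition_dimensions, redispatch):
    """Return a list of explicit error messages; an empty list means OK."""
    errors = []

    # Case 2: redispatching an input that is already partitioned.
    if input_is_partitioned and redispatch:
        errors.append(
            "Redispatch cannot be used on an input dataset that is "
            "already partitioned."
        )

    # Case 1: the output pattern needs dimensions the input does not have.
    if not input_is_partitioned:
        missing = [d for d in partition_dimensions if d not in input_columns]
        if missing:
            errors.append(
                "Partitioning dimension(s) missing from the input dataset: "
                + ", ".join(missing)
            )

    return errors


# Hypothetical settings: the output is partitioned by "country",
# but the input has no such column.
errors = check_sync_settings(
    input_columns=["id", "amount"],
    input_is_partitioned=False,
    partition_dimensions=["country"],
    redispatch=True,
)
for message in errors:
    print(message)
```

Failing fast with a message like this would have saved us from the silent failure and the saturated disk described in the note.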

3rd improvement : make Dataiku insensitive to quotes when listing partitions in the partition configuration placeholder (concerns all kinds of recipes, not just the sync recipe)

At least for us, it made sense to list partitions by surrounding them with quotes as we thought we were manipulating strings (especially if the partitions are numbers stored in a string format).

However, Dataiku fails to recover the partitions when quotes are used; we have to remove them to make the recipe work.

It would be nice if Dataiku made this transparent (i.e. made the recipe work regardless of the presence or absence of quotes).
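As a rough illustration of the normalisation we are asking for, here is a small sketch in plain Python. It assumes the partitions are typed as a comma-separated list (an assumption made for this example), and the helper name `normalize_partition_list` is hypothetical; it is not part of Dataiku.

```python
def normalize_partition_list(spec):
    """Strip matching surrounding quotes from each comma-separated item."""
    cleaned = []
    for item in spec.split(","):
        item = item.strip()
        # Remove one pair of matching single or double quotes, if present.
        if len(item) >= 2 and item[0] == item[-1] and item[0] in "'\"":
            item = item[1:-1].strip()
        if item:
            cleaned.append(item)
    return ",".join(cleaned)


quoted = normalize_partition_list('"2021", "2022"')
unquoted = normalize_partition_list("2021,2022")

print(quoted)              # 2021,2022
print(quoted == unquoted)  # True
```

With something like this applied to the placeholder, both ways of writing the list would resolve to the same partitions.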

4th improvement : place the dependency relationship between the input and output, not underneath the input dataset

Placing the dependency relationship under the input dataset gives the impression that the partitions will be built using the partitions of the input dataset.

However, the partitions can be built based on the existing output partitions (as in the above screenshot). So the user can start with a wrong assumption about how the partitions will be built (at least that was how my colleagues and I understood things at first).

5th improvement : improve the TEST button under the dependency relationship

The text displayed by the TEST button is not always very informative. See for example the following screenshot for the « EQUALS » relationship :

It would be nice to provide more information, with potential caveats to keep in mind (for instance, in the above example, the user may not be aware that a partition present in the input dataset but absent from the output dataset will not be built when this recipe is run with a recursive build).

Maybe we could even imagine that Dataiku would list the partitions that would be built by :

- parsing the existing partitions
- taking an input or output partition selected by the user
- listing the corresponding input/output partitions

This feature would greatly help users better understand how the dependency mechanism works and save them time by avoiding later surprises.
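To make the idea more concrete, here is a toy sketch of such a preview in plain Python. It is a simplified model of the dependency mechanism as we understand it, not Dataiku's actual logic: the function `preview_build`, the lowercase relationship names and the sample partitions are all hypothetical.

```python
def preview_build(relationship, selected_output,
                  input_partitions, output_partitions):
    """Preview which input partitions a given output partition would need."""
    input_partitions = set(input_partitions)
    output_partitions = set(output_partitions)

    if relationship == "equals":
        # The output partition depends on the input partition of the same name.
        required = {selected_output}
    elif relationship == "all_available":
        # The output partition depends on every existing input partition.
        required = set(input_partitions)
    else:
        raise ValueError("Unsupported relationship: " + relationship)

    return {
        "required_inputs": sorted(required),
        # Required inputs that do not exist yet.
        "missing_inputs": sorted(required - input_partitions),
        # Caveat: input partitions with no counterpart in the output.
        "inputs_without_output": sorted(input_partitions - output_partitions),
    }


# Hypothetical partitions: "2022-03" exists in the input only.
preview = preview_build(
    "equals",
    selected_output="2022-01",
    input_partitions=["2022-01", "2022-02", "2022-03"],
    output_partitions=["2022-01", "2022-02"],
)
print(preview["required_inputs"])        # ['2022-01']
print(preview["inputs_without_output"])  # ['2022-03']
```

Showing something like `inputs_without_output` next to the TEST button would surface the caveat mentioned above before the job is launched.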

cc @ElieA

ElieA · ‎11-08-2022

Thanks for your idea, @tanguy. Your idea meets the criteria for submission; we'll reach out should we require more information.

If you’re reading this and think this would be a great capability to add to DSS, be sure to kudos the original post!

Take care


Labels

Data Exploration and Preparation

Designer Experience
