Condensed dataset naming grammar

Nate · February 2022

The default Dataiku dataset naming scheme is meaningful and rich for tracing the lineage of a dataset, and it serves as a good start before a name is manually defined. However, it also tends to get quite lengthy and sometimes even overflows my database's character limit.

It'd be great if, for large projects, there were some more condensed grammar options at the project level for automatic naming.

For super condensed naming, each of the actions could simply be replaced with a symbol.

MASTSCHD_1_copy_by_line_num_stacked_by_line_num_joined_filtered_by_model_joined_by_load_min_min_joined_prepared could become something like MASTSCHD_1_cgUgxfgxgxT.
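To make the idea concrete, here is a rough Python sketch of how such a symbol mapping might work. Nothing here is an existing Dataiku feature or API: the `condense` function, the symbol table, and the rule that every `by_<column>` segment collapses to `g` are my own assumptions, inferred from the example above.

```python
# Hypothetical symbol table, inferred from the example name above.
ACTION_SYMBOLS = {
    "copy": "c",
    "stacked": "U",
    "joined": "x",
    "filtered": "f",
    "prepared": "T",
}
BY_SYMBOL = "g"  # assumed: any "by_<column>" segment collapses to one symbol


def condense(name: str, base: str) -> str:
    """Condense the auto-generated suffix that follows the base dataset name.

    Returns the name unchanged if it does not start with `base` or if the
    suffix contains a token this sketch does not recognise.
    """
    prefix = base + "_"
    if not name.startswith(prefix):
        return name

    symbols = []
    in_by_clause = False
    for token in name[len(prefix):].split("_"):
        if token in ACTION_SYMBOLS:
            symbols.append(ACTION_SYMBOLS[token])
            in_by_clause = False
        elif token == "by":
            symbols.append(BY_SYMBOL)
            in_by_clause = True
        elif in_by_clause:
            continue  # part of a column name, dropped in the condensed form
        else:
            return name  # unknown token: leave the name as it is
    return prefix + "".join(symbols)


long_name = (
    "MASTSCHD_1_copy_by_line_num_stacked_by_line_num_joined_"
    "filtered_by_model_joined_by_load_min_min_joined_prepared"
)
print(condense(long_name, "MASTSCHD_1"))  # MASTSCHD_1_cgUgxfgxgxT
```

One limitation of parsing the name after the fact: a column that happens to be called `joined` or `copy` would be misread as an action, so a real implementation would presumably generate the condensed name from the recipe chain itself rather than from the long name.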

Then, for adding deeper meaning to the naming, there could be options to, for example, include the names of multiple datasets involved in a single recipe, or for filtered, grouped, or partitioned columns to be included in the dataset name without making it significantly longer.

For example, joins could be represented with the letter x and stacks with the letter U. So customers_joined could become customers__x__transactions, and open_orders_stacked could become open_orders__U__filled_orders.
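As a minimal sketch of that second option, assuming the recipe type and its input dataset names are known when the output is named (the `multi_input_name` helper and the `RECIPE_SYMBOLS` table are hypothetical, not part of Dataiku):

```python
# Hypothetical mapping from recipe type to its condensed symbol.
RECIPE_SYMBOLS = {"join": "x", "stack": "U"}


def multi_input_name(recipe_type: str, inputs: list[str]) -> str:
    """Build an output name that lists every input, separated by the symbol."""
    symbol = RECIPE_SYMBOLS[recipe_type]
    return f"__{symbol}__".join(inputs)


print(multi_input_name("join", ["customers", "transactions"]))
# customers__x__transactions
print(multi_input_name("stack", ["open_orders", "filled_orders"]))
# open_orders__U__filled_orders
```

The double underscores keep the symbol visually separate from dataset names that already contain single underscores.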

These options would be useful in very large projects. While automatic naming is not a substitute for manually naming important datasets, in projects with even just hundreds of steps, automatic naming of intermediate datasets between major points is a key feature, and it could become a little more flexible to handle projects of different scales. The option for condensed grammar may also help to further distinguish key, manually named datasets from intermediate steps.

In the Backlog · Last Updated February 2022
