Condensed dataset naming grammar

natejgardner · 02-23-2022

The default Dataiku dataset naming scheme is meaningful and rich for tracing the lineage of a dataset, and it serves as a good start before a name is manually defined. However, it also tends to get quite lengthy and sometimes even overflows my database's character limit.

It'd be great if, for large projects, there were some more condensed grammar options at the project level for automatic naming.

For super condensed naming, each of the actions could simply be replaced with a symbol.

MASTSCHD_1_copy_by_line_num_stacked_by_line_num_joined_filtered_by_model_joined_by_load_min_min_joined_prepared could become something like MASTSCHD_1_cgUgxfgxgxT.
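To make the idea concrete, here's a rough Python sketch of how such a condensing rule might work. This is not a Dataiku API; the symbol table is just inferred from the example above (c for copy, g for a "by_<column>" grouping step, U for stacked, x for joined, f for filtered, T for prepared), and `condense_name` is a made-up helper. It assumes the base name and column names don't themselves contain one of the action keywords.

```python
# Hypothetical symbol table, inferred from the example in this post.
ACTION_SYMBOLS = {
    "copy": "c",
    "by": "g",        # "by_<column>" is assumed to mark a grouping step
    "stacked": "U",
    "joined": "x",
    "filtered": "f",
    "prepared": "T",
}


def condense_name(name: str) -> str:
    """Collapse the chain of auto-generated suffixes into one symbol string."""
    base = []     # tokens before the first recognised action keyword
    symbols = []  # one symbol per recognised action keyword
    for token in name.split("_"):
        if token in ACTION_SYMBOLS:
            symbols.append(ACTION_SYMBOLS[token])
        elif not symbols:
            base.append(token)
        # Other tokens after the first action (e.g. column names) are dropped.
    if not symbols:
        return name  # nothing to condense
    return "_".join(base) + "_" + "".join(symbols)


long_name = (
    "MASTSCHD_1_copy_by_line_num_stacked_by_line_num_joined_filtered"
    "_by_model_joined_by_load_min_min_joined_prepared"
)
print(condense_name(long_name))  # MASTSCHD_1_cgUgxfgxgxT
```

A real implementation would presumably work from the recipe types in the flow rather than parsing names, which would avoid the keyword ambiguity.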

Then, for adding deeper meaning to the naming, there could be options to, for example, include the names of multiple datasets involved in a single recipe, or for filtered, grouped, or partitioned columns to be included in the dataset name without making it significantly longer.

For example, joins could be represented with the letter x and stacks with the letter U. So customers_joined could become customers__x__transactions, and open_orders_stacked could become open_orders__U__filled_orders.
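As a sketch of that second option, something like the following could build the name from the recipe's input datasets. Again, this is only an illustration: `relational_name` and the `RECIPE_SYMBOLS` table are hypothetical, not anything Dataiku provides.

```python
# Hypothetical mapping from recipe type to the symbol proposed above.
RECIPE_SYMBOLS = {"join": "x", "stack": "U"}


def relational_name(inputs: list[str], recipe_type: str) -> str:
    """Name an output dataset after all of the recipe's input datasets."""
    symbol = RECIPE_SYMBOLS[recipe_type]
    # Double underscores set the symbol apart from underscores in dataset names.
    return f"__{symbol}__".join(inputs)


print(relational_name(["customers", "transactions"], "join"))
# customers__x__transactions
print(relational_name(["open_orders", "filled_orders"], "stack"))
# open_orders__U__filled_orders
```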

These options would be useful in very large projects. While it's not a substitute for manually naming important datasets, in projects with even just hundreds of steps, automatic naming of intermediate datasets between major points is a key feature that could become a little more flexible to handle multiple scales of projects. The option for condensed grammar may also help to further distinguish key, manually-named datasets from intermediate steps.

ElisaS · ‎03-09-2022

Labels

Data Exploration and Preparation

platform and infrastructure