Smart indexing: recommend index based on downstream recipes
It would be helpful if on the index selection menu for a dataset, some smart values could be displayed based on downstream recipes, and if in the recipe creation views, upstream datasets could be reindexed to optimize them as well.
For example, after I've created two join recipes downstream of a dataset, on the index screen in dataset settings, I'd like join columns from each recipe to be highlighted, suggesting index values based on which joins would be optimized. Same goes for filters, windows, groups, etc. If combined with the
This could also be used to recommend secondary indexes (including value-ordered NUSIs for inequality joins and filters).
Similarly, if I'm creating a recipe, I'd like a button to automatically optimize the indexes of some or all upstream datasets for my recipe. Or in the case of a join recipe, for a specific join.
Comments
-
I came here to post this. In a way, I'm disappointed this was requested so long ago and nobody has given it much love. I'm on MSSQL, and in a few cases I've had to create indexes manually in the pre- and post- build scripts section. It's tedious to do, but sometimes is essential to making a downstream operation efficient. Please add automagic index creation/use.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,088 Neuron
It takes years for someone to learn enough of a database technology to master the art of balancing creating indexes for performance and keeping inserts/updates fast and storage costs under control. Indexes are also not a magic solution and often need to be changed due to query pattern changes, rebuilt if they get corrupted, stats recalculated or simply dropped if they don't get used enough. None of this is trivial to automate because if it was the database vendors would have done it themselves and they clearly have not.
Expecting that Dataiku does this for all the legacy database technologies they support (or the major ones) in an "automagic" way is really unrealistic and utopian. Even if it was possible, which it is not, the cost would certainly not justify the return on investment.
The final point to make is that the market is moving away from these legacy database technologies into newer technologies which do not use indexes for performance like Databricks, Snowflake or Google's BigQuery. So that's really where you should be focusing your time and money.