Smart indexing: recommend index based on downstream recipes

natejgardner
natejgardner Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 151 Neuron

It would be helpful if on the index selection menu for a dataset, some smart values could be displayed based on downstream recipes, and if in the recipe creation views, upstream datasets could be reindexed to optimize them as well.

For example, after I've created two join recipes downstream of a dataset, on the index screen in dataset settings, I'd like join columns from each recipe to be highlighted, suggesting index values based on which joins would be optimized. Same goes for filters, windows, groups, etc. If combined with the , this could also help identify which areas need the performance boost the most.

This could also be used to recommend secondary indexes (including value-ordered NUSIs for inequality joins and filters).

Similarly, if I'm creating a recipe, I'd like a button to automatically optimize the indexes of some or all upstream datasets for my recipe. Or in the case of a join recipe, for a specific join.

1
1 votes

New · Last Updated

Comments

  • Jason
    Jason Registered Posts: 33 ✭✭✭✭✭
    edited September 26

    I came here to post this. In a way, I'm disappointed this was requested so long ago and nobody has given it much love. I'm on MSSQL, and in a few cases I've had to create indexes manually in the pre- and post- build scripts section. It's tedious to do, but sometimes is essential to making a downstream operation efficient. Please add automagic index creation/use.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron

    It takes years for someone to learn enough of a database technology to master the art of balancing creating indexes for performance and keeping inserts/updates fast and storage costs under control. Indexes are also not a magic solution and often need to be changed due to query pattern changes, rebuilt if they get corrupted, stats recalculated or simply dropped if they don't get used enough. None of this is trivial to automate because if it was the database vendors would have done it themselves and they clearly have not.

    Expecting that Dataiku does this for all the legacy database technologies they support (or the major ones) in an "automagic" way is really unrealistic and utopian. Even if it was possible, which it is not, the cost would certainly not justify the return on investment.

    The final point to make is that the market is moving away from these legacy database technologies into newer technologies which do not use indexes for performance like Databricks, Snowflake or Google's BigQuery. So that's really where you should be focusing your time and money.

Setup Info
    Tags
      Help me…