The Dataiku Frontrunner Awards have launched to recognize your achievements! SUBMIT YOUR ENTRY

Index cardinality UI

0 Kudos

When selecting an index for a dataset, it would be helpful if the UI could display the cardinality of the selected field or combination of fields used for indexing. This would help in determining how selective an index is. Even better, displaying the histograms generated by the analyze feature on the explore tab (but configured to treat the values categorically and displaying all values rather than just top) would allow both skew and cardinality to be examined for a given index selection. This would allow me to make sure I've selected a good index before building a dataset.

In Teradata, datasets are partitioned by their primary index across a number of AMPs, which are essentially distributed database nodes. Each AMP will contain records for its assigned index values. A user's memory and a database's allocated storage are divided equally across all AMPs, making primary index selectivity and skew very important. In my Teradata configuration, there are more than 500 AMPs. For large datasets, if I've selected a primary index poorly, I might wait several hours for a table to be fully loaded only for one AMP to run out of storage space, or spend a long time processing a query only for one AMP to run out of memory. The ideal distribution of data for Teradata stability is uniform, with an integer multiple of the number of AMPs in a system as the number of index values. Understanding how closely a selected primary index fits that optimal distribution is important before building a dataset. It's also helpful in understanding how well joins and filters will work against that dataset later on. Thus this would be a big time saver for Teradata users, though I imagine the analysis would be useful for understanding an index against any database system.