Unique key detector tool

simon_aubert
simon_aubert Dataiku DSS Core Designer, Registered Posts: 13 ✭✭✭

What's your use case?
More than often, you have to deal with a dataset without knowing what's make a row unique. This can lead to misinterpret the data, cartesian product at join and other funny stuff.

What's your proposed solution?
This is a feature I haven't seen in any data prepation/etl. The core feature is to detect the unique key in a dataframe.

How do I imagine that ?

Entry; one dataframe, ability to select fields or check all, ability to specify a max number of field for combination (empty or 0=no max).
Algo : it tests the count distinct every combination of field versus the count of rows

Result : one row by field combination that works. If no result : "no field combination is unique. check for duplicate or need for aggregation upstream".

ex :

order_id

line_id

amount

customer

site

1

1

100

A

U_250

1

2

12

A

U_250

1

3

45

A

U_250

2

1

75

A

U_250

2

2

12

A

U_250

3

1

15

B

U_250

4

1

45

B

U_251

The user will select every field but excluding Amount (he knows that Amount would have no sense in key)

The algo will test the following key
-each separate field
-each combination of two fields
-each combination of three fields
-each combination of four fields

to match the number of row (7)
And gives something like that

choice

number of fields

field combination

very good

2

order_id,line_id

average

3

order_id,line_id, customer

average

3

order_id,line_id, site

bad

4

order_id,line_id, site, customer

….

Are there any alternative solutions?
N/A

Best regards,

Simon

2
2 votes

New · Last Updated

Comments

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,946 Neuron

    This does indeed look like cool idea but it will certainly be very compute intensive. Dataiku does a bit of this when you "Analyze" a column in the Explore view but this is done on a column by column basis which is not enough to determine if the data does have a potential unique key. You can also add a distinct value count for every column in the dataset Metrics ⇒ Columns statistics but again this would be shown per individual column so composite keys would not be found.

    I think your idea could be implemented as a plugin but I agree that the best option for this would be as a new GUI option given that the user should probably guide the key search with some parameters to reduce the potential computation required to get the output.

  • simon_aubert
    simon_aubert Dataiku DSS Core Designer, Registered Posts: 13 ✭✭✭

    Hello @Turribeach

    Yes, it will be compute intensive but if the user can select only columns that may be in the key and not all the columns, it will be more efficient. Moreover, this is not something you run everytime, it's more a one shot during data investigation.

    Thanks for your interest.

    Best regards,

    Simon

Setup Info
    Tags
      Help me…