Unique key detector tool
What's your use case?
More than often, you have to deal with a dataset without knowing what's make a row unique. This can lead to misinterpret the data, cartesian product at join and other funny stuff.
What's your proposed solution?
This is a feature I haven't seen in any data prepation/etl. The core feature is to detect the unique key in a dataframe.
How do I imagine that ?
Entry; one dataframe, ability to select fields or check all, ability to specify a max number of field for combination (empty or 0=no max).
Algo : it tests the count distinct every combination of field versus the count of rows
Result : one row by field combination that works. If no result : "no field combination is unique. check for duplicate or need for aggregation upstream".
ex :
order_id | line_id | amount | customer | site |
---|---|---|---|---|
1 | 1 | 100 | A | U_250 |
1 | 2 | 12 | A | U_250 |
1 | 3 | 45 | A | U_250 |
2 | 1 | 75 | A | U_250 |
2 | 2 | 12 | A | U_250 |
3 | 1 | 15 | B | U_250 |
4 | 1 | 45 | B | U_251 |
The user will select every field but excluding Amount (he knows that Amount would have no sense in key)
The algo will test the following key
-each separate field
-each combination of two fields
-each combination of three fields
-each combination of four fields
to match the number of row (7)
And gives something like that
choice | number of fields | field combination |
---|---|---|
very good | 2 | order_id,line_id |
average | 3 | order_id,line_id, customer |
average | 3 | order_id,line_id, site |
bad | 4 | order_id,line_id, site, customer |
… | … | …. |
Are there any alternative solutions?
N/A
Best regards,
Simon
Comments
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,088 Neuron
This does indeed look like cool idea but it will certainly be very compute intensive. Dataiku does a bit of this when you "Analyze" a column in the Explore view but this is done on a column by column basis which is not enough to determine if the data does have a potential unique key. You can also add a distinct value count for every column in the dataset Metrics ⇒ Columns statistics but again this would be shown per individual column so composite keys would not be found.
I think your idea could be implemented as a plugin but I agree that the best option for this would be as a new GUI option given that the user should probably guide the key search with some parameters to reduce the potential computation required to get the output.
-
Hello @Turribeach
Yes, it will be compute intensive but if the user can select only columns that may be in the key and not all the columns, it will be more efficient. Moreover, this is not something you run everytime, it's more a one shot during data investigation.
Thanks for your interest.
Best regards,
Simon