Unique key detector tool

simon_aubert · September 21

What's your use case?
More than often, you have to deal with a dataset without knowing what's make a row unique. This can lead to misinterpret the data, cartesian product at join and other funny stuff.

What's your proposed solution?
This is a feature I haven't seen in any data prepation/etl. The core feature is to detect the unique key in a dataframe.

How do I imagine that ?

Entry; one dataframe, ability to select fields or check all, ability to specify a max number of field for combination (empty or 0=no max).
Algo : it tests the count distinct every combination of field versus the count of rows

Result : one row by field combination that works. If no result : "no field combination is unique. check for duplicate or need for aggregation upstream".

ex :

order_id	line_id	amount	customer	site
1	1	100	A	U_250
1	2	12	A	U_250
1	3	45	A	U_250
2	1	75	A	U_250
2	2	12	A	U_250
3	1	15	B	U_250
4	1	45	B	U_251

The user will select every field but excluding Amount (he knows that Amount would have no sense in key)

The algo will test the following key
-each separate field
-each combination of two fields
-each combination of three fields
-each combination of four fields

to match the number of row (7)
And gives something like that

choice	number of fields	field combination
very good	2	order_id,line_id
average	3	order_id,line_id, customer
average	3	order_id,line_id, site
bad	4	order_id,line_id, site, customer
…	…	….

Are there any alternative solutions?
N/A

Best regards,

Simon

Turribeach · September 21

This does indeed look like cool idea but it will certainly be very compute intensive. Dataiku does a bit of this when you "Analyze" a column in the Explore view but this is done on a column by column basis which is not enough to determine if the data does have a potential unique key. You can also add a distinct value count for every column in the dataset Metrics ⇒ Columns statistics but again this would be shown per individual column so composite keys would not be found.

I think your idea could be implemented as a plugin but I agree that the best option for this would be as a new GUI option given that the user should probably guide the key search with some parameters to reduce the potential computation required to get the output.

simon_aubert · September 24

Hello @Turribeach

Yes, it will be compute intensive but if the user can select only columns that may be in the key and not all the columns, it will be more efficient. Moreover, this is not something you run everytime, it's more a one shot during data investigation.

Thanks for your interest.

Best regards,

Simon

Unique key detector tool

New · Last Updated September 21

Comments

Categories

Setup Info

Tags