Field grouping

simon_aubert · November 2024

Hello,

A lot of time, when you have a dataset, you want to know if there is a group of fields that works together. That can help to normalize (like de-joining) your data model for dataviz, performance issue or simplify your analysis.

Exemple

order_id	item_id	label	model_id	length	color	amount
1	1	A	10	15	Blue	101
2	1	A	10	15	Blue	101
3	2	B	10	15	Blue	101
4	2	B	10	15	Blue	101
5	2	B	10	15	Blue	101
6	3	C	20	25	Red	101
7	3	C	20	25	Red	101
8	3	C	20	25	Red	101
9	4	D	20	25	Red	101
10	4	D	20	25	Red	101
11	4	D	20	25	Red	101

Here, we could split the table in three :
-order

order_id	item_id	model_id	amount
1	1	10	101,2
2	1	10	103
3	2	10	104,8
4	2	10	106,6
5	2	10	108,4
6	3	20	110,2
7	3	20	112
8	3	20	113,8
9	4	20	115,6
10	4	20	117,4
11	4	20	119,2

-model

model_id	length	color
10	15	Blue
20	25	Red

-item

item_id	label
1	A
2	B
3	C
4	D

The tool would take :
-a dataframe in entry
-configuration : ability to select fields.
-output : a table with the recap of groups

field group	field	remaining fields
1	item_id	False
1	label	False
2	model_id	False
2	color	False
3	order_id	True
3	link to group 1	True
3	link to group 2	True
3	amount	True

Very important : the non-selected fields (like here, amount), are in the result but all in the "remaining" group.

Algo steps:
1/pre-groups : count distinct of each fields. goal : optimization of algo, to avoid to calculate all pairs
fields that has the same count distinct than the number of rows are automatically excluded and sent to the remaining group
fields that have have the same count distinct are set in the same pre-group

2/ for each group, for each pair of fields,
let's do a distinct of value of the pair
like here

item_id	label
1	A
2	B
3	C
4	D

if in this table, the count distinct of each field is equal to the number of rows, it's a "pair-group"

here, for the model, you will have
-model_id,length
-model_id,color
-length,color

3/Since a field can only belong to one group, it means model_id,length,color which would first (or second) group, then item_id and label

If a field does not belong to a group, he goes to "remaining group" at the end

in the remaining group, you can add a link to the other group since you don't know which field is the key.

field group	field	remaining fields
1	item_id	False
1	label	False
2	model_id	False
2	length	False
2	color	False
3	order_id	True
3	link to group 1	True
3	link to group 2	True
3	amount	True

Best regards,

Simon

PS : I have in mind an evolution with links between non-remaining table (like here, the model could be linked to the item as an option)

Field grouping

New · Last Updated November 2024

Categories

Setup Info

Tags