Field grouping

simon_aubert
simon_aubert Dataiku DSS Core Designer, Registered Posts: 15 ✭✭✭

Hello,

A lot of time, when you have a dataset, you want to know if there is a group of fields that works together. That can help to normalize (like de-joining) your data model for dataviz, performance issue or simplify your analysis.

Exemple

<style></style>

order_id

item_id

label

model_id

length

color

amount

1

1

A

10

15

Blue

101

2

1

A

10

15

Blue

101

3

2

B

10

15

Blue

101

4

2

B

10

15

Blue

101

5

2

B

10

15

Blue

101

6

3

C

20

25

Red

101

7

3

C

20

25

Red

101

8

3

C

20

25

Red

101

9

4

D

20

25

Red

101

10

4

D

20

25

Red

101

11

4

D

20

25

Red

101

Here, we could split the table in three :
-order

<style></style>

order_id

item_id

model_id

amount

1

1

10

101,2

2

1

10

103

3

2

10

104,8

4

2

10

106,6

5

2

10

108,4

6

3

20

110,2

7

3

20

112

8

3

20

113,8

9

4

20

115,6

10

4

20

117,4

11

4

20

119,2

-model

<style></style>

model_id

length

color

10

15

Blue

20

25

Red

-item

<style></style>

item_id

label

1

A

2

B

3

C

4

D

The tool would take :
-a dataframe in entry
-configuration : ability to select fields.
-output : a table with the recap of groups

<style></style>

field group

field

remaining fields

1

item_id

False

1

label

False

2

model_id

False

2

color

False

3

order_id

True

3

link to group 1

True

3

link to group 2

True

3

amount

True

Very important : the non-selected fields (like here, amount), are in the result but all in the "remaining" group.

Algo steps:
1/pre-groups : count distinct of each fields. goal : optimization of algo, to avoid to calculate all pairs
fields that has the same count distinct than the number of rows are automatically excluded and sent to the remaining group
fields that have have the same count distinct are set in the same pre-group

2/ for each group, for each pair of fields,
let's do a distinct of value of the pair
like here

<style></style>

item_id

label

1

A

2

B

3

C

4

D

if in this table, the count distinct of each field is equal to the number of rows, it's a "pair-group"

here, for the model, you will have
-model_id,length
-model_id,color
-length,color

3/Since a field can only belong to one group, it means model_id,length,color which would first (or second) group, then item_id and label

If a field does not belong to a group, he goes to "remaining group" at the end

in the remaining group, you can add a link to the other group since you don't know which field is the key.

<style></style>

field group

field

remaining fields

1

item_id

False

1

label

False

2

model_id

False

2

length

False

2

color

False

3

order_id

True

3

link to group 1

True

3

link to group 2

True

3

amount

True

Best regards,

Simon

PS : I have in mind an evolution with links between non-remaining table (like here, the model could be linked to the item as an option)

1
1 votes

New · Last Updated

Setup Info
    Tags
      Help me…