Dummy/One-Hot Encode an Array/Set of Columns?

driscoll42
Level 2

In my data I have two different structures that I basically want to treat the same way. In one case I have a column containing array data, like:

ColumnA
[A,B]
[A]
[B,C]

I want to dummy encode these to make something like:

ColumnA_A  ColumnA_B  ColumnA_C
1          1          0
1          0          0
0          1          1
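(Just to illustrate, the Python I'd otherwise write for this case would be roughly the sketch below. It assumes ColumnA already holds Python lists and uses scikit-learn's MultiLabelBinarizer, which is just one possible approach; the column name and values are the ones from the example above.)

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy frame matching the example above; assumes ColumnA holds Python lists
df = pd.DataFrame({"ColumnA": [["A", "B"], ["A"], ["B", "C"]]})

# Turn each list into a row of 0/1 indicator columns
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df["ColumnA"]),
    columns=[f"ColumnA_{value}" for value in mlb.classes_],
    index=df.index,
)
print(encoded)
#    ColumnA_A  ColumnA_B  ColumnA_C
# 0          1          1          0
# 1          1          0          0
# 2          0          1          1
```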

 

And then in another case I have a set of columns like:

ColumnA  ColumnB  ColumnC
A
A        B
B
B        C        D

Ideally I'd like to merge these together to make:

A  B  C  D
1  0  0  0
1  1  0  0
0  1  0  0
0  1  1  1
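(For this multi-column case, a pandas-only sketch of what I mean would be something like the following, again with the toy data from the example above and empty cells treated as missing values.)

```python
import pandas as pd

# Toy frame matching the example above; empty cells are missing values
df = pd.DataFrame(
    {
        "ColumnA": ["A", "A", "B", "B"],
        "ColumnB": [None, "B", None, "C"],
        "ColumnC": [None, None, None, "D"],
    }
)

# Stack the columns into one long series of values (NaNs drop out),
# one-hot encode it, then take the max per original row
merged = pd.get_dummies(df.stack(), dtype=int).groupby(level=0).max()
print(merged)
#    A  B  C  D
# 0  1  0  0  0
# 1  1  1  0  0
# 2  0  1  0  0
# 3  0  1  1  1
```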

 

I think the two scenarios are effectively the same (I could convert one to the other easily enough), but I'm not sure of the best way to handle it. While I could write some Python code to do this (roughly like the sketches above), ideally I wouldn't add a few hundred extra columns to my data. I also like that the models show ColumnA as, say, 5% important; if I break it up, ColumnA_A would probably show as 0.5%, and I'd rather not spread the importance that thin if I don't have to.

 

Any suggestions on how to handle this?

AdrienL
Dataiker

Hi,

What you want to do seems achievable via a Prepare recipe, using the Unfold array processor.

driscoll42
Level 2
Author

I think this will do exactly what I want, thank you. One question: do you know how it chooses the columns for the "Max nb. columns to create" setting? Based on the "Behavior when max is reached" options, it seems like it takes the first n values and then drops/clips/warns on the rest. Ideally I'd keep the n most frequent values, but I expect I'll need to do some preprocessing for that?
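For reference, the kind of preprocessing I'm picturing is roughly the sketch below (just my own idea, not how the processor itself works; the helper name keep_top_n and the n=20 default are made up for illustration): count how often each value occurs, keep the top n, and bucket everything else into an "Other" value before unfolding.

```python
import pandas as pd

def keep_top_n(df, column, n=20, other="Other"):
    """Keep only the n most frequent values in an array column,
    replacing everything else with a single 'Other' bucket."""
    counts = df[column].explode().value_counts()
    top = set(counts.head(n).index)
    return df.assign(**{
        column: df[column].apply(
            lambda values: sorted({v if v in top else other for v in values})
        )
    })

df = pd.DataFrame({"ColumnA": [["A", "B"], ["A"], ["B", "C"]]})
print(keep_top_n(df, "ColumnA", n=2)["ColumnA"].tolist())
# [['A', 'B'], ['A'], ['B', 'Other']]
```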
