Dummy/One-Hot Encode an Array/Set of Columns?
In my data I have two different structures that I'd basically like to treat the same way. In the first, I have a column containing array data, like:
ColumnA |
[A,B] |
[A] |
[B,C] |
I want to dummy encode these to make something like:
ColumnA_A | ColumnA_B | ColumnA_C |
1 | 1 | 0 |
1 | 0 | 0 |
0 | 1 | 1 |
And then in another case I have a set of columns like:
ColumnA | ColumnB | ColumnC |
A | | |
A | B | |
B | | |
B | C | D |
Ideally, I'd like to merge these together to make:
A | B | C | D |
1 | 0 | 0 | 0 |
1 | 1 | 0 | 0 |
0 | 1 | 0 | 0 |
0 | 1 | 1 | 1 |
The two scenarios are, I think, effectively the same (I could convert from one to the other easily enough), but I'm not sure of the best way to handle them. While I could write some Python code to basically do this, in an ideal world I wouldn't be adding a few hundred extra columns to my data. I also rather like that the models show ColumnA as, say, 5% important; sure, I could break it up and see that ColumnA_A is probably 0.5%, but I'd rather not spread the importance that thin if I don't have to.
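For reference, this is roughly the kind of Python I'd otherwise write for both cases. It's just a pandas sketch on the toy columns from my examples above, not what I'd actually run on the real data:

```python
import pandas as pd

# Case 1: a single column holding arrays like [A, B]
df1 = pd.DataFrame({"ColumnA": [["A", "B"], ["A"], ["B", "C"]]})
dummies1 = (
    df1["ColumnA"]
    .explode()               # one row per array element, original index preserved
    .str.get_dummies()       # 0/1 indicator column per distinct value
    .groupby(level=0).max()  # collapse back to one row per original record
    .add_prefix("ColumnA_")  # ColumnA_A, ColumnA_B, ColumnA_C
)

# Case 2: values spread across several columns
df2 = pd.DataFrame({
    "ColumnA": ["A", "A", "B", "B"],
    "ColumnB": [None, "B", None, "C"],
    "ColumnC": [None, None, None, "D"],
})
dummies2 = (
    df2.stack()              # long format: one row per non-empty cell
    .str.get_dummies()       # indicator column per distinct value (A, B, C, D)
    .groupby(level=0).max()  # back to one row per original record
)
```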
Any suggestions on how to handle this?
Answers
-
Hi,
What you want to do seems achievable via a Prepare recipe, using the Unfold array processor.
-
I think this will do exactly what I want, thank you. One question: do you know how it chooses the columns for "Max nb. columns to create"? Based on the "Behavior when max is reached" options, it seems like it takes the first n values and then drops/clips/warns on the rest. Ideally I'd keep the n most frequent, but I expect I'll need to do some preprocessing for that?
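For that preprocessing, I'm picturing something along these lines (again just a pandas sketch on my ColumnA example; N is a placeholder for however many columns I'd actually allow):

```python
import pandas as pd

N = 50  # placeholder: however many columns I'd actually want to keep

df = pd.DataFrame({"ColumnA": [["A", "B"], ["A"], ["B", "C"]]})

# Count how often each value occurs across all rows of the array column
counts = df["ColumnA"].explode().value_counts()
top_values = set(counts.head(N).index)

# Drop everything outside the top N so the unfold only ever sees frequent values
df["ColumnA"] = df["ColumnA"].apply(
    lambda values: [v for v in values if v in top_values]
)
```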