Dummy/One-Hot Encode an Array/Set of Columns?

driscoll42
Level 2

In my data I have two different structures that I basically want to treat the same way. In one case I have a column containing array data, like:

ColumnA
[A,B]
[A]
[B,C]

I want to dummy encode these to make something like:

ColumnA_A  ColumnA_B  ColumnA_C
1          1          0
1          0          0
0          1          1
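(Just to illustrate, the Python I'd otherwise write for this case would be roughly the sketch below. It assumes ColumnA already holds Python lists and uses scikit-learn's MultiLabelBinarizer, which is just one possible approach; the column name and values are the ones from the example above.)

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy frame matching the example above; assumes ColumnA holds Python lists
df = pd.DataFrame({"ColumnA": [["A", "B"], ["A"], ["B", "C"]]})

# Turn each list into a row of 0/1 indicator columns
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df["ColumnA"]),
    columns=[f"ColumnA_{value}" for value in mlb.classes_],
    index=df.index,
)
print(encoded)
#    ColumnA_A  ColumnA_B  ColumnA_C
# 0          1          1          0
# 1          1          0          0
# 2          0          1          1
```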

 

And then in another case I have a set of columns like:

ColumnA  ColumnB  ColumnC
A
A        B
B
B        C        D

Ideally I'd like to merge these together to make:

A  B  C  D
1  0  0  0
1  1  0  0
0  1  0  0
0  1  1  1
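(For this multi-column case, a pandas-only sketch of what I mean would be something like the following, again with the toy data from the example above and empty cells treated as missing values.)

```python
import pandas as pd

# Toy frame matching the example above; empty cells are missing values
df = pd.DataFrame(
    {
        "ColumnA": ["A", "A", "B", "B"],
        "ColumnB": [None, "B", None, "C"],
        "ColumnC": [None, None, None, "D"],
    }
)

# Stack the columns into one long series of values (NaNs drop out),
# one-hot encode it, then take the max per original row
merged = pd.get_dummies(df.stack(), dtype=int).groupby(level=0).max()
print(merged)
#    A  B  C  D
# 0  1  0  0  0
# 1  1  1  0  0
# 2  0  1  0  0
# 3  0  1  1  1
```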

 

I think the two scenarios are effectively the same (I could convert one to the other easily enough), but I'm not sure of the best way to handle it. While I could write some Python code to do this (roughly like the sketches above), ideally I wouldn't add a few hundred extra columns to my data. I also like that the models show ColumnA as, say, 5% important; if I break it up, ColumnA_A would probably show as 0.5%, and I'd rather not spread the importance that thin if I don't have to.

 

Any suggestions on how to handle this?

AdrienL
Dataiker

Hi,

What you want to do seems achievable via a Prepare recipe, using the Unfold array processor.

driscoll42
Level 2
Author

I think this will do exactly what I want, thank you. One question: do you know how it chooses the columns for the "Max nb. columns to create" setting? Based on the "Behavior when max is reached" options, it seems like it takes the first n values and then drops/clips/warns on the rest. Ideally I'd keep the n most frequent values, but I expect I'll need to do some preprocessing for that?
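For reference, the kind of preprocessing I'm picturing is roughly the sketch below (just my own idea, not how the processor itself works; the helper name keep_top_n and the n=20 default are made up for illustration): count how often each value occurs, keep the top n, and bucket everything else into an "Other" value before unfolding.

```python
import pandas as pd

def keep_top_n(df, column, n=20, other="Other"):
    """Keep only the n most frequent values in an array column,
    replacing everything else with a single 'Other' bucket."""
    counts = df[column].explode().value_counts()
    top = set(counts.head(n).index)
    return df.assign(**{
        column: df[column].apply(
            lambda values: sorted({v if v in top else other for v in values})
        )
    })

df = pd.DataFrame({"ColumnA": [["A", "B"], ["A"], ["B", "C"]]})
print(keep_top_n(df, "ColumnA", n=2)["ColumnA"].tolist())
# [['A', 'B'], ['A'], ['B', 'Other']]
```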
