Dummy/One-Hot Encode an Array/Set of Columns?

driscoll42 Registered Posts: 6

In my data I have two different types of data that I basically want to treat the same way. In one I have a column with array data, like:

ColumnA
[A,B]
[A]

[B,C]

I want to dummy encode these to make something like:

ColumnA_A  ColumnA_B  ColumnA_C
1          1          0
1          0          0
0          1          1
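
For reference, a minimal pandas sketch of this encoding outside of any Dataiku tooling. It assumes the arrays are already parsed into Python lists (not stored as strings), and the DataFrame below is just the example data from above:

```python
import pandas as pd

# Example data from above, assumed to be parsed into Python lists already
df = pd.DataFrame({"ColumnA": [["A", "B"], ["A"], ["B", "C"]]})

# explode() yields one row per array element and repeats the original index
exploded = df["ColumnA"].explode()

# One-hot encode each element, then collapse back to one row per original index
dummies = (
    pd.get_dummies(exploded, prefix="ColumnA", dtype=int)
    .groupby(level=0)
    .max()
)

print(dummies)
```

An empty array would explode to a missing value and come out as a row of all zeros.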

And then in another case I have a set of columns like:

ColumnA  ColumnB  ColumnC
A
A        B
B
B        C        D

That ideally I'd like to merge together to make:

A  B  C  D
1  0  0  0
1  1  0  0
0  1  0  0
0  1  1  1
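
A rough pandas sketch of this merge, assuming the values are left-filled across ColumnA to ColumnC and empty cells are missing values (both are assumptions about the layout, not something I've confirmed against the real data):

```python
import pandas as pd

# Example data from above; None marks an empty cell (assumed layout)
df = pd.DataFrame({
    "ColumnA": ["A", "A", "B", "B"],
    "ColumnB": [None, "B", None, "C"],
    "ColumnC": [None, None, None, "D"],
})

# Reshape to long form: one row per (original row, value), dropping empty cells
long = (
    df.reset_index()
    .melt(id_vars="index", value_name="value")
    .dropna(subset=["value"])
)

# Count values per original row, cap at 1 so repeated values still give an indicator,
# and restore any rows that had no values at all
indicators = (
    pd.crosstab(long["index"], long["value"])
    .clip(upper=1)
    .reindex(df.index, fill_value=0)
)

print(indicators)
```

Which source column a value came from is discarded here, which matches the merged output above.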

I think the two scenarios are effectively the same (I could convert from one to the other easily enough), but I'm not sure of the best way to do this. I could write some Python code for it, but ideally I wouldn't add a few hundred extra columns to my data. I also like that the models show ColumnA as a whole having, say, 5% importance. If I break it up, ColumnA_A would probably show something like 0.5%, and I don't want to spread the importance that thinly if I don't have to.

Any suggestions on how to handle this?

Answers

  • AdrienL Dataiker, Alpha Tester Posts: 196

    Hi,

    What you want to do seems achievable via a Prepare recipe, using the Unfold array processor.

  • driscoll42 Registered Posts: 6

    I think this will do exactly what I want, thank you. One question: do you know how it chooses the columns for the "Max nb. columns to create"? Based on the "Behavior when max is reached" options, it seems like it chooses the first n values and then drops/clips/warns on the rest. Ideally I'd keep the n most frequent, but I expect I'd then need to do some preprocessing?
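
If the processor does behave as guessed above (not confirmed in this thread), one possible preprocessing step is to trim each array to the n most frequent values before unfolding. A minimal pandas sketch, where the data, the cap `n`, and the `ColumnA_top` column name are all made up for illustration:

```python
import pandas as pd

# Hypothetical data; the arrays are assumed to be Python lists
df = pd.DataFrame({"ColumnA": [["A", "B"], ["A"], [], ["B", "C"], ["A", "D"]]})

n = 2  # hypothetical cap on the number of values to keep

# Count how often each value occurs across all arrays (empty arrays are ignored)
counts = df["ColumnA"].explode().value_counts()

# Keep only the n most frequent values
top_values = set(counts.head(n).index)

# Filter each array down to the kept values, preserving element order
df["ColumnA_top"] = df["ColumnA"].apply(
    lambda vals: [v for v in vals if v in top_values]
)

print(df)
```

Values tied in frequency at the cutoff are kept or dropped according to the order `value_counts()` returns them, so an explicit tie-break may be worth adding if that matters.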
