AGE as dummy variable
Hello,
What is best way to convert AGE as dummy variable. As this sample.
AGE | AGE_0_18 | AGE_19_25 | AGE_26_50 | AGE_50+ |
10 | 1 | 0 | 0 | 0 |
20 | 0 | 1 | 0 | 0 |
32 | 0 | 0 | 1 | 0 |
45 | 0 | 0 | 1 | 0 |
65 | 0 | 0 | 0 | 1 |
Answers
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
Welcome to the Dataiku Community.
In your example above, I'm not exactly clear what your starting data looks like. And I'm not clear where you are trying to use your dummied values.
For now I'm going to assume that you have a single column AGE and it is some type of number INT or decimal to Start. And I'm going to assume that you have determined the reason you want to use this specific grouping.
What I'd likely do is:
1. Use a Visual Prepare recipe. In the visual recipe I'd use a formula step to create the categorical variables called something like Age_Range. In that Column I'd want to get the correct value for each of the values in the AGE column. ("AGE_0_18", "AGE_19_25", "AGE_26_50", and "AGE_50+"). I'd likely use the formula functions " if ", " and ", and " coalesce " in the formula recipe. The idea is to have a second new column called Age_Range, with the appropriate group description for each value in AGE.
AGE Age_Range 10 AGE_0_18 20 AGE_19_25 32 AGE_26_50 45 AGE_26_50 65 AGE_50+ If I was going to bring that data directly into the Dataiku Visual ML tools. I'd likely not unfold into multiple columns. I'd likely let the visual ML recipe do the unfolding for me automatically in the ML model building process. This is set under Feature Handling when you are working with Categorical Variables. I'd do this because my reporting is likely to be better than if I'd pre-dummiefied my data. It also saves me a step and database size.
If I needed this layout for some reason.
2. I'd add a second visual recipe step to "unfold" the categorical column "Age_Range" that we created in Step 1 above.
3. Then I'd delete the "Age_Range" column to get to your desired end state.
One additional hint that tripped me up when I started using the formula steps in Dataiku DSS. The formulas are case sensitive. " IF ", " If ", " iF " will not work. Only " if " (all lower case) will work. (This is unlike how MS Excel work for example.) So case matters. Here is a formula that I recently wrote that is similar.
coalesce(
if( or(track == 1, track == 2), 'Platform 1-2','')
, if( or(track == 3, track == 4), 'Platform 3-4','')
, if( or(track == 5, track == 6), 'Platform 5-6','')
, if( or(track == 7, track == 8), 'Platform 7-8','')
, if( or(track == 9, track == 10), 'Platform 9-10','')
, if( or(track == 11, track == 12), 'Platform 11-12','')
, if( or(track == 13, track == 14), 'Platform 13-14','')
, if( or(track == 15, track == 16), 'Platform 15-16','')
, if( or(track == 17), 'Platform 17','')
, if( or(track == 18, track == 19), 'Platform 18-19','')
, if( or(track == 20, track == 21), 'Platform 20-21','')
)In your case you might try something more like this.
coalesce(
if( and(AGE >= 0, AGE < 19), 'AGE_0_18','')
, if( and(AGE >= 19, AGE < 26),'AGE_19_26','')
, if( and(AGE >= 26, AGE < 50),'AGE_26_50','')
, if( AGE > 50,'AGE_50+','')
)Have fun with your project. Let us know how you are getting along with the project.
-
Thanks Tom for a detailed explanation
-
CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭
In this case age is really only a number