Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Hello,
What is best way to convert AGE as dummy variable. As this sample.
AGE | AGE_0_18 | AGE_19_25 | AGE_26_50 | AGE_50+ |
10 | 1 | 0 | 0 | 0 |
20 | 0 | 1 | 0 | 0 |
32 | 0 | 0 | 1 | 0 |
45 | 0 | 0 | 1 | 0 |
65 | 0 | 0 | 0 | 1 |
Welcome to the Dataiku Community.
In your example above, I'm not exactly clear what your starting data looks like. And I'm not clear where you are trying to use your dummied values.
For now I'm going to assume that you have a single column AGE and it is some type of number INT or decimal to Start. And I'm going to assume that you have determined the reason you want to use this specific grouping.
What I'd likely do is:
1. Use a Visual Prepare recipe. In the visual recipe I'd use a formula step to create the categorical variables called something like Age_Range. In that Column I'd want to get the correct value for each of the values in the AGE column. ("AGE_0_18", "AGE_19_25", "AGE_26_50", and "AGE_50+"). I'd likely use the formula functions " if ", " and ", and " coalesce " in the formula recipe. The idea is to have a second new column called Age_Range, with the appropriate group description for each value in AGE.
AGE | Age_Range |
10 | AGE_0_18 |
20 | AGE_19_25 |
32 | AGE_26_50 |
45 | AGE_26_50 |
65 | AGE_50+ |
If I was going to bring that data directly into the Dataiku Visual ML tools. I'd likely not unfold into multiple columns. I'd likely let the visual ML recipe do the unfolding for me automatically in the ML model building process. This is set under Feature Handling when you are working with Categorical Variables. I'd do this because my reporting is likely to be better than if I'd pre-dummiefied my data. It also saves me a step and database size.
If I needed this layout for some reason.
2. I'd add a second visual recipe step to "unfold" the categorical column "Age_Range" that we created in Step 1 above.
3. Then I'd delete the "Age_Range" column to get to your desired end state.
One additional hint that tripped me up when I started using the formula steps in Dataiku DSS. The formulas are case sensitive. " IF ", " If ", " iF " will not work. Only " if " (all lower case) will work. (This is unlike how MS Excel work for example.) So case matters. Here is a formula that I recently wrote that is similar.
coalesce(
if( or(track == 1, track == 2), 'Platform 1-2','')
, if( or(track == 3, track == 4), 'Platform 3-4','')
, if( or(track == 5, track == 6), 'Platform 5-6','')
, if( or(track == 7, track == 8), 'Platform 7-8','')
, if( or(track == 9, track == 10), 'Platform 9-10','')
, if( or(track == 11, track == 12), 'Platform 11-12','')
, if( or(track == 13, track == 14), 'Platform 13-14','')
, if( or(track == 15, track == 16), 'Platform 15-16','')
, if( or(track == 17), 'Platform 17','')
, if( or(track == 18, track == 19), 'Platform 18-19','')
, if( or(track == 20, track == 21), 'Platform 20-21','')
)
In your case you might try something more like this.
coalesce(
if( and(AGE >= 0, AGE < 19), 'AGE_0_18','')
, if( and(AGE >= 19, AGE < 26),'AGE_19_26','')
, if( and(AGE >= 26, AGE < 50),'AGE_26_50','')
, if( AGE > 50,'AGE_50+','')
)
Have fun with your project. Let us know how you are getting along with the project.
Thanks Tom for a detailed explanation
In this case age is really only a number ๐คฃ