Join us at the Everyday AI Conference in London, New York & Bengaluru! REGISTER NOW

AGE as dummy variable

Aslan
Level 2
AGE as dummy variable

Hello,

What is best way to convert AGE as dummy variable. As this sample.

AGEAGE_0_18AGE_19_25AGE_26_50AGE_50+
101000
200100
320010
450010
650001
0 Kudos
3 Replies
tgb417
Neuron
Neuron

@Aslan 

Welcome to the Dataiku Community.

In your example above, I'm not exactly clear what your starting data looks like.  And I'm not clear where you are trying to use your dummied values.

For now I'm going to assume that you have a single column AGE and it is some type of number INT or decimal to Start.  And I'm going to assume that you have determined the reason you want to use this specific grouping.

What I'd likely do is:

1. Use a Visual Prepare recipe.  In the visual recipe I'd use a formula step to create the categorical variables called something like Age_Range. In that Column I'd want to get the correct value for each of the values in the AGE column.  ("AGE_0_18", "AGE_19_25", "AGE_26_50", and "AGE_50+").  I'd likely use the formula functions " if ", " and ", and " coalesce " in the formula recipe.  The idea is to have a second new column called Age_Range, with the appropriate group description for each value in AGE.

AGEAge_Range
10AGE_0_18
20AGE_19_25
32AGE_26_50
45AGE_26_50
65AGE_50+

 

If I was going to bring that data directly into the Dataiku Visual ML tools.  I'd likely not unfold into multiple columns.  I'd likely let the visual ML recipe do the unfolding for me automatically in the ML model building process.  This is set under Feature Handling when you are working with Categorical Variables.   I'd do this because my reporting is likely to be better than if I'd pre-dummiefied my data.  It also saves me a step and database size.

If I needed this layout for some reason.

2. I'd add a second visual recipe step to "unfold" the categorical column "Age_Range" that we created in Step 1 above.

3. Then I'd delete the "Age_Range" column to get to your desired end state.

One additional hint that tripped me up when I started using the formula steps in Dataiku DSS.  The formulas are case sensitive.  " IF ", " If ",  " iF " will not work.  Only " if " (all lower case) will work.  (This is unlike how MS Excel work for example.)  So case matters.  Here is a formula that I recently wrote that is similar.

coalesce(
if( or(track == 1, track == 2), 'Platform 1-2','')
, if( or(track == 3, track == 4), 'Platform 3-4','')
, if( or(track == 5, track == 6), 'Platform 5-6','')
, if( or(track == 7, track == 8), 'Platform 7-8','')
, if( or(track == 9, track == 10), 'Platform 9-10','')
, if( or(track == 11, track == 12), 'Platform 11-12','')
, if( or(track == 13, track == 14), 'Platform 13-14','')
, if( or(track == 15, track == 16), 'Platform 15-16','')
, if( or(track == 17), 'Platform 17','')
, if( or(track == 18, track == 19), 'Platform 18-19','')
, if( or(track == 20, track == 21), 'Platform 20-21','')

)

 In your case you might try something more like this.

coalesce(
if( and(AGE >= 0, AGE < 19), 'AGE_0_18','')
, if( and(AGE >= 19, AGE < 26),'AGE_19_26','')
, if( and(AGE >= 26, AGE < 50),'AGE_26_50','')
, if( AGE > 50,'AGE_50+','')

)

 Have fun with your project.  Let us know how you are getting along with the project.

--Tom
Aslan
Level 2
Author

Thanks Tom for a detailed explanation

CoreyS
Community Manager
Community Manager

In this case age is really only a number 🤣

 

Looking for more resources to help you use Dataiku effectively and upskill your knowledge? Check out these great resources: Dataiku Academy | Documentation | Knowledge Base

A reply answered your question? Mark as ‘Accepted Solution’ to help others like you!