AGE as dummy variable

Aslan
Aslan Dataiku DSS Core Designer, Registered Posts: 6 ✭✭✭

Hello,

What is best way to convert AGE as dummy variable. As this sample.

AGEAGE_0_18AGE_19_25AGE_26_50AGE_50+
101000
200100
320010
450010
650001

Answers

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
    edited July 17

    @Aslan

    Welcome to the Dataiku Community.

    In your example above, I'm not exactly clear what your starting data looks like. And I'm not clear where you are trying to use your dummied values.

    For now I'm going to assume that you have a single column AGE and it is some type of number INT or decimal to Start. And I'm going to assume that you have determined the reason you want to use this specific grouping.

    What I'd likely do is:

    1. Use a Visual Prepare recipe. In the visual recipe I'd use a formula step to create the categorical variables called something like Age_Range. In that Column I'd want to get the correct value for each of the values in the AGE column. ("AGE_0_18", "AGE_19_25", "AGE_26_50", and "AGE_50+"). I'd likely use the formula functions " if ", " and ", and " coalesce " in the formula recipe. The idea is to have a second new column called Age_Range, with the appropriate group description for each value in AGE.

    AGEAge_Range
    10AGE_0_18
    20AGE_19_25
    32AGE_26_50
    45AGE_26_50
    65AGE_50+

    If I was going to bring that data directly into the Dataiku Visual ML tools. I'd likely not unfold into multiple columns. I'd likely let the visual ML recipe do the unfolding for me automatically in the ML model building process. This is set under Feature Handling when you are working with Categorical Variables. I'd do this because my reporting is likely to be better than if I'd pre-dummiefied my data. It also saves me a step and database size.

    If I needed this layout for some reason.

    2. I'd add a second visual recipe step to "unfold" the categorical column "Age_Range" that we created in Step 1 above.

    3. Then I'd delete the "Age_Range" column to get to your desired end state.

    One additional hint that tripped me up when I started using the formula steps in Dataiku DSS. The formulas are case sensitive. " IF ", " If ", " iF " will not work. Only " if " (all lower case) will work. (This is unlike how MS Excel work for example.) So case matters. Here is a formula that I recently wrote that is similar.

    coalesce(
    if( or(track == 1, track == 2), 'Platform 1-2','')
    , if( or(track == 3, track == 4), 'Platform 3-4','')
    , if( or(track == 5, track == 6), 'Platform 5-6','')
    , if( or(track == 7, track == 8), 'Platform 7-8','')
    , if( or(track == 9, track == 10), 'Platform 9-10','')
    , if( or(track == 11, track == 12), 'Platform 11-12','')
    , if( or(track == 13, track == 14), 'Platform 13-14','')
    , if( or(track == 15, track == 16), 'Platform 15-16','')
    , if( or(track == 17), 'Platform 17','')
    , if( or(track == 18, track == 19), 'Platform 18-19','')
    , if( or(track == 20, track == 21), 'Platform 20-21','')

    )

    In your case you might try something more like this.

    coalesce(
    if( and(AGE >= 0, AGE < 19), 'AGE_0_18','')
    , if( and(AGE >= 19, AGE < 26),'AGE_19_26','')
    , if( and(AGE >= 26, AGE < 50),'AGE_26_50','')
    , if( AGE > 50,'AGE_50+','')

    )

    Have fun with your project. Let us know how you are getting along with the project.

  • Aslan
    Aslan Dataiku DSS Core Designer, Registered Posts: 6 ✭✭✭

    Thanks Tom for a detailed explanation

  • CoreyS
    CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    In this case age is really only a number

Setup Info
    Tags
      Help me…