Feature handling for many unique variables

ben_p
ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

Hi everyone,

I have a feature in a model I am working on which has many unique values. The DSS AutoML solution initially toggled the feature off for this reason, but I am keen to see whether it could be a useful feature, since its content is interesting from a business perspective.

Capture.PNG

What I would like to do is categorise only the most common values in the list. I looked in Feature Handling and came across the clipping options. My questions are:

  • Max nb. categories - will this keep the top x categories which appear most frequently, or just the first x identified?
  • Cumulative proportion - could I use this to retain values that appear in at least 10% of the data? I am not totally clear whether this would give me the most frequent values, or whether it is meant for other use cases.
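To make the question concrete, here is a minimal Python sketch of the three behaviours I am trying to tell apart. It is my own illustration and not how DSS implements clipping: the function names, the `__other__` label, and the sample values (apart from "Jackets") are all assumptions.

```python
from collections import Counter


def top_n_categories(values, n):
    """Keep the n values that occur most often (hypothetical 'max nb. categories')."""
    return [value for value, _ in Counter(values).most_common(n)]


def cumulative_categories(values, proportion):
    """Keep the most frequent values until together they cover `proportion` of the rows."""
    counts = Counter(values).most_common()  # sorted by descending frequency
    total = sum(count for _, count in counts)
    kept, covered = [], 0
    for value, count in counts:
        if covered >= proportion * total:
            break
        kept.append(value)
        covered += count
    return kept


def min_share_categories(values, min_share):
    """Keep every value whose own share of the rows is at least `min_share`."""
    counts = Counter(values)
    total = sum(counts.values())
    return [value for value, count in counts.most_common() if count >= min_share * total]


def clip(values, kept, other="__other__"):
    """Replace every value that is not in `kept` with a single catch-all label."""
    kept = set(kept)
    return [value if value in kept else other for value in values]


# Made-up sample: 10 rows, 4 distinct values.
sample = ["Jackets"] * 5 + ["Shoes"] * 3 + ["Hats"] * 1 + ["Belts"] * 1

top2 = top_n_categories(sample, 2)
cum90 = cumulative_categories(sample, 0.9)
share20 = min_share_categories(sample, 0.2)
clipped = clip(sample, top2)
```

On this sample, a cumulative threshold of 90% keeps three values, while a per-value minimum share of 20% keeps only two, so the two rules are not interchangeable. What I am asking is which of these, if any, matches what the DSS options actually do.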

Thanks for your help,

Ben

Answers

  • ben_p
    ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

    As a follow-up question to the above, I tried some experiments with the settings and got very different results.

    For example, when using the following settings:

    1.PNG

    I would expect to end up with only 5 categories here, but when I view a partial dependence plot in the model results I see:

    2.PNG

    Why am I seeing many more than 5 categories here?

    I also tested the cumulative proportions setting, to see what this showed in the results:

    3.PNG

    In the partial dependence plot I again see more categories than I was expecting, but the results are also very different:

    4.PNG

    In this example we see "Jackets" scored negatively with the first settings, but it is positive here.

    How can such an apparently small change have such an impact on the results?
