Large number of classes

Arwamus0 Dataiku DSS Core Designer, Registered Posts: 13

I have a dataset of 244k records that I want to feed into an ML model in Dataiku to predict one text column from another.

A warning says that a large number of classes has been detected and that training may fail or performance will be poor. I have 286 unique values.

How can we overcome this? Some of the algorithms are also failing, with an error message saying that the process died (exit code: 137, killed - maybe out of memory?).

What should be done to get sensible results?

Thanks in advance

Operating system used: Windows


  • LouisDHulst
    LouisDHulst Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Registered, Neuron 2023 Posts: 44 Neuron

    Hi @Arwamus0

    Could you expand a bit more on your use case? What kind of text are you trying to predict? Is it just some simple labels, like "Red", "Blue", "Green", or is it something more complicated?

    Also, what kind of text are you training on, and what kind of pre-processing are you applying to the feature? You might be generating a ton of columns if you are applying the Count vectorization pre-processor, which can massively increase the size of your dataset and cause the out of memory error.
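To make the memory point above concrete, here is a minimal sketch in plain Python (not DSS's own implementation) of why count vectorization can blow up: every distinct token becomes one column. The sample descriptions, the `vocabulary` and `dense_matrix_gib` helpers, and the 20,000-token vocabulary size are all hypothetical; only the 244k row count comes from the question.

```python
from collections import Counter

def vocabulary(texts, max_features=None):
    """Tokens that would become columns under simple count vectorization.

    If max_features is given, keep only the most frequent tokens.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return [tok for tok, _ in counts.most_common(max_features)]

def dense_matrix_gib(n_rows, n_columns, bytes_per_value=8):
    """Rough size in GiB of a dense matrix of 8-byte values."""
    return n_rows * n_columns * bytes_per_value / 2**30

# Hypothetical description texts, for illustration only.
descriptions = ["steel bolt m8", "steel nut m8", "copper pipe 15mm"]

all_columns = vocabulary(descriptions)                     # one column per distinct token
capped_columns = vocabulary(descriptions, max_features=3)  # keep the 3 most frequent

# 244k rows (from the question) times an assumed 20,000-token vocabulary.
estimated_gib = dense_matrix_gib(244_000, 20_000)
print(len(all_columns), len(capped_columns), round(estimated_gib, 1))
```

Under these assumptions the dense matrix alone would need tens of GiB, which is consistent with a process being killed for memory. Capping the vocabulary or keeping the matrix sparse is the usual way to shrink it.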

  • Arwamus0
    Arwamus0 Dataiku DSS Core Designer, Registered Posts: 13

    Hi @LouisDHulst

    My target text column consists of codes with a space between every two digits, in the form ## ## ##.

    I'm predicting it from description keywords, so the feature is a text column. I normalized it and removed stopwords.

    How can I optimize this? It is telling me that we can only classify 50 classes.
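One possible way to fit 286 codes under a 50-class limit like the one mentioned above is to shrink the target before training. The sketch below is plain Python, not a DSS feature, and shows two options: folding rare codes into a single catch-all class, or predicting only the leading two-digit group. The helper names, the `OTHER` label, and the sample codes are hypothetical, and the prefix option assumes the codes are hierarchical, which the thread does not confirm.

```python
from collections import Counter

def cap_classes(labels, max_classes=50, other="OTHER"):
    """Keep the (max_classes - 1) most frequent labels; map the rest to `other`."""
    keep = {label for label, _ in Counter(labels).most_common(max_classes - 1)}
    return [label if label in keep else other for label in labels]

def code_prefix(code, n_pairs=1):
    """First n_pairs two-digit groups of a code such as '12 34 56'.

    Only meaningful if the leading digits form a coarser category (assumption).
    """
    return " ".join(code.split()[:n_pairs])

# Hypothetical target codes, for illustration only.
codes = ["12 34 56"] * 3 + ["12 34 57"] * 2 + ["98 76 54"]

capped = cap_classes(codes, max_classes=2)   # the top code plus a catch-all
coarse = [code_prefix(c) for c in codes]     # leading group only
print(sorted(set(capped)), sorted(set(coarse)))
```

The first option keeps the full code for the common classes but loses the rare ones; the second keeps every record in a meaningful class but predicts a coarser label. Which trade-off is acceptable depends on how the predictions will be used.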
