Large number of classes

Arwamus0
Level 2

I have a dataset of 244k records that I want to feed into an ML model in Dataiku to predict one text column from another.

A warning says that a large number of classes has been detected and that training may fail or performance will be poor. My target has 286 unique values.

How can we overcome this? Also, some of the algorithms are failing with an error message saying that the process died (exit code: 137, killed - maybe out of memory?).

What should be done to get sensible results?

Thanks in advance


Operating system used: Windows

2 Replies
LouisDHulst

Hi @Arwamus0 ,

Could you expand a bit more on your use case? What kind of text are you trying to predict? Is it just some simple labels, like "Red", "Blue", "Green", or is it something more complicated?

Also, what kind of text are you training on, and what pre-processing are you applying to the feature? If you are using the Count vectorization pre-processor, you may be generating a huge number of columns, which can massively increase the size of your dataset and cause the out-of-memory error.

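To make the memory concern concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the vectorized features end up stored as a dense matrix of 64-bit values, and it uses a hypothetical vocabulary size of 20,000 terms. Neither assumption comes from the original post, so substitute the figures from your own project.

```python
def dense_matrix_gib(n_rows: int, n_columns: int, bytes_per_value: int = 8) -> float:
    """Estimate the size in GiB of a dense n_rows x n_columns matrix."""
    return n_rows * n_columns * bytes_per_value / 1024**3


n_records = 244_000   # dataset size mentioned in the original post
vocab_size = 20_000   # hypothetical: one column per distinct word after vectorization

estimate = dense_matrix_gib(n_records, vocab_size)
print(f"Estimated dense feature matrix: {estimate:.1f} GiB")
```

Under those assumptions the estimate is in the tens of GiB, which would be consistent with a process being killed with exit code 137. Limiting the number of generated columns, or keeping the representation sparse where that is possible, should bring the figure down considerably.
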
Arwamus0
Level 2
Author

Hi @LouisDHulst,

The text column I want to predict consists of codes made of two-digit groups separated by spaces, e.g. ## ## ##.

I'm predicting it from description keywords, so on that text column I applied normalization and removed stopwords.

How can I optimize this? It is telling me that only 50 classes can be classified.
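
For reference, the preprocessing and target format described above can be approximated in plain Python as follows. This is only a sketch under stated assumptions: the stopword list is a tiny hypothetical sample, `normalize` and `is_valid_code` are illustrative names, the codes are assumed to have exactly three two-digit groups, and none of this is Dataiku's own implementation.

```python
import re

# Hypothetical stopword list for illustration; a real list would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "with", "in", "on", "to"}


def normalize(text: str) -> list[str]:
    """Lowercase the text, keep alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def is_valid_code(code: str) -> bool:
    """Check the assumed '## ## ##' format: three two-digit groups."""
    return re.fullmatch(r"\d{2} \d{2} \d{2}", code) is not None


tokens = normalize("Repair of the Steel Pipe, with fittings")
print(tokens)
print(is_valid_code("12 34 56"), is_valid_code("123456"))
```

Checking the target column with something like `is_valid_code` before training may help spot malformed codes that would otherwise show up as extra, near-empty classes.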
