Is it possible to train a classification model with unbalanced data? (Text classification)

Solved!
taloh90
Level 2

I am doing text mining on PDF articles and I would like to do some classification. The articles are in English. The problem is that my dataset is unbalanced: I have 24 articles for class 1 and only 5 articles for class 2, and I can't get any more articles than that. So I'm in a bind.

I thought of doing oversampling, but with so little data... One idea is to increase the size of class 2 by doing back-translation on the 5 articles: translate them into Chinese with Google Translate or DeepL and then translate them back into English. The results won't be carbon copies, since the round trip changes the wording, and I would end up with a more or less balanced dataset.
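A minimal sketch of what I have in mind, assuming the Hugging Face transformers library and the Helsinki-NLP MarianMT English/Chinese models (the model names and helper functions here are just an illustration, not something I have already run):

```python
# Sketch of back-translation (EN -> ZH -> EN) with MarianMT models.
# Assumes: pip install transformers sentencepiece torch
from transformers import MarianMTModel, MarianTokenizer

def load_pair(model_name):
    tok = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    return tok, model

def translate(texts, tok, model):
    # Tokenize, generate translations, and decode back to plain strings.
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tok.batch_decode(generated, skip_special_tokens=True)

# English -> Chinese, then Chinese -> English.
en_zh_tok, en_zh_model = load_pair("Helsinki-NLP/opus-mt-en-zh")
zh_en_tok, zh_en_model = load_pair("Helsinki-NLP/opus-mt-zh-en")

def back_translate(texts):
    zh = translate(texts, en_zh_tok, en_zh_model)
    return translate(zh, zh_en_tok, zh_en_model)

# Example: augment passages from the 5 minority-class articles
# (split each article into chunks so it fits the model's input length).
minority_chunks = ["Concrete experience is the first phase of the cycle."]
print(back_translate(minority_chunks))
```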

What do you think?

3 Replies
tgb417

@taloh90 ,

I think your intuition that you have a small data set is a wise one.

Given 29 articles, you could likely read and classify them all manually. However, I'd like to celebrate your interest in modeling. Can you share a little bit about the value of the model for you? Is this about learning NLP? Is this about finding similar articles in the future? Is this about understanding the different types of features that drive your classification? With this kind of info, we as a community might be able to make a few more targeted suggestions.

When reading your post, one idea came to mind: rather than looking at the whole article as a unit, what about using the paragraph or even the sentence as your unit of analysis? Here is an example of sentiment analysis done a few years back with a small number of longer documents (books by Jane Austen), where the analysis is done at a sub-document level such as a paragraph or a sentence. Something like this might change the nature of your problem a bit; there's a rough sketch after the links below.

https://juliasilge.com/blog/if-i-loved-nlp-less/

https://www.visualthesaurus.com/cm/ll/pride-and-prejudice-and-natural-language-processing/
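To make that concrete, here is a rough sketch of the sub-document idea, assuming Python with NLTK and pandas (the article list and column names are placeholders, not your actual data):

```python
# Sketch: turn each labelled article into many sentence-level rows.
# Assumes: pip install nltk pandas
import nltk
import pandas as pd

nltk.download("punkt")      # sentence tokenizer models
nltk.download("punkt_tab")  # needed by newer NLTK versions
from nltk.tokenize import sent_tokenize

# Placeholder (article_text, label) pairs standing in for your 29 articles.
articles = [
    ("First article text. It has several sentences.", "class_1"),
    ("Second article text. Also several sentences.", "class_2"),
]

rows = []
for doc_id, (text, label) in enumerate(articles):
    for sent in sent_tokenize(text):
        # Each sentence becomes a training row that inherits the article label.
        rows.append({"doc_id": doc_id, "text": sent, "label": label})

df = pd.DataFrame(rows)
print(df.head())
# Keep doc_id so you can split train/test by document and avoid
# leaking sentences from the same article across the split.
```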

Just my $0.02.  I hope it helps.

 

--Tom
taloh90
Level 2
Author

It's a great idea. I'm a bit embarrassed I didn't think of it, since it's so obvious in my case.

The aim is to learn NLP and to make a simple classification of these articles: telling whether an article is valid or not.

These articles were written based on Kolb's model of experiential learning; each article covers a topic following the model's 4 phases (Concrete Experience, Reflective Observation, Abstract Conceptualization, Active Experimentation). The articles have been evaluated on 4 criteria that check whether the 4 phases are respected. Each criterion is scored on a quantitative scale from 0 to 3 points: 0 or 1 is insufficient, while 2 or 3 is sufficient. If 3 of the criteria are sufficient, the article is validated.
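Just to make that labelling rule explicit (reading "3 criteria" as "at least 3"), a tiny illustrative sketch, not something from my pipeline:

```python
# Illustrative labelling rule: each criterion is scored 0-3;
# 2 or 3 counts as sufficient, and the article is valid if at
# least 3 of the 4 criteria are sufficient.
def criterion_is_sufficient(score: int) -> bool:
    return score >= 2

def article_is_valid(scores) -> bool:
    return sum(criterion_is_sufficient(s) for s in scores) >= 3

print(article_is_valid([3, 2, 1, 2]))  # True  (3 sufficient criteria)
print(article_is_valid([3, 1, 1, 2]))  # False (only 2 sufficient criteria)
```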


With the idea you proposed, I will extract from each article the 4 text passages that correspond to the 4 phases of the model, and classify at the level of these criteria rather than the whole article.

This gives me a dataset of 29 * 4 = 116 rows in total. From a preliminary analysis, I will have 42 insufficient criteria and 74 sufficient criteria. This is a big change from before, thank you very much.
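If it helps to see where I want to go with this, here is a rough baseline sketch assuming scikit-learn; the dataframe below is placeholder data standing in for the 116 extracted passages, and class_weight="balanced" accounts for the remaining 42 vs 74 imbalance:

```python
# Sketch of a baseline classifier on the criterion-level rows, assuming a
# dataframe with a "text" column (the passage for one criterion) and a
# "label" column (sufficient / insufficient).
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data; in practice this would be the 116 extracted passages.
df = pd.DataFrame({
    "text": ["passage about concrete experience", "passage about reflection",
             "weak passage", "another weak passage"] * 10,
    "label": ["sufficient", "sufficient", "insufficient", "insufficient"] * 10,
})

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    # class_weight="balanced" compensates for the 42 vs 74 class ratio.
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

# With so few rows, cross-validation gives a more honest estimate than a
# single train/test split; macro-F1 gives equal weight to both classes.
scores = cross_val_score(model, df["text"], df["label"], cv=5, scoring="f1_macro")
print(scores.mean())
```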

tgb417

@taloh90 

Please let us know how you get on with your experiments.  Happy Data Sciencing....

--Tom