Is it possible to train a classification model with unbalanced data? (Text classification)

taloh90

I am doing text mining on PDF articles and I would like to do some classification. The articles are in English. The problem is that my dataset is unbalanced: I have 24 articles for class 1 and only 5 articles for class 2. I can't get any more articles, so I'm in a bind.

I thought of doing oversampling, but with so little data... One idea is to increase the size of class 2 by doing back-translation on its 5 articles. That is to say, translate the articles into Chinese with Google Translate or DeepL and then translate them back into English. They won't be carbon copies because the wording will have changed somewhat, and I will have a more or less balanced dataset.

What do you think?
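
The back-translation idea above can be sketched roughly as follows. This is only an illustration: `translate` is a hypothetical callable standing in for whatever translation service is used (no real Google Translate or DeepL API is called here), and `toy_translate` is a stand-in that merely simulates the wording drift of a round trip.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: translate(text, source_language, target_language) -> text
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator, pivot: str = "zh") -> str:
    """Translate English text into a pivot language and back into English."""
    pivoted = translate(text, "en", pivot)
    return translate(pivoted, pivot, "en")


def augment_minority(texts: List[str], translate: Translator,
                     pivots: Sequence[str] = ("zh",)) -> List[str]:
    """Return the original texts plus their back-translated variants."""
    augmented = list(texts)
    for pivot in pivots:
        for text in texts:
            candidate = back_translate(text, translate, pivot)
            if candidate not in augmented:  # skip exact copies, they add nothing
                augmented.append(candidate)
    return augmented


def toy_translate(text: str, src: str, dst: str) -> str:
    """Stand-in for a real service: swaps one word on the way back to English."""
    return text.replace("shows", "demonstrates") if dst == "en" else text


minority = ["The study shows a clear effect.", "No change was observed."]
augmented = augment_minority(minority, toy_translate)
print(len(minority), "->", len(augmented))
```

Back-translated variants are near-duplicates of their source article, so an original and its variant should stay on the same side of any train/test split.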

Best Answer

  • tgb417

    @taloh90,

    I think your intuition that you have a small data set is a sound one.

    Given only 29 articles, you could likely read all of them and classify them manually. However, I'd like to celebrate your interest in modeling. Can you share a little bit about the value of the model for you? Is this about learning NLP? Is this about finding similar articles in the future? Is this about understanding the different types of features that drive your classification? With this kind of info, we as a community might be able to make a few more targeted suggestions.

    When reading your post, one idea that came to mind was this: rather than looking at the whole article as a unit, what about looking at the paragraph or even the sentence as your unit of analysis? Here is an example of sentiment analysis done a few years back on a small number of longer documents: books by Jane Austen. The analysis is done at a sub-document level, such as a paragraph or a sentence. Something like this might change the nature of your problem a bit.

    https://juliasilge.com/blog/if-i-loved-nlp-less/

    https://www.visualthesaurus.com/cm/ll/pride-and-prejudice-and-natural-language-processing/
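
As a rough illustration of changing the unit of analysis, here is a minimal sketch that turns one labelled article into sentence-level rows. It is not taken from the linked posts. The function names are made up for this example, the regex sentence splitter is deliberately naive (it will trip over abbreviations such as "e.g."), and letting every sentence inherit its article's label is an assumption that may not suit every problem.

```python
import re
from typing import List, Tuple

Row = Tuple[str, int, int, str, str]  # (doc_id, paragraph_no, sentence_no, text, label)


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph: str) -> List[str]:
    """Naive split: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def to_rows(doc_id: str, text: str, label: str) -> List[Row]:
    """Explode one article into sentence rows that keep the article id and label."""
    rows = []
    for p_no, paragraph in enumerate(split_paragraphs(text)):
        for s_no, sentence in enumerate(split_sentences(paragraph)):
            rows.append((doc_id, p_no, s_no, sentence, label))
    return rows


article = "First point. Second point!\n\nA new paragraph?"
for row in to_rows("article_01", article, "class 1"):
    print(row)
```

Keeping `doc_id` on every row makes it possible to keep all units from one article together when splitting into training and test sets.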

    Just my $0.02. I hope it helps.

Answers

  • taloh90

    It's a great idea. I'm ashamed I didn't think of it when it's so obvious in my case.

    The aim is to learn NLP and to build a simple classifier for these articles that tells whether an article is valid or not.

    These articles were written following Kolb's model of experiential learning: each article covers a topic through the 4 phases of the model (Concrete Experience, Reflective Observation, Abstract Conceptualization, Active Experimentation). Each article has been evaluated on 4 criteria that check whether the 4 phases are respected. Each criterion is scored on a scale from 0 to 3 points: 0 to 1 is insufficient, while 2 to 3 is sufficient. If 3 criteria are valid, then the article is validated.
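
The validation rule described above can be written down as a small sketch. Two assumptions are made here that the post does not state: scores are whole numbers, and "3 criteria" is read as "at least 3". The function and parameter names are illustrative only.

```python
from typing import Dict

PHASES = (
    "Concrete Experience",
    "Reflective Observation",
    "Abstract Conceptualization",
    "Active Experimentation",
)


def criterion_is_sufficient(score: int) -> bool:
    """0 or 1 point is insufficient, 2 or 3 points is sufficient."""
    if not 0 <= score <= 3:
        raise ValueError(f"score must be between 0 and 3, got {score}")
    return score >= 2


def article_is_valid(scores: Dict[str, int], min_sufficient: int = 3) -> bool:
    """Assumption: an article is validated when at least 3 of 4 criteria are sufficient."""
    sufficient = sum(criterion_is_sufficient(scores[phase]) for phase in PHASES)
    return sufficient >= min_sufficient


example_scores = {
    "Concrete Experience": 3,
    "Reflective Observation": 2,
    "Abstract Conceptualization": 1,
    "Active Experimentation": 2,
}
print(article_is_valid(example_scores))
```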


    Following the idea you proposed, I will extract from each article the 4 passages of text that correspond to the 4 phases of the model. I will then classify each criterion rather than the whole article.

    This will give me a dataset of 29 * 4 = 116 rows in total. From a preliminary count, I will have 42 insufficient criteria and 74 sufficient criteria. This is a big change from before, thank you very much.
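
To put numbers on that change, here is a minimal sketch comparing the two class distributions. The weights follow the common "balanced" heuristic, n_samples / (n_classes * class_count), which is the formula scikit-learn uses for `class_weight="balanced"`; it is computed by hand here so that nothing beyond the standard library is needed. The helper names are illustrative.

```python
from typing import Dict


def imbalance_ratio(counts: Dict[str, int]) -> float:
    """Size of the largest class divided by the size of the smallest."""
    return max(counts.values()) / min(counts.values())


def balanced_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Per-class weight: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


article_counts = {"class 1": 24, "class 2": 5}
criterion_counts = {"sufficient": 74, "insufficient": 42}

for name, counts in (("articles", article_counts), ("criteria", criterion_counts)):
    ratio = imbalance_ratio(counts)
    weights = {k: round(v, 2) for k, v in balanced_weights(counts).items()}
    print(f"{name}: {sum(counts.values())} rows, ratio {ratio:.2f}:1, weights {weights}")
```

The imbalance drops from 4.8:1 at the article level to roughly 1.76:1 at the criterion level.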

  • tgb417

    @taloh90

    Please let us know how you get on with your experiments. Happy Data Sciencing....
