How to sample an unbalanced dataset?

Dataiker, Alpha Tester Posts: 535 Dataiker
I have a dataset that is too big to fit in memory, so I want to downsample it.
But the two classes to predict are unbalanced: there are many more 0's than 1's in the target column.

Best Answer

  • Dataiker Alumni Posts: 87 ✭✭✭✭✭✭✭
    edited July 2024 Answer ✓

    We do not have a direct recipe to do so.

    The fastest way is probably a SQL recipe. For instance, in Hive, you can keep every row where target = 1 and add a random sample of 1,000,000 rows where target = 0:


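    -- Keep every row of the minority class
    SELECT * FROM train_set WHERE target = 1
    UNION ALL
    -- Add a random sample of 1,000,000 rows from the majority class
    SELECT * FROM (SELECT * FROM train_set WHERE target = 0 ORDER BY rand() LIMIT 1000000) foo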
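
If SQL is not an option, the same idea can be sketched in plain Python by streaming the rows instead of loading them all. This is only an illustration, not a built-in recipe: the function name `downsample_majority` and its parameters are hypothetical, rows are assumed to be dict-like with a `target` key, and the minority class is assumed to fit in memory. The majority class is sampled with reservoir sampling, so only `n_majority` of those rows are held at any time.

```python
import random


def downsample_majority(rows, target_key="target", n_majority=1000, seed=0):
    """Keep every row with target == 1 and a uniform random sample of at
    most n_majority rows with any other target, reading one row at a time."""
    rng = random.Random(seed)
    positives = []  # assumption: the minority class fits in memory
    reservoir = []  # fixed-size sample of the majority class
    seen = 0        # number of majority rows read so far

    for row in rows:
        # Compare as strings so that both 1 and "1" (e.g. from a CSV) match.
        if str(row[target_key]) == "1":
            positives.append(row)
            continue

        seen += 1
        if len(reservoir) < n_majority:
            # Fill the reservoir with the first n_majority majority rows.
            reservoir.append(row)
        else:
            # Reservoir sampling: the new row replaces a random slot
            # with probability n_majority / seen.
            j = rng.randrange(seen)
            if j < n_majority:
                reservoir[j] = row

    return positives + reservoir
```

For a CSV export, something like `downsample_majority(csv.DictReader(f), n_majority=1000000)` would read the file row by row; the file and column names are up to you.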
