How to sample an unbalanced dataset?

UserBird (Dataiker, Alpha Tester)
I have a dataset that is too big to fit in memory, so I want to downsample it.
However, the two classes to predict are unbalanced: the target column contains many more 0s than 1s.

Best Answer

  • jrouquie (Dataiker Alumni)
    edited July 17

    We do not have a direct recipe to do so.

    The fastest way is probably to use a SQL recipe. For instance, in Hive:


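    -- Keep every row of the minority class (target = 1)
    SELECT * FROM train_set WHERE target = 1
    UNION ALL
    -- Add a random sample of at most 1,000,000 majority-class rows (target = 0)
    SELECT * FROM (SELECT * FROM train_set WHERE target = 0 ORDER BY rand() LIMIT 1000000) foo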
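
    A possible variant, sketched here as an assumption rather than as a tested recipe: if an exact row count is not required, each majority-class row can be kept with a fixed probability instead of sorting all of them by rand(). This avoids the global sort, which is likely the expensive part on a large table. The fraction 0.05 is a hypothetical value to tune to your data; train_set and target are the names used in the query above.

    ```sql
    -- Keep every row of the minority class (target = 1)
    SELECT * FROM train_set WHERE target = 1
    UNION ALL
    -- Keep each majority-class row independently with probability 0.05
    -- (0.05 is an assumed fraction; the resulting row count is only approximate)
    SELECT * FROM train_set WHERE target = 0 AND rand() < 0.05
    ```

    With this variant the number of sampled 0 rows varies from run to run, so use the ORDER BY rand() LIMIT form when a fixed sample size matters.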
