How to sample an unbalanced dataset?

UserBird
Dataiker
I have a dataset that is too big to fit in memory, so I want to downsample it.
However, the two classes to predict are unbalanced: the target column contains many more 0s than 1s.
jrouquie
Dataiker Alumni

We do not have a direct recipe to do so.

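The fastest way is probably to use a SQL recipe. For instance, in Hive, the following keeps every row of the minority class and a random sample of one million rows of the majority class:
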
SELECT * FROM train_set WHERE target = 1
UNION ALL
SELECT * FROM (SELECT * FROM train_set WHERE target = 0 ORDER BY rand() LIMIT 1000000) foo
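
If a SQL engine is not available, the same idea can be sketched in plain Python by streaming the rows once and keeping a fixed-size uniform sample of the majority class (reservoir sampling). This is only a minimal sketch under assumptions: the function name `downsample_majority`, the CSV layout, and the string labels "0" and "1" are hypothetical, and the minority class is assumed to fit in memory.

```python
import csv
import io
import random


def downsample_majority(rows, target_col, n_majority, seed=0):
    """Keep every row whose target is "1" and a uniform random sample of
    at most n_majority rows whose target is "0", in a single pass."""
    rng = random.Random(seed)
    positives = []       # all minority-class rows (assumed to fit in memory)
    reservoir = []       # current sample of majority-class rows
    seen_negatives = 0
    for row in rows:
        if row[target_col] == "1":
            positives.append(row)
            continue
        seen_negatives += 1
        if len(reservoir) < n_majority:
            # Fill the reservoir with the first n_majority negatives.
            reservoir.append(row)
        else:
            # Reservoir sampling (Algorithm R): the new row replaces a
            # random slot with probability n_majority / seen_negatives.
            j = rng.randrange(seen_negatives)
            if j < n_majority:
                reservoir[j] = row
    return positives + reservoir


# Tiny in-memory CSV standing in for train_set (hypothetical data):
# 1000 rows, of which every tenth row has target = 1.
data = "id,target\n" + "\n".join(
    f"{i},{1 if i % 10 == 0 else 0}" for i in range(1000)
)
reader = csv.DictReader(io.StringIO(data))
sample = downsample_majority(reader, "target", n_majority=200)

n_pos = sum(r["target"] == "1" for r in sample)
n_neg = sum(r["target"] == "0" for r in sample)
print(n_pos, n_neg)  # prints "100 200"
```

Because the rows are read one at a time, only the positives and the reservoir are held in memory, which plays the role of the `ORDER BY rand() LIMIT ...` subquery without sorting the whole table.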


 
