How to sample an unbalanced dataset?

Solved!
UserBird
Dataiker
I have a dataset that is too big to fit in memory, so I want to downsample it.
But the two classes to predict are unbalanced: the target column contains many more 0's than 1's.
jrouquie
Dataiker Alumni

We do not have a direct recipe to do so.

The fastest way is probably a SQL recipe. For instance, in Hive, the following keeps every row with target = 1 and a random sample of 1,000,000 rows with target = 0:

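-- Keep every row of the minority class
SELECT * FROM train_set WHERE target = 1
UNION ALL
-- Keep a random sample of 1,000,000 rows from the majority class
SELECT * FROM (
  SELECT * FROM train_set WHERE target = 0 ORDER BY rand() LIMIT 1000000
) sampled_negatives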
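
If a SQL engine is not available, the same idea can be sketched in a Python recipe by streaming over the rows once. This is only an illustration under assumptions, not a built-in feature: the function name `downsample_majority`, its parameters, and the demo data are hypothetical, and it swaps the `ORDER BY rand() LIMIT` approach for reservoir sampling (Algorithm R), which keeps a uniform sample of the 0's without sorting or loading the full dataset.

```python
import random


def downsample_majority(rows, target_key="target", majority_value="0",
                        sample_size=1_000_000, seed=None):
    """Keep every minority row and a uniform random sample of majority rows.

    `rows` is any iterable of dicts, read one row at a time, so only the
    rows that are kept need to fit in memory.
    """
    rng = random.Random(seed)
    minority = []        # all rows whose target differs from majority_value
    reservoir = []       # uniform sample of the majority rows seen so far
    seen_majority = 0

    for row in rows:
        if row[target_key] != majority_value:
            minority.append(row)
            continue
        seen_majority += 1
        if len(reservoir) < sample_size:
            # Fill the reservoir with the first sample_size majority rows
            reservoir.append(row)
        else:
            # Algorithm R: the new row replaces a random slot with
            # probability sample_size / seen_majority
            j = rng.randrange(seen_majority)
            if j < sample_size:
                reservoir[j] = row

    return minority + reservoir


# Small in-memory demo (made-up data): 990 zeros and 10 ones,
# keeping at most 50 zeros.
demo_rows = [{"id": i, "target": "1" if i % 100 == 0 else "0"}
             for i in range(1000)]
sampled = downsample_majority(demo_rows, sample_size=50, seed=0)
```

On a real dataset, `rows` could be something like a `csv.DictReader` over an exported file, with `sample_size` set to the 1,000,000 used in the Hive query. Note that target values read from a CSV are strings, which is why `majority_value` defaults to `"0"`.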
