
How to sample an unbalanced dataset?

Solved!
UserBird
Dataiker
I have a dataset that is too big to fit in memory, so I want to downsample it.
But the two classes to predict are unbalanced: there are many more 0's than 1's in the target column.
jrouquie
Dataiker Alumni

We do not have a direct recipe to do so.

The fastest way is probably with a SQL recipe. For instance, in Hive:

 




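-- Keep every row of the minority class (target = 1)
SELECT * FROM train_set WHERE target = 1
UNION ALL
-- Add a random sample of 1,000,000 majority-class rows (target = 0)
SELECT * FROM (SELECT * FROM train_set WHERE target = 0 ORDER BY rand() LIMIT 1000000) foo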
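If a SQL engine is not available, the same idea (keep every 1, keep a random subset of the 0's) can be sketched in a Python recipe that reads the data as a stream. This is only an illustration, not a built-in feature: it swaps the `ORDER BY rand() LIMIT n` step for reservoir sampling, the function name `downsample_majority` and its parameters are hypothetical, and it assumes that all the minority rows plus the sample fit in memory even though the full dataset does not.

```python
import random


def downsample_majority(rows, target_key="target", n_majority=1000000, seed=0):
    """Keep every minority row (target == 1) and a uniform random sample of
    at most n_majority majority rows, reading `rows` once as a stream.

    `rows` is any iterable of dict-like records (hypothetical input format).
    """
    rng = random.Random(seed)
    minority = []    # all rows with target == 1 (assumed to fit in memory)
    reservoir = []   # current random sample of the other rows
    seen = 0         # number of majority rows read so far

    for row in rows:
        # Compare as strings so both 1 and "1" (e.g. from a CSV) match.
        if str(row[target_key]) == "1":
            minority.append(row)
            continue
        seen += 1
        if len(reservoir) < n_majority:
            # Fill the reservoir with the first n_majority majority rows.
            reservoir.append(row)
        else:
            # Reservoir sampling (Algorithm R): replace a random slot with
            # probability n_majority / seen, so every row is equally likely.
            j = rng.randrange(seen)
            if j < n_majority:
                reservoir[j] = row

    return minority + reservoir
```

For a CSV export, something like `downsample_majority(csv.DictReader(f), n_majority=1000000)` would stream the file row by row. Unlike the Hive query, this needs no global sort, but it runs on a single machine.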


 
