Oversampling Dataset
UserBird
Dataiker, Alpha Tester Posts: 535 Dataiker
Is it possible to oversample my dataset?
For example, I have the following records and target variable:
1 2 3 | 5
2 2 3 | 6
1 1 1 | 1
3 2 2 | 5
I want to duplicate row #3 (or generate more than one duplicate of it) so that my dataset looks as follows:
1 2 3 | 5
2 2 3 | 6
1 1 1 | 1
1 1 1 | 1
3 2 2 | 5
How can I do this?
Thank you in advance!
Answers
DSS does not have a built-in oversampling mechanism.
DSS has a "class rebalancing" sampling method. You could use it, either for the Explore / Prepare view, as dataset sampling in machine learning, or in a dedicated sampling recipe that will give you more balanced data.
However, this "class rebalancing" sampling method only undersamples; it never oversamples. It is also best suited to columns with reasonably low cardinality.
At the moment, if you want to oversample some rows, the best way is to use a Python recipe (assuming your dataset fits in memory), or otherwise a PySpark/SparkR/Spark-Scala recipe.
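
For the in-memory Python route, here is a minimal sketch using pandas and NumPy on the example from the question. The column names (a, b, c, target) and the helper oversample_rows are made up for illustration; they are not part of DSS. In an actual Python recipe you would load the input into a pandas DataFrame first and write the result to the output dataset afterwards.

```python
import numpy as np
import pandas as pd


def oversample_rows(df, mask, copies=1):
    """Return a copy of df where every row selected by `mask` appears
    `copies` extra times, placed right after the original row.

    `oversample_rows` is a hypothetical helper, not a DSS function.
    """
    selected = np.asarray(mask, dtype=bool)
    # 1 occurrence for unselected rows, 1 + copies for selected rows
    repeats = selected.astype(int) * copies + 1
    # Repeat row positions (not labels) so a non-unique index is safe
    positions = np.repeat(np.arange(len(df)), repeats)
    return df.iloc[positions].reset_index(drop=True)


# Example data from the question (column names are assumptions)
df = pd.DataFrame(
    {
        "a": [1, 2, 1, 3],
        "b": [2, 2, 1, 2],
        "c": [3, 3, 1, 2],
        "target": [5, 6, 1, 5],
    }
)

# Duplicate the row whose target is 1 once
oversampled = oversample_rows(df, df["target"] == 1, copies=1)
print(oversampled)
```

Raising copies gives more than one duplicate of the selected rows. The mask can be any boolean condition on the DataFrame, so the same helper works for oversampling a whole minority class rather than a single row.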