How to split dataset based on the value of a column and define the number of the output datasets
Hello, I have a dataset with millions rows in the format as below, and want to split it into two datasets A and B. Is it possible to do by using the visual recipes of Dataiku (without coding)?
Dataset A: data with only "target_flg" = 1
Dataset B: data with only "target_flg" = 0, but instead of exporting all rows where "target_flg" = 0, I want to exact the data by random and ensure the numbers of rows of Dataset B is the exact 9 times of Dataset A.
(In the original dataset, the rows of "target_flg" = 0 is over 9 times of "target_flg" = 1)
target_flg | Col_A | Col_B | Col_C |
1 | AAA | BBB | DDD |
0 | THJ | JGG | DDS |
0 | EEE | DDD | JGG |
0 | WWD | JGG | DDD |
1 | DDD | DDD | BBB |
1 | EEE | BBB | EEE |
0 | BBB | JGG | AAA |
0 | AAA | AAA | JGG |
I've tried to use the Split Recipe but it seems there is no options to define the numbers of output datasets based on a particular dataset.
Ultimately, I want to join Dataset A and B together to build a ML model.
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,909 Neuron
You can either split by percentage or you can split by value, but you can't combine the two together. Having said that you are saying that "In the original dataset, the rows of "target_flg" = 0 is over 9 times of "target_flg" = 1" so using a Split Recipe and splitting by value "target_flg" = 1 for Dataset A and "target_flg" = 0 for Dataset B will leave Dataset B being 9 times bigger than Dataset A, which is what you want right? So I don't really see what your problem is.
-
stephl Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 8 ✭
Thanks @Turribeach
.I would like the number of rows of Dataset B to be the EXACT 9 times of Dataset A...
It seems that is impossible without coding....?
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,909 Neuron
The Splitting method "Randomly dispatch data" allows you to set a ratio for the split. So if you set 10% for one dataset the other will be 90% which is 9 times bigger.
There might be a way to get your both requirements after all. If you select the Splitting method "Dispatch percentiles of sorted data" you can set a sort order and also a ratio for the split. So if you set the sort to be target_flg descending and set it to be 10% you will get 10% in Table A with target_flg = 1 and the remaining 90% will be in Table B. Make sure you create a new dataset for the remaining data (Output => Other dataset...).
-
stephl Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 8 ✭
Thanks so much, @Turribeach
I tried it out.
Since the "target_flg = 1" is not always exact the 10% of the whole dataset, with this method, it seems Dataset A inevitably contains "target_flg = 0"....
Or is there any way to split the data to meet this requirement when building the model?