Stacking datasets with a condition
Hello,
I have been using a stacking recipe to group two datasets which have the same columns names and it's working fine.
But now, I would like to keep only certain lines of the second dataset : those for which the value in one of the column matches a value present in the same column of the first dataset.
In the example below, I would like to make a single dataset grouping records of shops A and B, but keeping for shop B only the customers who have been to shop A. I show below the two input datasets and the output I wish to obtain.
Does anyone knows how to use dataiku recipes to get this result?
Thanks in advance for your help!
Olivier
Dataset 1 :
Shop | Name | Date | Spending |
A | Lana | 07/03/2000 | 45 |
A | Lana | 09/03/2000 | 32 |
A | Lana | 02/04/2000 | 67 |
A | Steve | 02/02/2000 | 37 |
Dataset 2
Shop | Name | Date | Spending |
B | Joe | 12/05/2000 | 15 |
B | Joe | 17/06/2000 | 32 |
B | Joe | 02/09/2000 | 5 |
B | Lana | 05/01/2000 | 13 |
B | Lana | 06/01/2000 | 2 |
B | Steve | 09/03/2000 | 7 |
B | Steve | 01/07/2000 | 62 |
B | Emy | 12/05/2000 | 5 |
B | Emy | 02/09/2000 | 3 |
B | Emy | 05/10/2000 | 45 |
B | Emy | 06/10/2000 | 65 |
Dataset grouping with condition
Shop | Name | Date | Spending |
A | Lana | 07/03/2000 | 45 |
A | Lana | 09/03/2000 | 32 |
A | Lana | 02/04/2000 | 67 |
A | Steve | 02/02/2000 | 37 |
B | Lana | 05/01/2000 | 13 |
B | Lana | 06/01/2000 | 2 |
B | Steve | 09/03/2000 | 7 |
B | Steve | 01/07/2000 | 62 |
Best Answer
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron
This feels like something that can be done with the stack recipe and 2 Join Recipe.
1. find the folks who visited both shops
First, you would join each of the datasets on the customer name to find folks who have been to both shops. This would be done with an inner join.
Under selected columns keep only one of the customer name columns from one of the two data sets.
For this Join, you want to make it distinct.
Then once you have that dataset.
2. Stack the data for both shops
This will give you the combined dataset with lots of rows you don't want.
3. Join the list of customers your want (step 1), to the stacked list of all your records (step 2)
We are now going to join the list of customers (who were at both shops) to the stack of records on the customer name.
This should also be an inner join. This type of join will drop all of the records that don't match the list of customer names who have been to both stores.
You will likely want to use the select column feature to include just the columns you care about.
Conclusion:
This approach can be generalized to many different kinds of filtering needs. Create a dataset with the unique values you need and use an inner join to pull the records you want from a dataset.
Hope that helps.
Answers
-
Hello Tom,
I tried your method, and it worked great!
Thanks for taking the time to post this detailed reply.
Olivier