Stacking datasets with a condition

Solved!
OlivierW
Level 3
Stacking datasets with a condition

Hello,

I have been using a stacking recipe to group two datasets which have the same columns names and it's working fine.

But now, I would like to keep only certain lines of the second dataset : those for which the value in one of the column matches a value present in the same column of the first dataset.

In the example below, I would like to make a single dataset grouping records of shops A and B, but keeping for shop B only the customers who have been to shop A. I show below the two input datasets and the output I wish to obtain.

Does anyone knows how to use dataiku recipes to get this result?

Thanks in advance for your help!

Olivier

Dataset 1 :

ShopNameDateSpending
ALana07/03/200045
ALana09/03/200032
ALana02/04/200067
ASteve02/02/200037

 

Dataset 2

ShopNameDateSpending
BJoe12/05/200015
BJoe17/06/200032
BJoe02/09/20005
BLana05/01/200013
BLana06/01/20002
BSteve09/03/20007
BSteve01/07/200062
BEmy12/05/20005
BEmy02/09/20003
BEmy05/10/200045
BEmy06/10/200065

 

Dataset grouping with condition

ShopNameDateSpending
ALana07/03/200045
ALana09/03/200032
ALana02/04/200067
ASteve02/02/200037
BLana05/01/200013
BLana06/01/20002
BSteve09/03/20007
BSteve01/07/200062
0 Kudos
1 Solution
tgb417

@OlivierW 

This feels like something that can be done with the stack recipe and 2 Join Recipe.

1. find the folks who visited both shops

First, you would join each of the datasets on the customer name to find folks who have been to both shops.  This would be done with an inner join.

Join Dialog box highlighting the "Inner Join" join type.Join Dialog box highlighting the "Inner Join" join type.

Under selected columns keep only one of the customer name columns from one of the two data sets.

For this Join, you want to make it distinct.

Join Recipie Dialog box, showing Post Filters and the Distinct rows.Join Recipie Dialog box, showing Post Filters and the Distinct rows.

 Then once you have that dataset.

2. Stack the data for both shops

This will give you the combined dataset with lots of rows you don't want.

3. Join the list of customers your want (step 1), to the stacked list of all your records (step 2)

We are now going to join the list of customers (who were at both shops) to the stack of records on the customer name.

This should also be an inner join.  This type of join will drop all of the records that don't match the list of customer names who have been to both stores.  

You will likely want to use the select column feature to include just the columns you care about.

Conclusion:

This approach can be generalized to many different kinds of filtering needs.  Create a dataset with the unique values you need and use an inner join to pull the records you want from a dataset.

Hope that helps.

--Tom

View solution in original post

2 Replies
tgb417

@OlivierW 

This feels like something that can be done with the stack recipe and 2 Join Recipe.

1. find the folks who visited both shops

First, you would join each of the datasets on the customer name to find folks who have been to both shops.  This would be done with an inner join.

Join Dialog box highlighting the "Inner Join" join type.Join Dialog box highlighting the "Inner Join" join type.

Under selected columns keep only one of the customer name columns from one of the two data sets.

For this Join, you want to make it distinct.

Join Recipie Dialog box, showing Post Filters and the Distinct rows.Join Recipie Dialog box, showing Post Filters and the Distinct rows.

 Then once you have that dataset.

2. Stack the data for both shops

This will give you the combined dataset with lots of rows you don't want.

3. Join the list of customers your want (step 1), to the stacked list of all your records (step 2)

We are now going to join the list of customers (who were at both shops) to the stack of records on the customer name.

This should also be an inner join.  This type of join will drop all of the records that don't match the list of customer names who have been to both stores.  

You will likely want to use the select column feature to include just the columns you care about.

Conclusion:

This approach can be generalized to many different kinds of filtering needs.  Create a dataset with the unique values you need and use an inner join to pull the records you want from a dataset.

Hope that helps.

--Tom
OlivierW
Level 3
Author

Hello Tom,

I tried your method, and it worked great!

Thanks for taking the time to post this detailed reply.

Olivier