Stacking datasets with a condition

OlivierW · ‎01-08-2021

Hello,

I have been using a stacking recipe to group two datasets which have the same columns names and it's working fine.

But now, I would like to keep only certain lines of the second dataset : those for which the value in one of the column matches a value present in the same column of the first dataset.

In the example below, I would like to make a single dataset grouping records of shops A and B, but keeping for shop B only the customers who have been to shop A. I show below the two input datasets and the output I wish to obtain.

Does anyone knows how to use dataiku recipes to get this result?

Thanks in advance for your help!

Olivier

Dataset 1 :

Shop	Name	Date	Spending
A	Lana	07/03/2000	45
A	Lana	09/03/2000	32
A	Lana	02/04/2000	67
A	Steve	02/02/2000	37

Dataset 2

Shop	Name	Date	Spending
B	Joe	12/05/2000	15
B	Joe	17/06/2000	32
B	Joe	02/09/2000	5
B	Lana	05/01/2000	13
B	Lana	06/01/2000	2
B	Steve	09/03/2000	7
B	Steve	01/07/2000	62
B	Emy	12/05/2000	5
B	Emy	02/09/2000	3
B	Emy	05/10/2000	45
B	Emy	06/10/2000	65

Dataset grouping with condition

Shop	Name	Date	Spending
A	Lana	07/03/2000	45
A	Lana	09/03/2000	32
A	Lana	02/04/2000	67
A	Steve	02/02/2000	37
B	Lana	05/01/2000	13
B	Lana	06/01/2000	2
B	Steve	09/03/2000	7
B	Steve	01/07/2000	62

tgb417 · ‎01-08-2021

@OlivierW

This feels like something that can be done with the stack recipe and 2 Join Recipe.

1. find the folks who visited both shops

First, you would join each of the datasets on the customer name to find folks who have been to both shops. This would be done with an inner join.

Join Dialog box highlighting the "Inner Join" join type.

Under selected columns keep only one of the customer name columns from one of the two data sets.

For this Join, you want to make it distinct.

Join Recipie Dialog box, showing Post Filters and the Distinct rows.

Then once you have that dataset.

2. Stack the data for both shops

This will give you the combined dataset with lots of rows you don't want.

3. Join the list of customers your want (step 1), to the stacked list of all your records (step 2)

We are now going to join the list of customers (who were at both shops) to the stack of records on the customer name.

This should also be an inner join. This type of join will drop all of the records that don't match the list of customer names who have been to both stores.

You will likely want to use the select column feature to include just the columns you care about.

Conclusion:

This approach can be generalized to many different kinds of filtering needs. Create a dataset with the unique values you need and use an inner join to pull the records you want from a dataset.

Hope that helps.

--Tom

View solution in original post

tgb417 · ‎01-08-2021

@OlivierW

This feels like something that can be done with the stack recipe and 2 Join Recipe.

1. find the folks who visited both shops

First, you would join each of the datasets on the customer name to find folks who have been to both shops. This would be done with an inner join.

Join Dialog box highlighting the "Inner Join" join type.

Under selected columns keep only one of the customer name columns from one of the two data sets.

For this Join, you want to make it distinct.

Join Recipie Dialog box, showing Post Filters and the Distinct rows.

Then once you have that dataset.

2. Stack the data for both shops

This will give you the combined dataset with lots of rows you don't want.

3. Join the list of customers your want (step 1), to the stacked list of all your records (step 2)

We are now going to join the list of customers (who were at both shops) to the stack of records on the customer name.

This should also be an inner join. This type of join will drop all of the records that don't match the list of customer names who have been to both stores.

You will likely want to use the select column feature to include just the columns you care about.

Conclusion:

This approach can be generalized to many different kinds of filtering needs. Create a dataset with the unique values you need and use an inner join to pull the records you want from a dataset.

Hope that helps.

--Tom

OlivierW · ‎01-11-2021

Hello Tom,

I tried your method, and it worked great!

Thanks for taking the time to post this detailed reply.

Olivier

Sign up to take part

Stacking datasets with a condition

Stacking datasets with a condition