Stacking datasets with a condition

OlivierW
OlivierW Registered Posts: 22 ✭✭✭✭

Hello,

I have been using a stacking recipe to group two datasets which have the same columns names and it's working fine.

But now, I would like to keep only certain lines of the second dataset : those for which the value in one of the column matches a value present in the same column of the first dataset.

In the example below, I would like to make a single dataset grouping records of shops A and B, but keeping for shop B only the customers who have been to shop A. I show below the two input datasets and the output I wish to obtain.

Does anyone knows how to use dataiku recipes to get this result?

Thanks in advance for your help!

Olivier

Dataset 1 :

ShopNameDateSpending
ALana07/03/200045
ALana09/03/200032
ALana02/04/200067
ASteve02/02/200037

Dataset 2

ShopNameDateSpending
BJoe12/05/200015
BJoe17/06/200032
BJoe02/09/20005
BLana05/01/200013
BLana06/01/20002
BSteve09/03/20007
BSteve01/07/200062
BEmy12/05/20005
BEmy02/09/20003
BEmy05/10/200045
BEmy06/10/200065

Dataset grouping with condition

ShopNameDateSpending
ALana07/03/200045
ALana09/03/200032
ALana02/04/200067
ASteve02/02/200037
BLana05/01/200013
BLana06/01/20002
BSteve09/03/20007
BSteve01/07/200062

Best Answer

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
    Answer ✓

    @OlivierW

    This feels like something that can be done with the stack recipe and 2 Join Recipe.

    1. find the folks who visited both shops

    First, you would join each of the datasets on the customer name to find folks who have been to both shops. This would be done with an inner join.

    Inner Join Join type.jpg

    Under selected columns keep only one of the customer name columns from one of the two data sets.

    For this Join, you want to make it distinct.

    Distince Joins.jpg

    Then once you have that dataset.

    2. Stack the data for both shops

    This will give you the combined dataset with lots of rows you don't want.

    3. Join the list of customers your want (step 1), to the stacked list of all your records (step 2)

    We are now going to join the list of customers (who were at both shops) to the stack of records on the customer name.

    This should also be an inner join. This type of join will drop all of the records that don't match the list of customer names who have been to both stores.

    You will likely want to use the select column feature to include just the columns you care about.

    Conclusion:

    This approach can be generalized to many different kinds of filtering needs. Create a dataset with the unique values you need and use an inner join to pull the records you want from a dataset.

    Hope that helps.

Answers

Setup Info
    Tags
      Help me…