Find common cells in 2 different datasets

Leonardo
Leonardo Registered Posts: 14 ✭✭✭✭

Hi DSS,

I want to input 2 different datasets: dataset A of about 19.000 rows and 18 columns; and dataset B of about 4.000.000 rows and same columns as dataset A, that contains all row info from column A plus lots of additional info. All columns of A and B are the same, with column 1 containing patient codes.

I want to assign 1 in a new column (in the B dataset) for all column 1 cells of A that exist in column 1 of B; and assign 0 in the same new column for all cells that exist only in column 1 B.

Is it possible to do this with a visual recipe?

Leo


Operating system used: Windows

Best Answer

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron
    Answer ✓

    @Leonardo

    in reading your note here, I’m wondering if the following approach might work for you.

    • use the visual join recipe to do a left join
      • Use the larger dataset as the left set.
      • Left join the smaller set as the so-called right set.
    • this will leave you with a bunch of rows that don’t have matches. So empty cells in a row where you would hope to have some values from the smaller dataset
    • then use a visual prepare recipe to create a column that looks for filled in and missing values from the second set.
      • if filled in you have your 1s
      • or if not you have your 0s.

    Note this approach does have some risk of duplicating rows overall in your final set. If you have more than 1 row in the smaller set that might match a record in larger set. If this is a possibility you might first use a distinct recipe on the smaller set to make sure that the items you are matching on are unique. This should resolve this issue.

    Have fun with your project. Let us all know how you are getting on with your analysis.

Answers

  • Leonardo
    Leonardo Registered Posts: 14 ✭✭✭✭

    That's a good idea Tom, thank you. When I tried to do this, it gives me an error though (attached). It looks like it does not like my date info? Sorry, I am new to DSS and visual recipes and not sure what's wrong there.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron

    @Leonardo

    When working with dates and times in DSS you need to parse the dates and times to a standard format. This is not done automatically like MS Excel might do. You can do this date and time parsing in a visual prepare recipe before you do your join.

    this means you will need two of these prepare recipes. One for the large dataset. One fir the small dataset.

    give it a try. Let the community know how you get on with this, so that others finding this thread might learn as well.

  • Leonardo
    Leonardo Registered Posts: 14 ✭✭✭✭

    @tgb417
    ,

    Thank you Tom. Parsing the dates worked fine. I will certainly post my final processing view on the topic to help others.

    I have one last issue though. I did the left join but, I don't see my columns from the second dataset being joined. I am not sure why. I show the process in the recipe and the columns of my first dataset on the left (there are no columns from the second dataset on the right).

    Leo

  • Leonardo
    Leonardo Registered Posts: 14 ✭✭✭✭

    Just to reply to myself, I noticed that I did not ask to generate all columns for the right dataset. After editing it, it worked.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron

    @Leonardo
    ,

    Thats great news! Congratulations.

    And thank you for sharing with other community members.

Setup Info
    Tags
      Help me…