Ready for Dataiku 10? Try out the Crash Course on new features!GET STARTED

Find common cells in 2 different datasets

Solved!
Leonardo
Level 2
Find common cells in 2 different datasets

Hi DSS,

I want to input 2 different datasets: dataset A of about 19.000 rows and 18 columns; and dataset B of about 4.000.000 rows and same columns as dataset A, that contains all row info from column A plus lots of additional info. All columns of A and B are the same, with column 1 containing patient codes. 

I want to assign 1 in a new column (in the B dataset) for all column 1 cells of A that exist in column 1 of B; and assign 0 in the same new column for all cells that exist only in column 1 B.  

Is it possible to do this with a visual recipe?

Leo


Operating system used: Windows

0 Kudos
1 Solution
tgb417
Neuron
Neuron

@Leonardo 

in reading your note here, I’m wondering if the following approach might work for you.

  • use the visual join recipe to do a left join
    • Use the larger dataset as the left set.
    • Left join the smaller set as the so-called right set. 
  • this will leave you with a bunch of rows that don’t have matches.  So empty cells in a row where you would hope to have some values from the smaller dataset
  • then use a visual prepare recipe to create a column that looks for filled in and missing values from the second set.
    • if filled in you have your 1s
    • or if not you have your 0s.

Note this approach does have some risk of duplicating rows overall in your final set.  If you have more than 1 row in the smaller set that might match a record in larger set.  If this is a possibility you might first use a distinct recipe on the smaller set to make sure that the items you are matching on are unique.  This should resolve this issue.

Have fun with your project.  Let us all know how you are getting on with your analysis.   

--Tom

View solution in original post

6 Replies
tgb417
Neuron
Neuron

@Leonardo 

in reading your note here, I’m wondering if the following approach might work for you.

  • use the visual join recipe to do a left join
    • Use the larger dataset as the left set.
    • Left join the smaller set as the so-called right set. 
  • this will leave you with a bunch of rows that don’t have matches.  So empty cells in a row where you would hope to have some values from the smaller dataset
  • then use a visual prepare recipe to create a column that looks for filled in and missing values from the second set.
    • if filled in you have your 1s
    • or if not you have your 0s.

Note this approach does have some risk of duplicating rows overall in your final set.  If you have more than 1 row in the smaller set that might match a record in larger set.  If this is a possibility you might first use a distinct recipe on the smaller set to make sure that the items you are matching on are unique.  This should resolve this issue.

Have fun with your project.  Let us all know how you are getting on with your analysis.   

--Tom
Leonardo
Level 2
Author

That's a good idea Tom, thank you. When I tried to do this, it gives me an error though (attached). It looks like it does not like my date info? Sorry, I am new to DSS and visual recipes and not sure what's wrong there. 

0 Kudos
tgb417
Neuron
Neuron

@Leonardo 

When working with dates and times in DSS you need to parse the dates and times to a standard format.  This is not done automatically like MS Excel might do.  You can do this date and time parsing in a visual prepare recipe before you do your join.

this means you will need two of these prepare recipes.  One for the large dataset.  One fir the small dataset.   

give it a try.  Let the community know how you get on with this, so that others finding this thread might learn as well. 

--Tom
Leonardo
Level 2
Author

@tgb417 ,

Thank you Tom. Parsing the dates worked fine. I will certainly post my final processing view on the topic to help others. 

I have one last issue though. I did the left join but, I don't see my columns from the second dataset being joined. I am not sure why. I show the process in the recipe and the columns of my first dataset on the left (there are no columns from the second dataset on the right). 

Leo

0 Kudos
Leonardo
Level 2
Author

Just to reply to myself, I noticed that I did not ask to generate all columns for the right dataset. After editing it, it worked. 

tgb417
Neuron
Neuron

@Leonardo ,

Thats great news!  Congratulations.

And thank you for sharing with other community members.

 

--Tom
0 Kudos

Setup info

?
A banner prompting to get Dataiku DSS