Forcing a join to select FIRST entry when multiple entries exist

SeparateFiction
Level 2
Forcing a join to select FIRST entry when multiple entries exist

Hello - I'm looking for some more recipe help please!

I can't find a way within the visual recipe component of a join to force Dataiku to pick up the first entry where multiple entries exist.

An example is where I have an account number which is looking up to another table to determine the BUSINESS_DEFINITION. Now the table that I am looking up against might have multiple entries for the same account number with different business definitions. So a normal join will bring them all in as technically all of them are unique.

In my previous Access database, I am able to add the "FIRST" command to the select statement. This will only bring through the first business definition it finds in the lookup table. Example below:

SELECT
account.AccountID, 
account.Account_Category,
First(account.Bsns_Defn) As BSNS_DEFN

 

Given that I can't seem t find this option in the visual recipe component, I was hopnig that I could convert to SparkSQL and just pop in the First command - but this just won't work, it just keeps throwing errors about aggregating and windowing that don't resonate with me.

Any thoughts, ideas or suggestions on how I can force Dataiku to match the fist found record?

Thanks :O)

0 Kudos
4 Replies
JvK
Level 1

Hi,

to perform a "select FIRST entry when multiple entries exist" what I do is use an R code recipe and use the plyr library which has a lookup function with an option match="first", which stops on the first match.
e.g. : 

 

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
library(dataiku)

# Recipe inputs
table1 <- dkuReadDataset("table1", samplingMethod="full")
table2 <- dkuReadDataset("table2", samplingMethod="full")

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Compute recipe outputs
# TODO: Write here your actual code that computes the outputs
library(plyr)

## inner join
finaltable <- join(table1, table2, by = "lookup", type = "left", match = "first")

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Recipe outputs
dkuWriteDataset(finaltable")

 

 

jereze
Community Manager
Community Manager

Hi,

Using visual recipes, you can:

  • first, use the "distinct" recipe to get the value you want to keep in the join
  • then, use the "join" recipe
Jeremy, Product Manager at Dataiku
0 Kudos
SeparateFiction
Level 2
Author

I tried something similar, I think.

Distinct recipe whilst excluding the BUSINESS_DEFINITON column gives me my distinct account numbers, perfect.

There is no way to join back to the dataset with just one business definition per account though. A subsequent join brings back all the rows I had to start with.

Does this make sense? Maybe I need to try and draw something..

 

 

 

0 Kudos
tgb417

@SeparateFiction 

Several thoughts come to mind.

related to 

I can't find a way within the visual recipe component of a join to force Dataiku to pick up the first entry where multiple entries exist.

If you want to go completely visual recipe.  Let's break the problem down into several steps.

Finding the record we need.

You might try a window recipe to find the first record for each grouping.  In the example below, I'm using COVID-19 data and looking for the first row in each country ordered by date.  In this case, in Windows Definitions, you would.

Window Recipie.jpg

 

Then choose to add a column that counts from 1-n within each group on the aggregations' screen you want to add the Rank to each row in a partition group.  In this case, starting with the earliest data getting the rank 1.  The key in your case is that the #1 rank for each partition grouping should be the record you want to join on.

Window Part 2.jpg

This will give you a new rank column with the #1 being the first row in the group.

Showing the Rank2.jpg

 

 

Joining just the record you need.

Then in the Join Recipe, you can simply filter so that the Join works on only the #1 rows in each group.

Join Filter.jpg

There are other coding and maybe visual recipe ways to do this but I hope this helps.

--Tom

--Tom