Forcing a join to select FIRST entry when multiple entries exist

SeparateFiction
SeparateFiction Registered Posts: 7 ✭✭✭✭

Hello - I'm looking for some more recipe help please!

I can't find a way within the visual recipe component of a join to force Dataiku to pick up the first entry where multiple entries exist.

An example is where I have an account number which is looking up to another table to determine the BUSINESS_DEFINITION. Now the table that I am looking up against might have multiple entries for the same account number with different business definitions. So a normal join will bring them all in as technically all of them are unique.

In my previous Access database, I am able to add the "FIRST" command to the select statement. This will only bring through the first business definition it finds in the lookup table. Example below:

SELECT
account.AccountID,
account.Account_Category,
First(account.Bsns_Defn) As BSNS_DEFN

Given that I can't seem t find this option in the visual recipe component, I was hopnig that I could convert to SparkSQL and just pop in the First command - but this just won't work, it just keeps throwing errors about aggregating and windowing that don't resonate with me.

Any thoughts, ideas or suggestions on how I can force Dataiku to match the fist found record?

Thanks :O)

Answers

  • JvK
    JvK Registered Posts: 1 ✭✭✭✭
    edited July 17

    Hi,

    to perform a "select FIRST entry when multiple entries exist" what I do is use an R code recipe and use the plyr library which has a lookup function with an option match="first", which stops on the first match.
    e.g. :

    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    library(dataiku)
    
    # Recipe inputs
    table1 <- dkuReadDataset("table1", samplingMethod="full")
    table2 <- dkuReadDataset("table2", samplingMethod="full")
    
    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Compute recipe outputs
    # TODO: Write here your actual code that computes the outputs
    library(plyr)
    
    ## inner join
    finaltable <- join(table1, table2, by = "lookup", type = "left", match = "first")
    
    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Recipe outputs
    dkuWriteDataset(finaltable")

  • jereze
    jereze Alpha Tester, Dataiker Alumni Posts: 190 ✭✭✭✭✭✭✭✭

    Hi,

    Using visual recipes, you can:

    • first, use the "distinct" recipe to get the value you want to keep in the join
    • then, use the "join" recipe
  • SeparateFiction
    SeparateFiction Registered Posts: 7 ✭✭✭✭

    I tried something similar, I think.

    Distinct recipe whilst excluding the BUSINESS_DEFINITON column gives me my distinct account numbers, perfect.

    There is no way to join back to the dataset with just one business definition per account though. A subsequent join brings back all the rows I had to start with.

    Does this make sense? Maybe I need to try and draw something..

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @SeparateFiction

    Several thoughts come to mind.

    related to

    I can't find a way within the visual recipe component of a join to force Dataiku to pick up the first entry where multiple entries exist.

    If you want to go completely visual recipe. Let's break the problem down into several steps.

    Finding the record we need.

    You might try a window recipe to find the first record for each grouping. In the example below, I'm using COVID-19 data and looking for the first row in each country ordered by date. In this case, in Windows Definitions, you would.

    Window Recipie.jpg

    Then choose to add a column that counts from 1-n within each group on the aggregations' screen you want to add the Rank to each row in a partition group. In this case, starting with the earliest data getting the rank 1. The key in your case is that the #1 rank for each partition grouping should be the record you want to join on.

    Window Part 2.jpg

    This will give you a new rank column with the #1 being the first row in the group.

    Showing the Rank2.jpg

    Joining just the record you need.

    Then in the Join Recipe, you can simply filter so that the Join works on only the #1 rows in each group.

    Join Filter.jpg

    There are other coding and maybe visual recipe ways to do this but I hope this helps.

    --Tom

Setup Info
    Tags
      Help me…