How to calculate day difference between rows with specific conditions

stephl
stephl Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 8

Hello, I have a dataset like below and want to calculate the "CV" day difference since "Visit" per "ID", as the "CV_days_after_visit" in the expected result.

Is it possible to do by using the visual recipes of Dataiku (without coding)?

Existing dataset:

IDActiondate
A2AVisit2023/4/1
A2ACV2023/4/5
A2ACV2023/4/7
BB3Visit2023/5/5
BB3Visit2023/5/29
BB3CV2023/5/30
5EEVisit2023/6/1
5EECV2023/6/10

Expected results:

IDActiondate
CV_days_after_visit (expected result)
A2AVisit2023/4/1
A2ACV2023/4/54
A2ACV2023/4/76
BB3Visit2023/5/5
BB3Visit2023/5/29
BB3CV2023/5/301
5EEVisit2023/6/1
5EECV2023/6/109

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker

    Hi,
    You can use a Visual Window recipe :
    https://knowledge.dataiku.com/latest/data-preparation/visual-recipes/concept-window-recipe.html
    Make sure to parse the date first and set the order column partition column, window definition, and then Lagdiff - to days.
    Screenshot 2024-03-07 at 7.14.26 AM.png

    Thanks

  • stephl
    stephl Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 8

    Hi, AlexT

    Thanks for your reply.

    I did try the Window recipe with the same setting, but the Lagdiff is calculated based on the previous 1 row only.

    As shown in my dataset, the CV and the last Visit dates are not necessarily next to each other...

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker

    Indeed, I don't see a way to do this from the Visual recipe alone.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    I might try the following, breaking down the problem into smaller more solvable steps.

    0. First I’d need to parse the dates into a form on which I can do calculations and joins

    1. Use a split recipie to break up the dataset into a Visit and CV data set.

    2. Find candidate visit and CV pairs. Likely using a left join to connect the visit data set with the cv data sets using date. However not using Equal but greater than or equal on the date. So we are joining each visit to all CV after the Visit. This produces a long list of possible CV for a visit.

    3. Then I’d use a window function to order all of the visit CV sets, and adding a row number. I might also use a prepare or filter recipe prior to this to remove items that are not appropriate.

    4. Then I’d filter this large set of candidate CVs to the right one. Likely the most recent

    5. Calculate the time difference.

    This may not do exactly what you want to do. The idea is break up the dataset into two data sets, Visit and CV. Join the two datasets to find the candidate pairs. And eliminate the ones that are not useful.

    Then stack the Visit set and this new CV data set to get your resulting set.

    Once one gets it working in the broken down approach, then, I’d consider if there are ways to simplify.

    Good luck with this. Although you are not likely to be able to do this with a single Dataiku visual recipie step. I do think you can make the transformation you are trying to achieve. Over the years, I’ve done a number of projects like this with Dataiku DSS.

    Good luck, let us know how you get one with this.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,124 Neuron

    So I got it working but you need a Window Recipe and a Prepare Recipe. First the window recipe:

    Screenshot 2024-03-08 at 00.06.33.png

    Nothing on Aggregations as we need a Custom Aggregation:

    Screenshot 2024-03-08 at 00.10.49.png

    The output has now a new Visit_Date column:

    Screenshot 2024-03-08 at 00.11.31.png

    Finally we use a Prepare recipe with a If => Then processor to calculate the difference in days:

    Screenshot 2024-03-08 at 00.12.54.png

    PS: Custom Aggregations are not supported in the DSS engine so make sure you have your datasets in a SQL database.

Setup Info
    Tags
      Help me…