How to calculate day difference between rows with specific conditions

stephl
Level 2
How to calculate day difference between rows with specific conditions

Hello, I have a dataset like below and want to calculate the "CV" day difference since "Visit" per "ID", as the "CV_days_after_visit" in the expected result.

Is it possible to do by using the visual recipes of Dataiku (without coding)?

Existing dataset:

IDActiondate
A2AVisit2023/4/1
A2ACV2023/4/5
A2ACV2023/4/7
BB3Visit2023/5/5
BB3Visit2023/5/29
BB3CV2023/5/30
5EEVisit2023/6/1
5EECV2023/6/10

 

Expected results:

IDActiondate
CV_days_after_visit (expected result)
A2AVisit2023/4/1 
A2ACV2023/4/54
A2ACV2023/4/76
BB3Visit2023/5/5 
BB3Visit2023/5/29 
BB3CV2023/5/301
5EEVisit2023/6/1 
5EECV2023/6/109
5 Replies
AlexT
Dataiker

Hi,
You can use a Visual Window recipe :
https://knowledge.dataiku.com/latest/data-preparation/visual-recipes/concept-window-recipe.html
Make sure to parse the date first and set the order column partition column, window definition, and then Lagdiff - to days.  
Screenshot 2024-03-07 at 7.14.26 AM.png

Thanks

0 Kudos
stephl
Level 2
Author

Hi, AlexT

Thanks for your reply.

I did try the Window recipe with the same setting, but the Lagdiff is calculated based on the previous 1 row only.

As shown in my dataset, the CV and the last Visit dates are not necessarily next to each other...

 

0 Kudos
AlexT
Dataiker

Indeed, I don't see a way to do this from the Visual recipe alone. 

0 Kudos
tgb417

I might try the following, breaking down the problem into smaller more solvable steps. 

0. First I’d need to parse the dates into a form on which I can do calculations and joins

1. Use a split recipie to break up the dataset into a Visit and CV data set.

2. Find candidate visit and CV pairs.  Likely using a left join to connect the visit data set with the cv data sets using date.  However not using Equal but greater than or equal on the date.  So we are joining each visit to all CV after the Visit.  This produces a long list of possible CV for a visit.

3. Then I’d use a window function to order all of the visit CV sets, and adding a row number.  I might also use a prepare or filter recipe prior to this to remove items that are not appropriate. 

4. Then I’d filter this large set of candidate CVs to the right one.  Likely the most recent

5. Calculate the time difference.

This may not do exactly what you want to do.  The idea is break up the dataset into two data sets, Visit and CV.  Join the two datasets to find the candidate pairs.  And eliminate the ones that are not useful.  

Then stack the Visit set and this new CV data set to get your resulting set.

Once one gets it working in the broken down approach, then, I’d consider if there are ways to simplify. 

Good luck with this.  Although you are not likely to be able to do this with a single Dataiku visual recipie step.  I do think you can make the transformation you are trying to achieve.  Over the years, I’ve done a number of projects like this with Dataiku DSS.

Good luck, let us know how you get one with this.  

--Tom
0 Kudos
Turribeach

So I got it working but you need a Window Recipe and a Prepare Recipe. First the window recipe:

Screenshot 2024-03-08 at 00.06.33.png

Nothing on Aggregations as we need a Custom Aggregation:

Screenshot 2024-03-08 at 00.10.49.png

The output has now a new Visit_Date column:

Screenshot 2024-03-08 at 00.11.31.png

Finally we use a Prepare recipe with a If => Then processor to calculate the difference in days:

Screenshot 2024-03-08 at 00.12.54.png

PS: Custom Aggregations are not supported in the DSS engine so make sure you have your datasets in a SQL database.