How to calculate day difference between rows with specific conditions
Hello, I have a dataset like below and want to calculate the "CV" day difference since "Visit" per "ID", as the "CV_days_after_visit" in the expected result.
Is it possible to do by using the visual recipes of Dataiku (without coding)?
Existing dataset:
ID | Action | date |
A2A | Visit | 2023/4/1 |
A2A | CV | 2023/4/5 |
A2A | CV | 2023/4/7 |
BB3 | Visit | 2023/5/5 |
BB3 | Visit | 2023/5/29 |
BB3 | CV | 2023/5/30 |
5EE | Visit | 2023/6/1 |
5EE | CV | 2023/6/10 |
Expected results:
ID | Action | date | CV_days_after_visit (expected result) |
A2A | Visit | 2023/4/1 | |
A2A | CV | 2023/4/5 | 4 |
A2A | CV | 2023/4/7 | 6 |
BB3 | Visit | 2023/5/5 | |
BB3 | Visit | 2023/5/29 | |
BB3 | CV | 2023/5/30 | 1 |
5EE | Visit | 2023/6/1 | |
5EE | CV | 2023/6/10 | 9 |
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi,
You can use a Visual Window recipe :
https://knowledge.dataiku.com/latest/data-preparation/visual-recipes/concept-window-recipe.html
Make sure to parse the date first and set the order column partition column, window definition, and then Lagdiff - to days.
Thanks -
stephl Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 8 ✭
Hi, AlexT
Thanks for your reply.
I did try the Window recipe with the same setting, but the Lagdiff is calculated based on the previous 1 row only.
As shown in my dataset, the CV and the last Visit dates are not necessarily next to each other...
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Indeed, I don't see a way to do this from the Visual recipe alone.
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
I might try the following, breaking down the problem into smaller more solvable steps.
0. First I’d need to parse the dates into a form on which I can do calculations and joins
1. Use a split recipie to break up the dataset into a Visit and CV data set.
2. Find candidate visit and CV pairs. Likely using a left join to connect the visit data set with the cv data sets using date. However not using Equal but greater than or equal on the date. So we are joining each visit to all CV after the Visit. This produces a long list of possible CV for a visit.
3. Then I’d use a window function to order all of the visit CV sets, and adding a row number. I might also use a prepare or filter recipe prior to this to remove items that are not appropriate.
4. Then I’d filter this large set of candidate CVs to the right one. Likely the most recent
5. Calculate the time difference.
This may not do exactly what you want to do. The idea is break up the dataset into two data sets, Visit and CV. Join the two datasets to find the candidate pairs. And eliminate the ones that are not useful.
Then stack the Visit set and this new CV data set to get your resulting set.Once one gets it working in the broken down approach, then, I’d consider if there are ways to simplify.
Good luck with this. Although you are not likely to be able to do this with a single Dataiku visual recipie step. I do think you can make the transformation you are trying to achieve. Over the years, I’ve done a number of projects like this with Dataiku DSS.Good luck, let us know how you get one with this.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,124 Neuron
So I got it working but you need a Window Recipe and a Prepare Recipe. First the window recipe:
Nothing on Aggregations as we need a Custom Aggregation:
The output has now a new Visit_Date column:
Finally we use a Prepare recipe with a If => Then processor to calculate the difference in days:
PS: Custom Aggregations are not supported in the DSS engine so make sure you have your datasets in a SQL database.