Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Hello, I have a dataset like below and want to calculate the "CV" day difference since "Visit" per "ID", as the "CV_days_after_visit" in the expected result.
Is it possible to do by using the visual recipes of Dataiku (without coding)?
Existing dataset:
ID | Action | date |
A2A | Visit | 2023/4/1 |
A2A | CV | 2023/4/5 |
A2A | CV | 2023/4/7 |
BB3 | Visit | 2023/5/5 |
BB3 | Visit | 2023/5/29 |
BB3 | CV | 2023/5/30 |
5EE | Visit | 2023/6/1 |
5EE | CV | 2023/6/10 |
Expected results:
ID | Action | date | CV_days_after_visit (expected result) |
A2A | Visit | 2023/4/1 | |
A2A | CV | 2023/4/5 | 4 |
A2A | CV | 2023/4/7 | 6 |
BB3 | Visit | 2023/5/5 | |
BB3 | Visit | 2023/5/29 | |
BB3 | CV | 2023/5/30 | 1 |
5EE | Visit | 2023/6/1 | |
5EE | CV | 2023/6/10 | 9 |
Hi,
You can use a Visual Window recipe :
https://knowledge.dataiku.com/latest/data-preparation/visual-recipes/concept-window-recipe.html
Make sure to parse the date first and set the order column partition column, window definition, and then Lagdiff - to days.
Thanks
Hi, AlexT
Thanks for your reply.
I did try the Window recipe with the same setting, but the Lagdiff is calculated based on the previous 1 row only.
As shown in my dataset, the CV and the last Visit dates are not necessarily next to each other...
Indeed, I don't see a way to do this from the Visual recipe alone.
I might try the following, breaking down the problem into smaller more solvable steps.
0. First Iโd need to parse the dates into a form on which I can do calculations and joins
1. Use a split recipie to break up the dataset into a Visit and CV data set.
2. Find candidate visit and CV pairs. Likely using a left join to connect the visit data set with the cv data sets using date. However not using Equal but greater than or equal on the date. So we are joining each visit to all CV after the Visit. This produces a long list of possible CV for a visit.
3. Then Iโd use a window function to order all of the visit CV sets, and adding a row number. I might also use a prepare or filter recipe prior to this to remove items that are not appropriate.
4. Then Iโd filter this large set of candidate CVs to the right one. Likely the most recent
5. Calculate the time difference.
This may not do exactly what you want to do. The idea is break up the dataset into two data sets, Visit and CV. Join the two datasets to find the candidate pairs. And eliminate the ones that are not useful.
Then stack the Visit set and this new CV data set to get your resulting set.
Once one gets it working in the broken down approach, then, Iโd consider if there are ways to simplify.
Good luck with this. Although you are not likely to be able to do this with a single Dataiku visual recipie step. I do think you can make the transformation you are trying to achieve. Over the years, Iโve done a number of projects like this with Dataiku DSS.
Good luck, let us know how you get one with this.
So I got it working but you need a Window Recipe and a Prepare Recipe. First the window recipe:
Nothing on Aggregations as we need a Custom Aggregation:
The output has now a new Visit_Date column:
Finally we use a Prepare recipe with a If => Then processor to calculate the difference in days:
PS: Custom Aggregations are not supported in the DSS engine so make sure you have your datasets in a SQL database.