Using Lag and Lead in Dataiku

I have time-series data (essentially a history recording a series of three events). I want to convert these three events into a single record using either or both a lead or lag type function and specific time conditions. Let me provide an example, where ID is the unique case number, EVENT# is which event was recorded, and DAYS is the number of days from an initial event (not recorded):
ID EVENT# DAYS
01 1 30
01 2 45
01 3 80
02 1 30
02 2 62
02 3 99
The condition I want to test for to identify a dummy variable goes something like this:
Within an ID, (if the difference in DAYS for Event = 2 compared to Event = 1 is between (1,30)) AND (if the difference in DAYS for Event = 3 compared to Event = 2 is > 30) then the dummy =1, else = 0
So in the two examples, above, ID = 01, Dummy =1 because both conditions are met (condition 1 = 15 and condition 2 = 35), but with ID = 02, Dummy =0 because the first condition was not met (condition 1 = 32, condition 2 = 37).
I can do this in SAS by working from Event = 3 and lagging once for condition 2 and using two separate lag functions for condition 1 or by separating the conditions, using the lag function, then aggregating the results, but I need to do this within Dataiku.
Best Answer
-
Hi JimCreech,
It is possible to achieve a similar result in Dataiku by going through the following steps:
- run a Window recipe using id as partitioning column, event as ordering column, and compute the lagdiff values on days
- run a Prepare recipe which will:
- run the assertions on the specific conditions you mentioned and create a dummy_tmp column we'll reuse at the final step
- filter out the records where event == 01 (since we won't need it for the aggregation)
- run a Groupby recipe to aggregate at the id level and get the minimum of the dummy_tmp column (which should be 1 if all conditions you described are verified, 0 otherwise).
You will find attached a sample DSS project with an example of implementation of those steps using the small example data mentioned in your post. If you are working on a regular basis with time series data, you may find this tutorial on the Window recipe capabilities useful.
Hope this helps !
Best,
Harizo