Submit your use case or success story to the 2023 edition of the Dataiku Frontrunner Awards

Using Lag and Lead in Dataiku

Solved!
Level 1
Using Lag and Lead in Dataiku

I have time-series data (essentially a history recording a series of three events).  I want to convert these three events into a single record using either or both a lead or lag type function and specific time conditions. Let me provide an example, where ID is the unique case number, EVENT# is which event was recorded, and DAYS is the number of days from an initial event (not recorded):

ID  EVENT#  DAYS

01  1              30

01   2              45

01   3              80

02  1               30

02   2              62

02   3              99

The condition I want to test for to identify a dummy variable goes something like this:

Within an ID, (if the difference in DAYS for Event = 2 compared to Event = 1 is between (1,30)) AND (if the difference in DAYS for Event = 3 compared to Event = 2 is > 30) then the dummy =1, else = 0

So in the two examples, above, ID = 01, Dummy =1 because both conditions are met (condition 1 = 15 and condition 2 = 35), but with ID = 02, Dummy =0 because the first condition was not met (condition 1 = 32, condition 2 = 37).

I can do this in SAS by working from Event = 3 and lagging once for condition 2 and using two separate lag functions for condition 1 or by separating the conditions, using the lag function, then aggregating the results, but I need to do this within Dataiku.

1 Solution

Hi JimCreech,

It is possible to achieve a similar result in Dataiku by going through the following steps:

• run a Window recipe using id as partitioning column, event as ordering column, and compute the lagdiff values on days
• run a Prepare recipe which will:
• run the assertions on the specific conditions you mentioned and create a dummy_tmp column we'll reuse at the final step
• filter out the records where event == 01 (since we won't need it for the aggregation)
• run a Groupby recipe to aggregate at the id level and get the minimum of the dummy_tmp column (which should be 1 if all conditions you described are verified, 0 otherwise).

You will find attached a sample DSS project with an example of implementation of those steps using the small example data mentioned in your post. If you are working on a regular basis with time series data, you may find this tutorial on the Window recipe capabilities useful.

Hope this helps !

Best,

Harizo

Hi JimCreech,

It is possible to achieve a similar result in Dataiku by going through the following steps:

• run a Window recipe using id as partitioning column, event as ordering column, and compute the lagdiff values on days
• run a Prepare recipe which will:
• run the assertions on the specific conditions you mentioned and create a dummy_tmp column we'll reuse at the final step
• filter out the records where event == 01 (since we won't need it for the aggregation)
• run a Groupby recipe to aggregate at the id level and get the minimum of the dummy_tmp column (which should be 1 if all conditions you described are verified, 0 otherwise).

You will find attached a sample DSS project with an example of implementation of those steps using the small example data mentioned in your post. If you are working on a regular basis with time series data, you may find this tutorial on the Window recipe capabilities useful.

Hope this helps !

Best,

Harizo