Using Lag and Lead in Dataiku

JimCreech Registered Posts: 5 ✭✭✭

I have time-series data (essentially a history recording a series of three events). I want to convert these three events into a single record using either or both a lead or lag type function and specific time conditions. Let me provide an example, where ID is the unique case number, EVENT# is which event was recorded, and DAYS is the number of days from an initial event (not recorded):


01 1 30

01 2 45

01 3 80

02 1 30

02 2 62

02 3 99

The condition I want to test for to identify a dummy variable goes something like this:

Within an ID, (if the difference in DAYS for Event = 2 compared to Event = 1 is between (1,30)) AND (if the difference in DAYS for Event = 3 compared to Event = 2 is > 30) then the dummy =1, else = 0

So in the two examples, above, ID = 01, Dummy =1 because both conditions are met (condition 1 = 15 and condition 2 = 35), but with ID = 02, Dummy =0 because the first condition was not met (condition 1 = 32, condition 2 = 37).

I can do this in SAS by working from Event = 3 and lagging once for condition 2 and using two separate lag functions for condition 1 or by separating the conditions, using the lag function, then aggregating the results, but I need to do this within Dataiku.

Best Answer

  • HarizoR
    HarizoR Dataiker, Alpha Tester, Registered Posts: 138 Dataiker
    Answer ✓

    Hi JimCreech,

    It is possible to achieve a similar result in Dataiku by going through the following steps:

    • run a Window recipe using id as partitioning column, event as ordering column, and compute the lagdiff values on days
    • run a Prepare recipe which will:
      • run the assertions on the specific conditions you mentioned and create a dummy_tmp column we'll reuse at the final step
      • filter out the records where event == 01 (since we won't need it for the aggregation)
    • run a Groupby recipe to aggregate at the id level and get the minimum of the dummy_tmp column (which should be 1 if all conditions you described are verified, 0 otherwise).

    You will find attached a sample DSS project with an example of implementation of those steps using the small example data mentioned in your post. If you are working on a regular basis with time series data, you may find this tutorial on the Window recipe capabilities useful.

    Hope this helps !



Setup Info
      Help me…