Checking activity conflict over multiple sources
Dear Dataiku Community,
I have a workflow where I have data coming from different sources, with multiple activities ( activities starting on "x" date, and taking "y" hours)
I am trying to set a new conflict column that gives flag ( True) when the same activity from different sources falls within the same period,
In short:
Here is an example of what I want to achieve:
This is my data:
Source | ACTIVITY | DATE_parsed | Hours |
Source3 | CCC | 2023-08-31T04:00:00.000Z | 10 |
Source3 | BBB | 2023-10-15T04:00:00.000Z | 6 |
Source3 | AAA | 2023-08-17T04:00:00.000Z | 6 |
Source3 | BBB | 2023-08-31T04:00:00.000Z | 6 |
Source3 | AAA | 2023-08-3T04:00:00.000Z | 5 |
Source1 | CCC | 2023-08-31T06:00:00.000Z | 10 |
Source1 | BBB | 2023-10-31T12:00:00.000Z | 6 |
Source1 | AAA | 2023-08-17T04:00:00.000Z | 2 |
Source1 | BBB | 2023-08-5T22:00:00.000Z | 6 |
Source1 | AAA | 2023-08-2T04:00:00.000Z | 6 |
Source1 | BBB | 2023-08-31T04:00:00.000Z | 6 |
Source1 | AAA | 2023-08-31T04:00:00.000Z | 6 |
Source5 | BBB | 2023-08-6T02:00:00.000Z | 6 |
Source1 | AAA | 2023-08-22T04:00:00.000Z | 6 |
Source1 | BBB | 2023-08-21T04:00:00.000Z | 6 |
Source1 | AAA | 2023-08-11T04:00:00.000Z | 6 |
Source1 | BBB | 2023-08-10T9:00:00.000Z | 6 |
Source1 | AAA | 2023-08-3T04:00:00.000Z | 6 |
Source4 | CCC | 2023-08-1T04:00:00.000Z | 5 |
I need to create a new column based on rules above so that this is achieved:
Source | ACTIVITY | DATE_parsed | Hours | |
Source3 | CCC | 2023-08-31T04:00:00.000Z | 10 | Possible Conflict with Source 1 |
Source3 | BBB | 2023-10-15T04:00:00.000Z | 6 | None |
Source3 | AAA | 2023-08-17T04:00:00.000Z | 6 | Possibel Conflict with Soucre 1 |
Source3 | BBB | 2023-08-31T04:00:00.000Z | 6 | None |
Source3 | AAA | 2023-08-3T04:00:00.000Z | 5 | None |
Source1 | CCC | 2023-08-31T06:00:00.000Z | 10 | Possible Conflict with Source 3 |
Source1 | BBB | 2023-10-31T12:00:00.000Z | 6 | None |
Source1 | AAA | 2023-08-17T04:00:00.000Z | 2 | Possible Conflict with Source 3 |
Source1 | BBB | 2023-08-5T22:00:00.000Z | 6 | Possible Conflict with Source 5 |
Source1 | AAA | 2023-08-2T04:00:00.000Z | 6 | None |
Source1 | BBB | 2023-08-31T04:00:00.000Z | 6 | None |
Source1 | AAA | 2023-08-31T04:00:00.000Z | 6 | None |
Source5 | BBB | 2023-08-6T02:00:00.000Z | 6 | Possible Conflict with Source 1 |
Source1 | AAA | 2023-08-22T04:00:00.000Z | 6 | None |
Source1 | BBB | 2023-08-21T04:00:00.000Z | 6 | None |
Source1 | AAA | 2023-08-11T04:00:00.000Z | 6 | None |
Source1 | BBB | 2023-08-10T9:00:00.000Z | 6 | None |
Source1 | AAA | 2023-08-3T04:00:00.000Z | 6 | None |
Source4 | CCC | 2023-08-1T04:00:00.000Z | 5 | None |
Kindly, is there any advice on which tool/ recipe can help me achieve this ?
even if python code recipe -pandas data frame, , knowing that I can not use SQL due to my data source,
Thank you,
Operating system used: Centos 7.9
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,990 Neuron
Hi, your second table is the same as the first, at least it looks like that to me. In any case I think I understand what you want, you basically want to check for overlapping time ranges. Here is a suggestion on how to do that in Python:
https://stackoverflow.com/questions/35644301/checking-two-time-intervals-are-overlapping-or-not
Personally I don't like that solution since it will require a row by row comparasion in a for loop which will be pretty slow I think, unless your dataset is pretty small. I think this could be done much easier with Window recipe using the Lag function to compare the current row against the previous row. First you need to have all your date/time fields for your activities as date_time_from and date_time_to, you should be able to achieve that using a Prepare recipe by adding Hours to Date_parsed. Once you do that you then do the Window recipe and partition by Activity and Sort by date_time_from ascending and get the lag of both date_time_from and date_time_to. Then a simple If formula should do it (this pseudo code):
if date_time_from <= previous_date_time_from then
overlap_flag = "Yes"
else if date_time_to >= previous_date_time_to then
overlap_flag = "Yes"
end if
Note this solution only checks for overlaps against the previous row. But I think it should be good enough. Otherwise please clarify exactly how you need it to work and why. Thanks
-
Maryam Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 7 Partner
Hello Turribeach,
Thank you for your reply,The second table has an additional column at the end of expected result (or at least True for overlap /False)
I have been trying to check the solutions you suggested,however, there is one issue:
Data is coming from multiple sources , ( as in, facility 1 is doing activity B, Facility 2 is doing activity A)
the idea is that, the time stamp is the same ( start time) end time is different depending on how long an activity could take ( start time + hours)I do not need flag to happen for same service if it coming from the same source (facility) at the same time,
Only if there is an overlap in timestamp of that same activity between multiple different facilities,
I tried to do it in python,
But it is not giving me the required resultI also added Postgres database & converted my data to that t get sql solutions flexibility/ faster
but I couldn't achieve row iterations in it