Checking activity conflict over multiple sources

manssari
Level 2
Checking activity conflict over multiple sources

Dear Dataiku Community,

I have a workflow where I have data coming from different sources, with multiple activities ( activities starting on "x" date, and taking "y" hours)
I am trying to set a new conflict column that gives flag ( True) when the same activity from different sources falls within the same period,
In short:

Here is an example of what I want to achieve:
This is my data:

SourceACTIVITYDATE_parsedHours
Source3CCC2023-08-31T04:00:00.000Z10
Source3BBB2023-10-15T04:00:00.000Z6
Source3AAA2023-08-17T04:00:00.000Z6
Source3BBB2023-08-31T04:00:00.000Z6
Source3AAA2023-08-3T04:00:00.000Z5
Source1CCC2023-08-31T06:00:00.000Z10
Source1BBB2023-10-31T12:00:00.000Z6
Source1AAA2023-08-17T04:00:00.000Z2
Source1BBB2023-08-5T22:00:00.000Z6
Source1AAA2023-08-2T04:00:00.000Z6
Source1BBB2023-08-31T04:00:00.000Z6
Source1AAA2023-08-31T04:00:00.000Z6
Source5BBB2023-08-6T02:00:00.000Z6
Source1AAA2023-08-22T04:00:00.000Z6
Source1BBB2023-08-21T04:00:00.000Z6
Source1AAA2023-08-11T04:00:00.000Z6
Source1BBB2023-08-10T9:00:00.000Z6
Source1AAA2023-08-3T04:00:00.000Z6
Source4CCC2023-08-1T04:00:00.000Z5


I need to create a new column based on rules above so that this is achieved:

SourceACTIVITYDATE_parsedHours 
Source3CCC2023-08-31T04:00:00.000Z10Possible Conflict with Source 1
Source3BBB2023-10-15T04:00:00.000Z6None
Source3AAA2023-08-17T04:00:00.000Z6Possibel Conflict with Soucre 1
Source3BBB2023-08-31T04:00:00.000Z6None
Source3AAA2023-08-3T04:00:00.000Z5None
Source1CCC2023-08-31T06:00:00.000Z10Possible Conflict with Source 3
Source1BBB2023-10-31T12:00:00.000Z6None
Source1AAA2023-08-17T04:00:00.000Z2Possible Conflict with Source 3
Source1BBB2023-08-5T22:00:00.000Z6Possible Conflict with Source 5
Source1AAA2023-08-2T04:00:00.000Z6None
Source1BBB2023-08-31T04:00:00.000Z6None
Source1AAA2023-08-31T04:00:00.000Z6None
Source5BBB2023-08-6T02:00:00.000Z6Possible Conflict with Source 1
Source1AAA2023-08-22T04:00:00.000Z6None
Source1BBB2023-08-21T04:00:00.000Z6None
Source1AAA2023-08-11T04:00:00.000Z6None
Source1BBB2023-08-10T9:00:00.000Z6None
Source1AAA2023-08-3T04:00:00.000Z6None
Source4CCC2023-08-1T04:00:00.000Z5None

 

Kindly, is there any advice on which tool/ recipe can help me achieve this ?
even if python code recipe -pandas data frame, , knowing that I can not use SQL due to my data source,

Thank you,


Operating system used: Centos 7.9

0 Kudos
2 Replies
Turribeach

Hi, your second table is the same as the first, at least it looks like that to me. In any case I think I understand what you want, you basically want to check for overlapping time ranges. Here is a suggestion on how to do that in Python:

https://stackoverflow.com/questions/35644301/checking-two-time-intervals-are-overlapping-or-not

Personally I don't like that solution since it will require a row by row comparasion in a for loop which will be pretty slow I think, unless your dataset is pretty small. I think this could be done much easier with Window recipe using the Lag function to compare the current row against the previous row. First you need to have all your date/time fields for your activities as date_time_from and date_time_to, you should be able to achieve that using a Prepare recipe by adding Hours to Date_parsed. Once you do that you then do the Window recipe and partition by Activity and Sort by date_time_from ascending and get the lag of both date_time_from and date_time_to. Then a simple If formula should do it (this pseudo code):

if date_time_from <= previous_date_time_from then

   overlap_flag = "Yes"

else if date_time_to >= previous_date_time_to then

   overlap_flag = "Yes"

end if

Note this solution only checks for overlaps against the previous row. But I think it should be good enough. Otherwise please clarify exactly how you need it to work and why. Thanks

0 Kudos
manssari
Level 2
Author

Hello Turribeach,


Thank you for your reply,

The second table has an additional column at the end of expected result (or at least True for overlap /False)
I have been trying to check the solutions you suggested, 

however, there is one issue:
Data is coming from multiple sources , ( as in, facility 1 is doing activity B, Facility 2 is doing activity A)
the idea is that, the time stamp is the same ( start time) end time is different depending on how long an activity could take ( start time + hours)

I do not need flag to happen for same service if it coming from the same source (facility) at the same time,

Only if there is an overlap in timestamp of that same activity between multiple different facilities,

I tried to do it in python,

But it is not giving me the required result

I also added Postgres database & converted my data to that t get sql solutions flexibility/ faster

but I couldn't achieve row iterations in it


 

0 Kudos