Detecting and Handling Long-Running Scenarios in Dataiku 12.5

Grixis · ‎03-15-2024

Hello Dataiku Community,

I am currently working with Dataiku 12.5 and seeking advice on how to effectively manage long-running scenarios or abnormal build times within my projects on the Automation node. The core of my inquiry is how to detect scenarios whose execution time exceeds a certain threshold, since these can significantly impact resources and operational efficiency.

I am looking for a method or best practices within Dataiku 12.5 that would allow me to automatically detect such scenarios. Upon detection, I would like to have the option either to terminate the scenario automatically or to send an alert to the administrators for immediate action.

Specifically, I am interested in:

Detection Mechanism: Today, by tinkering with the Python API, I can retrieve project processing times, store them in variables and use them as benchmarks. But in my opinion this is still a rather unorthodox and fragile approach.
Automated Response: If scenarios can be used for this directly, is there an existing built-in step that evaluates the processing time and then delays or controls the run, or does it have to work the other way round? In the meantime, I've tried to do this by creating a companion scenario, alongside my second scenario that returns the variables, to monitor the project's production runs.
Alert System: If opting for alerts, what is the best way to configure them in Dataiku 12.5? Can alerts be customized to provide detailed information about the scenario in question, including its execution time and potential impact on the system?

Thank you for your time,

Martin

Turribeach · ‎03-15-2024

There is no built-in way to do what you want, so you will need to build your own solution to fit your needs. The Dataiku API is the way to go. Personally I would prefer something that runs outside of the scenario itself and just monitors and alerts on scenarios suspected of running longer than expected. But below is a sample of how you could control a build step from within the step itself:

https://developer.dataiku.com/latest/concepts-and-examples/scenarios.html#define-scenario-timeout
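
The idea behind that kind of timeout is to start the build asynchronously and poll it against a deadline. Below is a minimal sketch of just the polling logic, not the code from the linked page. `is_done` and `abort` are hypothetical callables that you would wire to the real Dataiku step handle as shown in the documentation; `clock` and `sleep` are only parameters so the loop can be tried out without waiting.

```python
import time


def wait_with_timeout(is_done, abort, timeout_seconds, poll_seconds=10,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll is_done() until it returns True or the deadline passes.

    Returns True if the work finished in time, False if abort() was called.
    is_done and abort are hypothetical callables supplied by the caller;
    they stand in for the Dataiku-specific calls, which are not shown here.
    """
    start = clock()
    while not is_done():
        if clock() - start > timeout_seconds:
            abort()          # ask the caller-supplied hook to stop the work
            return False
        sleep(poll_seconds)  # avoid hammering the backend between checks
    return True
```

Inside a custom Python scenario step you would typically raise an exception when this returns False, so that the scenario run is marked as failed and its reporters fire.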

But I wouldn't go as far as aborting steps or scenarios; there is just too much uncertainty, in my view, in using a linear estimation model to predict the processing times of complex data flows. In a lot of cases scenarios overrun previous run times for external reasons (more data, source systems being too busy, network congestion, database locks, etc.), and let's not forget the internal reasons: DSS itself being too busy, new additional processing changes in the flow, bad user code, additional data causing exponential performance degradation, etc.
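
Given that uncertainty, a threshold used only for alerting can be kept deliberately loose. Here is a rough sketch, assuming you have already collected past run durations in seconds (how you collect them is up to you). The function name, the factor of 2 and the minimum of 5 runs are arbitrary illustrative choices, not Dataiku settings.

```python
from statistics import median


def is_suspicious(elapsed_seconds, past_durations, factor=2.0, min_runs=5):
    """Return True when a run has taken much longer than its usual duration.

    Uses the median of past durations so that one slow historical run
    does not inflate the baseline. With too little history it never flags.
    """
    if len(past_durations) < min_runs:
        return False  # not enough history to judge
    return elapsed_seconds > factor * median(past_durations)
```

This only tells you a run is unusual. It says nothing about why, which is the reason I would alert a human rather than abort.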

Regarding alerts, these can obviously be customised to anything you want. In general the easiest way to send scenario alerts is to use a Reporter within the scenario, but that runs within the scenario itself, so it's not a good solution for you, as you will most likely be monitoring scenarios from outside. Below is a solution using email:

https://community.dataiku.com/t5/Using-Dataiku/Is-it-possible-to-get-notified-when-a-job-exceeds-a-p...

But the same approach can also be used to send notifications to other Dataiku-supported notification channels such as Slack, Microsoft Teams, Webhook and Twilio reporters.
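
For the monitoring side, here is a sketch of what an external check could look like once you have the list of running scenarios. It assumes your own code has already built `running` from the Dataiku public API as a list of dicts with `project`, `scenario` and `started` keys; that structure, the function names and the 2-hour default limit are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def format_alert(run, elapsed, limit):
    """Build a one-line alert with the details an admin needs."""
    return "Scenario {}.{} has been running for {} min (limit {} min), started {}".format(
        run["project"], run["scenario"],
        int(elapsed.total_seconds() // 60),
        int(limit.total_seconds() // 60),
        run["started"].isoformat(),
    )


def find_overruns(running, now, limits, default_limit=timedelta(hours=2)):
    """Return alert messages for runs that exceeded their time limit.

    running: list of dicts with 'project', 'scenario' and 'started'
             (timezone-aware datetime); an assumed structure, see above.
    limits:  dict mapping (project, scenario) to a timedelta.
    """
    alerts = []
    for run in running:
        elapsed = now - run["started"]
        limit = limits.get((run["project"], run["scenario"]), default_limit)
        if elapsed > limit:
            alerts.append(format_alert(run, elapsed, limit))
    return alerts
```

The returned strings can then be handed to whichever channel you use (email, Slack, etc.). Running the check from a small scheduled scenario or an external scheduler keeps it independent of the scenarios being watched.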
