
Detecting and Handling Long-Running Scenarios in Dataiku 12.5

Grixis
Level 3

 

Hello Dataiku Community,

I am currently working with Dataiku 12.5 and seeking advice on how to effectively manage long-running scenarios or abnormal build times within my projects on the Automation node. The core of my inquiry is the ability to detect scenarios that exceed a certain execution-time threshold, which can significantly impact resources and operational efficiency.

I am looking for a method or best practices within Dataiku 12.5 that would allow me to automatically detect such scenarios. Upon detection, I would like to have the option either to terminate the scenario automatically or to send an alert to the administrators for immediate action.

Specifically, I am interested in:

  1. Detection Mechanism: Today, by tinkering with the Python API, I know I can retrieve project processing times, store their values as variables, and use them as benchmarks (see the sketch after this list). But it is still a rather unorthodox and fragile approach, in my opinion.

  2. Automated Response: If scenario run data can be accessed directly, why is there no built-in control step that evaluates processing time and can defer or stop the run; or does such a step exist and I have missed it? In the meantime, I have tried to approximate this by attaching an auxiliary scenario to my variable-retrieval scenario to monitor the project's builds.

  3. Alert System: If opting for alerts, what is the best way to configure them in Dataiku 12.5? Can alerts be customized to provide detailed information about the scenario in question, including its execution time and potential impact on the system?
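
To make point 1 concrete, below is a minimal sketch of the kind of detection I have in mind, using the public API client. The project key, scenario id and the 1.5x factor are placeholders, and it assumes the get_last_runs(), get_current_run() and get_duration() methods of the API client's scenario objects:

import dataiku

# Inside DSS; from outside, use dataikuapi.DSSClient(host, api_key) instead
client = dataiku.api_client()
project = client.get_project("MY_PROJECT_KEY")      # placeholder key
scenario = project.get_scenario("MY_SCENARIO_ID")   # placeholder id

# Average duration (in seconds) over the last finished runs
finished = scenario.get_last_runs(limit=10, only_finished_runs=True)
if finished:
    avg_duration = sum(run.get_duration() for run in finished) / len(finished)

    # Flag the scenario if its current run exceeds 1.5x the historical average
    current = scenario.get_current_run()
    if current is not None and current.get_duration() > 1.5 * avg_duration:
        print(f"Scenario over threshold: {current.get_duration():.0f}s "
              f"(historical average {avg_duration:.0f}s)")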

 

Thank you for your time,

Martin

Turribeach

There is no built-in way to do what you want, so you will need to build your own solution to fit your needs. The Dataiku API is the way to go. Personally, I would prefer something that runs outside of the scenario itself and just monitors and alerts on scenarios suspected of running longer than expected. But below is a sample of how you could control a build step from within the step itself:

https://developer.dataiku.com/latest/concepts-and-examples/scenarios.html#define-scenario-timeout

But I wouldn't go as far as aborting steps or scenarios; in my view there is just too much uncertainty in using a linear estimation model to predict the processing times of complex data flows. In many cases scenarios overrun previous run times for external reasons (more data, source systems being too busy, network congestion, database locks, etc.), and let's not forget the internal reasons: DSS itself being too busy, new processing changes in the flow, bad user code, additional data causing exponential performance degradation, etc.

Regarding alerts, these can obviously be customised to anything you want. In general, the easiest way to send scenario alerts is to use a Reporter within the scenario, but that runs within the scenario itself, so it's not a good solution for you, as you will most likely be monitoring scenarios from outside the scenario. Below is a solution using email:

https://community.dataiku.com/t5/Using-Dataiku/Is-it-possible-to-get-notified-when-a-job-exceeds-a-p...

But the same approach can also be used to send notifications to the other notification channels Dataiku supports, such as the Slack, Microsoft Teams, Webhook and Twilio reporters.
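
As a rough sketch of the external monitoring idea, a plain SMTP email needs no Reporter configuration at all. This assumes the same get_current_run()/get_duration() calls as in the sketch above; the host, API key, addresses, ids and threshold are all placeholders for your environment:

import smtplib
from email.message import EmailMessage

import dataikuapi

# All connection details below are placeholders
client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT_KEY")
scenario = project.get_scenario("MY_SCENARIO_ID")
THRESHOLD_SECONDS = 3600

# get_current_run() returns None when the scenario is not running
current = scenario.get_current_run()
if current is not None and current.get_duration() > THRESHOLD_SECONDS:
    msg = EmailMessage()
    msg["Subject"] = "Long-running scenario MY_SCENARIO_ID"
    msg["From"] = "dss-monitor@example.com"
    msg["To"] = "admins@example.com"
    msg.set_content(
        f"Scenario MY_SCENARIO_ID in project MY_PROJECT_KEY has been running "
        f"for {current.get_duration():.0f}s (threshold {THRESHOLD_SECONDS}s)."
    )
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.send_message(msg)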

Grixis
Level 3
Author

Hello @Turribeach ,

Thanks for your answer.

I reached the same conclusion when I looked into this, so we added a mandatory step to the projects concerned, using the DSS Python API as in the example below:

import time
import dataiku
from dataiku.scenario import Scenario

# Initialize the Dataiku scenario instance
scenario = Scenario()

# Access the current project
client = dataiku.api_client()
project = client.get_default_project()

# Retrieve project variables and set the average scenario timeout
project_variables = project.get_variables()
# Assume 'average_scenario_timeout' is defined in the project variables
# Set a default timeout of 3600 seconds if the variable isn't set
TIMEOUT_SECONDS = float(project_variables["standard"].get("average_scenario_timeout", 3600))

# Substitute your "dataset_name" here
dataset_name = "your_dataset_name"  # Replace with your actual dataset name

# Start building the dataset
step_handle = scenario.build_dataset(dataset_name, asynchronous=True)

start_time = time.time()

# Check if the build is finished within the TIMEOUT_SECONDS
while not step_handle.is_done():
    elapsed_time = time.time() - start_time
    
    if elapsed_time > TIMEOUT_SECONDS:
        # Abort the build if the timeout is exceeded
        step_handle.abort()
        # Define alerting system rules
        raise Exception("Scenario interrupted: average processing time limit exceeded.")
    else:
        print(f"Currently running... Duration: {int(elapsed_time)}s")
        # Wait before the next check to minimize load
        time.sleep(10)

 

And as you rightly point out, there are limitations to this approach:


- Rigidity: the limit is a project variable set to an arbitrary average processing time.

- It cannot tell whether a slowdown is caused by the platform or by the database where the processing actually runs.

 

Nevertheless, it remains a viable solution that lets us manage processing windows from the DSS platform itself, without having to go through our orchestrator and outsource this control.
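
One way to soften the rigidity would be to recompute the variable periodically from recent run history instead of fixing it by hand, e.g. as the mean plus two standard deviations of past durations. A rough sketch, assuming the same average_scenario_timeout project variable and the get_last_runs()/get_duration() methods of the public API client (the scenario id is a placeholder):

import statistics

import dataiku

client = dataiku.api_client()
project = client.get_default_project()
scenario = project.get_scenario("MY_SCENARIO_ID")  # placeholder id

# Durations (in seconds) of recent finished runs
durations = [run.get_duration()
             for run in scenario.get_last_runs(limit=20, only_finished_runs=True)]

if len(durations) >= 2:
    # Mean + 2 standard deviations rather than a fixed, arbitrary average
    threshold = statistics.mean(durations) + 2 * statistics.stdev(durations)
    variables = project.get_variables()
    variables["standard"]["average_scenario_timeout"] = threshold
    project.set_variables(variables)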

 

rmoore

We have a custom-built plugin that accomplishes what I believe you're looking for - feel free to DM me for more info

Grixis
Level 3
Author

Ok, thanks
