Detecting and Handling Long-Running Scenarios in Dataiku 12.5
Hello Dataiku Community,
I am currently working with Dataiku 12.5 and seeking advice on how to effectively manage long-running scenarios or abnormal build times within my projects on the Automation node. The core of my inquiry is the ability to detect scenarios that exceed a certain execution-time threshold, which can significantly impact resources and operational efficiency.
I am looking for a method or best practices within Dataiku 12.5 that would allow me to automatically detect such scenarios. Upon detection, I would like to have the option either to terminate the scenario automatically or to send an alert to the administrators for immediate action.
Specifically, I am interested in:
Detection Mechanism: Today, by tinkering with the Python API, I know I can retrieve past processing times, store their values as project variables and use them as benchmarks (see the sketch after this list). But it is still a rather unorthodox and fragile approach, in my opinion.
Automated Response: Since scenario run data can be queried directly, why is there no built-in control step that evaluates processing time against a threshold, or have I missed it? In the meantime, I have tried to achieve this by adding an auxiliary scenario, alongside my variable-restitution scenario, to monitor the project's runs.
Alert System: If opting for alerts, what is the best way to configure them in Dataiku 12.5? Can alerts be customized to provide detailed information about the scenario in question, including its execution time and potential impact on the system?
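For illustration, here is a stripped-down sketch of what I do today for the detection part. The instance URL, API key, project key, scenario ID and the average_scenario_timeout variable name are placeholders of mine, and I am assuming the raw run dictionary exposes 'start' and 'end' timestamps in epoch milliseconds (to be checked against your DSS version):

import dataikuapi

# Connect to the Automation node (URL and API key are placeholders)
client = dataikuapi.DSSClient("https://dss-automation.example.com", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")

# Fetch the last finished runs of the scenario to benchmark
scenario = project.get_scenario("MY_SCENARIO_ID")
runs = scenario.get_last_runs(limit=10, only_finished_runs=True)

# Average duration in seconds; 'start' and 'end' are assumed to be
# epoch-millisecond fields of the raw run dictionary
durations = [(r.run["end"] - r.run["start"]) / 1000.0 for r in runs]

if durations:
    # Persist the benchmark as a project variable to reuse as a threshold
    variables = project.get_variables()
    variables["standard"]["average_scenario_timeout"] = sum(durations) / len(durations)
    project.set_variables(variables)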
Thank you for your time,
Martin
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,167 Neuron
There is no built-in way to do what you want, so you will need to build your own solution to fit your needs. The Dataiku API is the way to go. Personally, I would prefer something that runs outside the scenario itself and just monitors and alerts on scenarios suspected of running longer than expected. But below is a sample of how you could control a build step from within the step itself:
https://developer.dataiku.com/latest/concepts-and-examples/scenarios.html#define-scenario-timeout
But I wouldn't go as far as aborting steps or scenarios; there is just too much uncertainty, in my view, in using a linear estimation model to predict the processing times of complex data flows. In a lot of cases scenarios do overrun previous run times for external reasons (more data, source systems being too busy, network congestion, database locks, etc.), and let's not forget the internal reasons: DSS itself being too busy, new processing changes in the flow, bad user code, additional data causing exponential performance degradation, etc.
Regarding alerts, these can obviously be customised to anything you want. In general the easiest way to send scenario alerts is to use a Reporter within the scenario, but that is something that runs within the scenario itself, so it's not a good solution for you, as you will most likely be monitoring scenarios from outside the scenario itself. A simple alternative is to send the alert by email from your monitoring code.
Reporters can also send notifications to other Dataiku-supported notification channels, like the Slack, Microsoft Teams, Webhook and Twilio reporters.
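As a starting point, here is a rough sketch of such an external watchdog using the public API plus plain SMTP for the email part. The instance URL, API key, project key, threshold, addresses and SMTP host are all placeholders; the 'start' epoch-millisecond field of the raw run dictionary, and list_scenarios() returning plain dictionaries, are assumptions you should verify against your DSS version:

import time
import smtplib
from email.message import EmailMessage
import dataikuapi

THRESHOLD_SECONDS = 3600  # placeholder: alert once a run exceeds this
ADMIN_EMAIL = "admins@example.com"  # placeholder address
SMTP_HOST = "smtp.example.com"  # placeholder SMTP relay

client = dataikuapi.DSSClient("https://dss-automation.example.com", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")

for item in project.list_scenarios():
    scenario = project.get_scenario(item["id"])
    current = scenario.get_current_run()
    if current is None:
        continue  # scenario is not running right now
    # 'start' is assumed to be an epoch-millisecond field of the raw run dict
    elapsed = time.time() - (current.run["start"] / 1000.0)
    if elapsed > THRESHOLD_SECONDS:
        # Build and send a detailed alert to the administrators
        msg = EmailMessage()
        msg["Subject"] = f"DSS scenario {item['id']} running for {int(elapsed)}s"
        msg["From"] = "dss-monitor@example.com"
        msg["To"] = ADMIN_EMAIL
        msg.set_content(
            f"Scenario {item['name']} ({item['id']}) in project MY_PROJECT "
            f"has been running for {int(elapsed)}s, above the "
            f"{THRESHOLD_SECONDS}s threshold."
        )
        with smtplib.SMTP(SMTP_HOST) as smtp:
            smtp.send_message(msg)

You would run this from cron or from a dedicated monitoring scenario on a time-based trigger, so it stays independent of the scenarios it watches. If you really wanted to kill rather than alert, scenario.abort() is available at that point, but see my reservations above.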
-
Grixis PartnerApplicant, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 82 ✭✭✭✭✭
Hello @Turribeach,
Thanks for your answer.
I came to the same conclusion when I looked into this, so we added a mandatory step to the projects concerned, using the DSS Python API as in the example below:
import time
import dataiku
from dataiku.scenario import Scenario

# Initialize the Dataiku scenario instance
scenario = Scenario()

# Access the current project
client = dataiku.api_client()
project = client.get_default_project()

# Retrieve project variables and set the average scenario timeout
project_variables = project.get_variables()

# Assume 'average_scenario_timeout' is defined in the project variables
# Set a default timeout of 3600 seconds if the variable isn't set
TIMEOUT_SECONDS = float(project_variables["standard"].get("average_scenario_timeout", 3600))

# Substitute your "dataset_name" here
dataset_name = "your_dataset_name"  # Replace with your actual dataset name

# Start building the dataset
step_handle = scenario.build_dataset(dataset_name, asynchronous=True)
start_time = time.time()

# Check if the build is finished within the TIMEOUT_SECONDS
while not step_handle.is_done():
    elapsed_time = time.time() - start_time
    if elapsed_time > TIMEOUT_SECONDS:
        # Abort the build if the timeout is exceeded
        step_handle.abort()
        # Define alerting system rules
        raise Exception("Scenario interrupted: average processing time limit exceeded.")
    else:
        print(f"Currently running... Duration: {int(elapsed_time)}s")
        # Wait before the next check to minimize load
        time.sleep(10)
And as you rightly point out, there are limitations to this approach:
- Rigidity, because the limit is a registered project variable holding an arbitrary average processing time.
- Inability to tell whether a slowdown is due to the platform or to the database where the processing is carried out.
Nevertheless, it remains a viable solution that lets us manage processing windows autonomously from the DSS platform, without having to go through our orchestrator and outsource this control.
-
rmoore Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Participant, Neuron 2023 Posts: 33 Neuron
We do have a custom-built plugin that accomplishes what I believe you're looking for - feel free to DM me for more info.
-
Grixis PartnerApplicant, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 82 ✭✭✭✭✭
Ok thanks