Detecting and Handling Long-Running Scenarios in Dataiku 12.5

Grixis
Grixis PartnerApplicant, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 80 ✭✭✭✭✭

Hello Dataiku Community,

I am currently working with Dataiku 12.5 and seeking advice on how to effectively manage long-running scenarios and abnormal build times within my projects on the Automation node. The core of my inquiry is the ability to detect scenarios that exceed a certain execution-time threshold, which can significantly impact resources and operational efficiency.

I am looking for a method or best practices within Dataiku 12.5 that would allow me to automatically detect such scenarios. Upon detection, I would like to have the option either to terminate the scenario automatically or to send an alert to the administrators for immediate action.

Specifically, I am interested in:

  1. Detection Mechanism: Today, by tinkering with the Python API, I know I can retrieve scenario processing times, store them as project variables and use them as benchmarks (a rough sketch follows this list). But it's still a rather unorthodox and fragile approach, in my opinion.

  2. Automated Response: If scenario run times can be read directly through the API, why isn't there an integrated control step that evaluates the processing time and can defer or abort the run, or have I missed it? In the meantime, I've tried to approximate this with an auxiliary scenario, alongside my variable-restitution scenario, that monitors the project's runs.

  3. Alert System: If opting for alerts, what is the best way to configure them in Dataiku 12.5? Can alerts be customized to provide detailed information about the scenario in question, including its execution time and potential impact on the system?
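
For reference, here is roughly the kind of tinkering I mean in point 1 (a sketch only; the scenario ID is a placeholder, and get_duration() is assumed from recent dataikuapi versions):

    import dataiku

    # Sketch of the benchmark idea: read recent run durations through
    # the API. "my_scenario_id" is a placeholder.
    client = dataiku.api_client()
    project = client.get_default_project()
    scenario = project.get_scenario("my_scenario_id")

    # get_duration() is assumed here; on older API versions the duration
    # can be derived from the run's start and end times instead.
    for run in scenario.get_last_runs(limit=10, only_finished_runs=True):
        print(f"Run {run.id}: {run.get_duration()}s")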

Thank you for your time,

Martin

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,067 Neuron

    There is no built-in way to do what you want, so you will need to build your own solution to fit your needs. The Dataiku API is the way to go. Personally I would prefer something that runs outside the scenario itself and just monitors and alerts on scenarios suspected of running longer than expected. But below is a sample of how you could control a build step within the step itself:

    https://developer.dataiku.com/latest/concepts-and-examples/scenarios.html#define-scenario-timeout
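
    As a rough illustration of the outside-the-scenario monitoring I mentioned (a sketch only: the host, API key and threshold are placeholders, and list_scenarios(as_type="objects") / get_current_run() are assumed from recent dataikuapi versions), something like this could run on a schedule outside DSS:

    import time
    import dataikuapi

    # Placeholders: point these at your Automation node.
    HOST = "https://dss.example.com"
    API_KEY = "your_api_key"
    THRESHOLD_SECONDS = 3600

    client = dataikuapi.DSSClient(HOST, API_KEY)

    # Scan every project for scenarios that are currently running.
    for project_key in client.list_project_keys():
        project = client.get_project(project_key)
        for scenario in project.list_scenarios(as_type="objects"):
            current = scenario.get_current_run()  # None when not running
            if current is None:
                continue
            elapsed = time.time() - current.get_start_time().timestamp()
            if elapsed > THRESHOLD_SECONDS:
                # Alert only; deliberately not aborting (see next paragraph).
                print(f"ALERT: {project_key}.{scenario.id} has been running for {int(elapsed)}s")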

    But I wouldn't go as far as aborting steps or scenarios; there is just too much uncertainty, in my view, in using a linear estimation model to predict the processing times of complex data flows. In a lot of cases scenarios overrun previous run times for external reasons (more data, source systems being too busy, network congestion, database locks, etc.), and let's not forget the internal ones: DSS itself being too busy, new processing changes in the flow, bad user code, additional data causing exponential performance degradation, etc.

    Regarding alerts, these can obviously be customised to anything you want. In general the easiest way to send scenario alerts is to use a Reporter within the scenario, but that runs within the scenario itself, so it's not a good fit for you since you will most likely be monitoring scenarios from outside. Below is a solution using email:

    https://community.dataiku.com/t5/Using-Dataiku/Is-it-possible-to-get-notified-when-a-job-exceeds-a-pre/m-p/33089

    But the same approach can be used to send notifications to the other notification channels Dataiku supports, such as the Slack, Microsoft Teams, Webhook and Twilio reporters.
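
    For instance, an external monitoring script could push a custom alert straight to a Slack incoming webhook (the URL below is a placeholder), including the scenario and its elapsed time:

    import json
    import urllib.request

    # Placeholder: your Slack incoming-webhook URL.
    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

    def send_slack_alert(project_key, scenario_id, elapsed_seconds):
        """Post a long-running-scenario alert to Slack."""
        message = {
            "text": (f":warning: Scenario {project_key}.{scenario_id} has been "
                     f"running for {int(elapsed_seconds)}s, above its expected duration.")
        }
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps(message).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)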

  • Grixis
    Grixis PartnerApplicant, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 80 ✭✭✭✭✭
    edited July 17

    Hello @Turribeach,

    Thanks for your answer.

    I came to the same conclusion when I looked into this, so we added a mandatory control step to the projects concerned, using the DSS Python API as in the example below:

    import time
    import dataiku
    from dataiku.scenario import Scenario
    
    # Initialize the Dataiku scenario instance
    scenario = Scenario()
    
    # Access the current project
    client = dataiku.api_client()
    project = client.get_default_project()
    
    # Retrieve project variables and set the average scenario timeout
    project_variables = project.get_variables()
    # Assume 'average_scenario_timeout' is defined in the project variables
    # Set a default timeout of 3600 seconds if the variable isn't set
    TIMEOUT_SECONDS = float(project_variables["standard"].get("average_scenario_timeout", 3600))
    
    # Replace with your actual dataset name
    dataset_name = "your_dataset_name"
    
    # Start building the dataset
    step_handle = scenario.build_dataset(dataset_name, asynchronous=True)
    
    start_time = time.time()
    
    # Check if the build is finished within the TIMEOUT_SECONDS
    while not step_handle.is_done():
        elapsed_time = time.time() - start_time
        
        if elapsed_time > TIMEOUT_SECONDS:
            # Abort the build if the timeout is exceeded
            step_handle.abort()
            # Alerting rules (e.g. a notification to admins) could be plugged in here
            raise Exception("Scenario interrupted: average processing time limit exceeded.")
        else:
            print(f"Currently running... Duration: {int(elapsed_time)}s")
            # Wait before the next check to minimize load
            time.sleep(10)

    And as you rightly point out, there are limitations to this approach:

    - Rigidity: the limit is a project variable holding an arbitrary average processing time (see the sketch at the end of this post for one way around this)

    - Inability to tell whether a slowdown is caused by the platform or the database doing the processing, rather than by the job itself.

    Nevertheless, it remains a viable solution that lets us manage processing windows autonomously from the DSS platform, without having to go through our orchestrator and outsource this control.
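
    One idea to soften the rigidity point (a sketch only, not something we run today; get_duration() is again assumed from recent dataikuapi versions): derive the timeout from the scenario's own run history, e.g. mean plus two standard deviations of the last runs, instead of a hand-set average.

    import statistics
    import dataiku

    # Recompute the timeout variable from run history.
    # "my_scenario_id" is a placeholder.
    client = dataiku.api_client()
    project = client.get_default_project()
    scenario = project.get_scenario("my_scenario_id")

    durations = [run.get_duration()
                 for run in scenario.get_last_runs(limit=20, only_finished_runs=True)]

    if len(durations) >= 2:
        # Mean + 2 standard deviations tolerates normal run-to-run
        # variation while still flagging genuine outliers.
        timeout_seconds = statistics.mean(durations) + 2 * statistics.stdev(durations)
    else:
        timeout_seconds = 3600.0  # fall back to a fixed default

    variables = project.get_variables()
    variables["standard"]["average_scenario_timeout"] = timeout_seconds
    project.set_variables(variables)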

  • rmoore
    rmoore Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Participant, Neuron 2023 Posts: 33 Neuron

    We do have a custom-built plugin to accomplish what I believe you're looking for - feel free to DM me for more info.
