How to re-run failed scenarios from the last point of failure

Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,737 Neuron
We have some complex scenarios, and sometimes they fail after earlier steps have already run for hours. Once we fix the issue in code, we want to resume the scenario run from the point of failure rather than run the whole scenario again from scratch. However, Dataiku does not offer this feature: scenarios always run from the start. While it's possible to enable/disable individual scenario steps so that only specific steps run, we don't like this option for several reasons:
  1. It's actually changing the scenario, so if the person doing it "forgets" to enable the scenario steps again this could lead to another incident / incorrect data / more failures
  2. As this is an actual change, we are not allowed to do this in our Production environment unless we have a change request
  3. We found a bug where toggling scenario steps between enabled and disabled causes the 'if condition satisfied' option to reset (Ticket #27132)

Another option to avoid rebuilding all your datasets is to use "Build when required" on your dataset steps. We found that this works OK for non-SQL datasets, but for SQL datasets it's not consistent and it's difficult to predict when a dataset will be rebuilt. Besides, we think it's fundamentally the wrong approach to delegate this decision to the "Build when required" functionality. If a scenario step has completed successfully, there shouldn't be anything else at play when re-running a failed scenario from the last point of failure.

So we have gone with the following design approach for our scenarios:
  • Rather than having a single scenario with multiple steps we converted every step into a separate scenario
  • Each scenario-step scenario is given a sequential numeric prefix (1_, 2_, 2_1, 2_2, 3, etc) so we always know the order of execution
  • We then created a "master" scenario that executes all the scenario-step scenarios using a run scenario step
  • The "master" scenario is then scheduled or triggered as needed
  • If there is a failure in the "master" scenario, we investigate the issue, fix it, manually re-run the failed scenario-step scenario, and then manually run any remaining scenario-step scenarios in the correct sequence
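The manual recovery in the last bullet could also be scripted. Below is a minimal sketch, not something we run as-is: the helper names (`remaining_steps`, `resume`, `make_dss_runner`) and the example scenario IDs are hypothetical, and the Dataiku part assumes the public `dataikuapi` client with an API key that is allowed to run scenarios in the project.

```python
from typing import Callable, List, Sequence


def remaining_steps(ordered_ids: Sequence[str], failed_id: str) -> List[str]:
    """Return the failed scenario-step and every scenario-step after it, in order."""
    ids = list(ordered_ids)
    if failed_id not in ids:
        raise ValueError(f"unknown scenario-step: {failed_id!r}")
    return ids[ids.index(failed_id):]


def resume(ordered_ids: Sequence[str], failed_id: str,
           run_step: Callable[[str], bool]) -> List[str]:
    """Run scenario-steps from the failed one onwards, stopping at the first
    new failure. Returns the IDs that completed successfully."""
    completed = []
    for scenario_id in remaining_steps(ordered_ids, failed_id):
        if not run_step(scenario_id):
            break  # later steps depend on this one, so do not continue
        completed.append(scenario_id)
    return completed


def make_dss_runner(host: str, api_key: str, project_key: str) -> Callable[[str], bool]:
    """Build a run_step callable backed by Dataiku (assumes dataikuapi is installed)."""
    import dataikuapi  # imported lazily so the helpers above work without it

    project = dataikuapi.DSSClient(host, api_key).get_project(project_key)

    def run_step(scenario_id: str) -> bool:
        # no_fail=True returns the run instead of raising when it fails
        run = project.get_scenario(scenario_id).run_and_wait(no_fail=True)
        return run.outcome == "SUCCESS"

    return run_step
```

For example, with an ordered list of scenario IDs such as `["1_load", "2_1_clean", "2_2_join", "3_publish"]` (made-up names), `resume(ids, "2_2_join", make_dss_runner(host, key, "MYPROJECT"))` would re-run `2_2_join` and then `3_publish`, leaving the earlier scenario-steps untouched.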

The final piece of the puzzle is a webapp that we are going to deploy to our Automation Production node. It will allow our Production Support people to re-run scenarios with only a Reader license. This guarantees that they cannot make any project changes unless they use a privileged account, which can only be accessed when a change request is approved and active.
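To give an idea of what the core of such a webapp backend might look like, here is a rough sketch rather than our actual implementation. The function name `handle_run_request`, the allow-list, and the status codes are all assumptions made for illustration; the Dataiku call mentioned in the comment assumes the backend runs as a user who is permitted to run scenarios.

```python
from typing import Callable, Dict, Iterable, Tuple


def handle_run_request(scenario_id: str,
                       allowed_ids: Iterable[str],
                       trigger: Callable[[str], None]) -> Tuple[int, Dict[str, str]]:
    """Start a scenario only if it is on the allow-list.

    Returns an (HTTP-style status code, response body) pair so the function
    can be wrapped by whatever web framework the webapp backend uses.
    """
    if scenario_id not in set(allowed_ids):
        # The webapp only ever starts scenarios; it never edits the project.
        return 403, {"error": f"scenario not allowed: {scenario_id}"}
    trigger(scenario_id)
    return 202, {"started": scenario_id}


# Inside a Dataiku webapp backend, `trigger` could plausibly be (assumption):
#   import dataiku
#   project = dataiku.api_client().get_default_project()
#   trigger = lambda sid: project.get_scenario(sid).run()
```

Keeping the allow-list check separate from the Dataiku call means the webapp can only start the scenarios it was configured for, whoever is using it.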

This will greatly improve our operational recovery from scenario failures.
While clever, the above solution requires additional work to break a scenario down into scenario-steps, and it is harder to maintain and understand. It also requires a custom webapp to be able to run scenarios without giving access to make code changes. So it would be great if Dataiku added functionality that allowed the following:
  1. Running a scenario from the last point of failure (i.e. ignoring steps that previously succeeded on a failed run)
  2. Giving Reader or Explorer licenses the privilege to execute scenarios without needing a webapp, a dashboard, or any other write privilege