Write data IMMEDIATELY to a dataiku dataset from a Python recipe

emher (Registered Posts: 32)

I have a collection of triggers that fire the same scenario for different partitions, so these jobs execute in parallel. However, I would like to avoid triggering a partition that is already running (this results in a scenario failure). My initial idea was to:

* Write RUNNING to a status dataset in the beginning of the Python recipe

* Do data processing

* Write COMPLETED to the status dataset in the end of the Python recipe

and then check the status dataset as part of my trigger condition, i.e. if the status is RUNNING, don't fire the trigger.

However, it seems that Dataiku delays the write operations to the end of the recipe, so the RUNNING status is never written (unless the data processing step fails). Is this correct? If so, is there any way to work around it?
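For reference, a minimal sketch of the status-tracking idea described above. The dataset name `job_status` and its column layout are illustrative assumptions, not from the post; `write_with_schema` is the standard dataiku API call for writing a dataframe:

```python
import pandas as pd


def make_status_row(partition, status):
    """Build a one-row status record ("RUNNING" or "COMPLETED") for a partition."""
    return pd.DataFrame(
        [{"partition": partition, "status": status, "updated": pd.Timestamp.now()}]
    )


def write_status(dataset_name, partition, status):
    """Write the status record to a Dataiku dataset (only runs inside DSS)."""
    import dataiku  # the dataiku package is only available inside DSS

    ds = dataiku.Dataset(dataset_name)
    ds.write_with_schema(make_status_row(partition, status))


# Inside the Python recipe, the pattern would be:
#   write_status("job_status", partition, "RUNNING")
#   ...data processing...
#   write_status("job_status", partition, "COMPLETED")
```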

Answers

  • Clément_Stenac (Dataiker, Dataiku DSS Core Designer, Posts: 753)

    Hi,

    First of all, it's important to note that a single scenario cannot run multiple times in parallel.

    If the scenario is already running, another trigger (for the same scenario) firing while it is running will either (depending on the configuration of the scenario):

    • Not happen because by default, triggers don't run while a scenario runs
    • Be ignored
    • Enqueue another run of the scenario that will run after the first one.

    This will not create another run of the same scenario running concurrently.

    Please also note that when a scenario runs:

    • The trigger fires
    • Then the scenario starts (possibly after a grace delay, if configured)
    • Then the "build" step (assuming it's the first) starts
    • The flow graph is traversed, and the list of recipes to run is computed
    • The first recipe starts
    • If it's a Python recipe, the Python code starts

    Thus, if you have a concurrency issue, writing a "running" status at the beginning of the recipe would still leave a large window of opportunity (between "the trigger fires" and "your Python code starts").

    For your original question: assuming that you use "write_with_schema", or that you close the writer if you use "get_writer", the write happens immediately and is not delayed to the end of the recipe.
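    A sketch of the two write styles mentioned, assuming the code runs inside a DSS Python recipe (dataset and variable names are illustrative): `write_with_schema` flushes before it returns, while a writer obtained from `get_writer` only guarantees the write once it is closed, which a `with` block ensures even if the processing raises.

```python
def write_now_with_schema(dataset_name, df):
    """write_with_schema writes and flushes the dataframe before returning."""
    import dataiku  # only available inside DSS

    ds = dataiku.Dataset(dataset_name)
    ds.write_with_schema(df)


def write_now_with_writer(dataset_name, df):
    """With get_writer, data is only guaranteed written after close();
    using the writer as a context manager makes sure close() is called."""
    import dataiku  # only available inside DSS

    ds = dataiku.Dataset(dataset_name)
    with ds.get_writer() as writer:
        writer.write_dataframe(df)
```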

  • emher (Registered Posts: 32)

    Hi Clement,

    I am aware that parallel execution is not possible for non-partitioned data, but I am pretty sure that it is possible for partitioned datasets. Here is how it looks when I run four partitions:

    (screenshot: parallel.png, showing four partitions running at the same time)

    Your point about the "large window of opportunity" my current design leaves open is a very good one, though. I will probably have to rethink the design a bit, i.e. insert the RUNNING statement at an earlier point.
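    One way to close part of that window would be to move the check into a custom Python trigger, so the trigger only fires when the target partition has no RUNNING entry. A hedged sketch: the `job_status` dataset, its columns, and the partition value are assumptions from the earlier posts, while `dataiku.scenario.Trigger` is the standard entry point for custom Python triggers in DSS.

```python
import pandas as pd


def should_fire(status_df, partition):
    """True if the given partition has no RUNNING entry in the status dataset."""
    rows = status_df[status_df["partition"] == partition]
    return not (rows["status"] == "RUNNING").any()


def run_trigger(partition):
    """Custom Python trigger body (only runs inside DSS)."""
    import dataiku
    from dataiku.scenario import Trigger

    status_df = dataiku.Dataset("job_status").get_dataframe()
    if should_fire(status_df, partition):
        Trigger().fire()
```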
