Possible to make a Python recipe wait until all inputs are ready?

Solved!
info-rchitect
Level 6

Hi,

If I have a flow with complex branching and the run times on one side are significantly longer than on the other, the Python recipe fails because one of its inputs is not ready yet. Is it possible to configure a recipe to wait for all of its inputs to be valid? I think of this as akin to joining threads.

(screenshot attached: dataiku_blocker_example.png)

thx


Operating system used: Windows 10


4 Replies
Turribeach

Dataiku automatically waits for all dependency inputs to be refreshed before it starts a new recipe, so you shouldn't be seeing this problem. Sometimes this logic doesn't work as expected, so instead of doing a smart build in one single scenario step you can break the dataset builds down into separate scenario steps. For instance, in your case you could build your parametric_limits_* datasets in one scenario step (say step 1) and then add another step (say step 2) for the upstream dataset(s) that the output recipe depends on. Step 2 will always wait for step 1 to complete. While this adds scenario complexity, it's actually good practice since it breaks the flow build into smaller steps, which means you can resume from the last point of failure if needed by disabling the previously successful steps.
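If you drive the build from a custom Python scenario step, the same two-step sequencing can be expressed in code. Below is a minimal sketch, assuming the `dataiku.scenario` step API and hypothetical dataset names; each `build_dataset` call blocks until its build completes, so the final build can only start after both branches are done:

```python
# Custom Python step inside a Dataiku scenario.
# Build both branches first, then the downstream dataset, so the
# Python recipe never starts before all of its inputs are ready.
from dataiku.scenario import Scenario

scenario = Scenario()

# "Step 1": build each branch (dataset names are hypothetical)
scenario.build_dataset("parametric_limits_left")
scenario.build_dataset("parametric_limits_right")

# "Step 2": runs only once the builds above have completed
scenario.build_dataset("combined_output")
```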

info-rchitect
Level 6
Author

@Turribeach We just upgraded to 12.1 (not saying that is the cause), and I took advantage of the new recursive build feature from within a Python recipe. I did run into the case where one branch finished first and the subsequent recipe failed because the other input branch was not ready. One thing to note is that I almost never take datasets into memory; I nearly always use `SQLExecutor2.exec_recipe_fragment` to handle writing the resultant tables. Perhaps that is why this occurred...
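For context, that in-database pattern looks roughly like the sketch below; the dataset names and query are hypothetical placeholders, and `get_location_info` is used here to resolve the input's physical table name:

```python
# Python recipe body: push the transformation down to the database
# instead of reading the data into pandas.
import dataiku
from dataiku import SQLExecutor2

input_ds = dataiku.Dataset("parametric_limits_left")   # hypothetical
output_ds = dataiku.Dataset("combined_output")         # hypothetical

# Resolve the input's physical table name (SQL datasets only)
table = input_ds.get_location_info()["info"]["table"]

executor = SQLExecutor2(dataset=input_ds)

# The query runs entirely in the database; the result is written to
# output_ds without ever materializing rows in Python memory.
executor.exec_recipe_fragment(output_ds, f"SELECT * FROM {table}")
```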


Turribeach

That shouldn't impact it. Every dataset has a last-build date/time, so I presume Dataiku checks that against the start of the recursive build to decide whether a dataset has been built or needs to be rebuilt. Having said that, we have seen issues with smart rebuilds before, and for that reason we tend not to use them, as we can't predict exactly how the smart rebuild will behave across all the different types of datasets.

Furthermore, since Dataiku doesn't support resuming a scenario run from the last point of failure, we have gone further with our design and have spun off scenario steps as scenarios (say 01_Scenario_Step_BuildX, 02_Scenario_Step_BuildY, 03_Scenario_Step_BuildZ). We then have a master scenario that executes 01, 02, 03 (a scenario step can run another scenario). If the master scenario fails, we fix the issue and then have a choice: either re-run the master scenario or manually re-run only the steps that remain. This has the advantage of not having to disable steps, which could be forgotten and left in an incorrect state.
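If you'd rather script that master scenario than assemble it from "run scenario" steps, a rough sketch using the public API client follows; the scenario IDs are hypothetical, and `run_and_wait` blocks until each sub-scenario finishes and raises on failure, which gives you the stop-at-first-failure behaviour for free:

```python
# Custom Python step in the master scenario: run each sub-scenario
# in order, stopping at the first failure so the remaining ones can
# be re-run later without redoing the completed builds.
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# Sub-scenario IDs are hypothetical placeholders
for scenario_id in ("01_Scenario_Step_BuildX",
                    "02_Scenario_Step_BuildY",
                    "03_Scenario_Step_BuildZ"):
    project.get_scenario(scenario_id).run_and_wait()
    print(f"{scenario_id} completed successfully")
```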

info-rchitect
Level 6
Author

@Turribeach Thanks a lot for the scenario architecture tips, they will come in very handy.
