Running jobs in parallel

Solved!
ben_p
Level 5

Hi everyone,

I have a recipe that triggers many jobs, which take a long time to complete because they run one after the other.

How could I get these jobs to run in parallel in DSS? Is this a simple thing to do?

Ben

Andrey
Dataiker Alumni

Hi Ben,

Could you tell us which recipe you are running? Is your output dataset partitioned?

Concurrent execution (with limitations) is enabled in DSS by default. As a reminder, whenever you run a recipe a DSS Job is created, and this job will consist of one or more activities. An activity is an execution of a recipe per dataset, per partition.

By default, up to 5 concurrent activities can run in DSS, but this number can be changed at different levels.

You may find more information on this page:

https://doc.dataiku.com/dss/latest/flow/limits.html

Regards,

Andrey Avtomonov
R&D Engineer @ Dataiku
ben_p
Level 5
Author

Hi @Andrey 

Thanks for your reply! I am running a python recipe, and I don't have a partitioned output. In fact, no real data is written into DSS by the code.

With this setup, and given your initial response, is parallel running possible here?

Ben

Andrey
Dataiker Alumni

Yes, actually everything I said about activities also applies to python recipes. However, in the case of code recipes you have full control over exactly what happens inside a single activity. This means that if you think some operations inside the recipe could be parallelized, that would need to be implemented in your code.

To achieve that you could use Joblib for example: https://joblib.readthedocs.io/en/latest/
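For instance, a minimal sketch of how that might look inside the recipe (process_item and items are placeholders for your own logic, and n_jobs should be tuned to your machine):

    # Minimal sketch: parallelizing independent work inside a python recipe
    # with joblib. process_item and items are hypothetical placeholders for
    # whatever your recipe does for each element.
    from joblib import Parallel, delayed

    def process_item(item):
        # ... replace with the per-item work your recipe performs ...
        return item * 2

    items = list(range(10))

    # n_jobs=4 runs up to 4 workers at once; n_jobs=-1 would use all cores
    results = Parallel(n_jobs=4)(delayed(process_item)(i) for i in items)
    print(results)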

 

Andrey Avtomonov
R&D Engineer @ Dataiku
ben_p
Level 5
Author

Thanks again @Andrey, this is really helpful. One further question, if I may!

In my case, I am using a python recipe to trigger a rebuild of a whole flow. If I am thinking about this correctly, I cannot run anything in parallel? Will adding parallelisation to my for loop cause the jobs to be created and run in parallel too?

Also, I am setting project variables and looping over them. Can this be done with parallelisation, or will it mess things up?

Ben

Andrey
Dataiker Alumni

Hi Ben,

It would be interesting to see what your flow looks like, but in your case it sounds like parallelizing the python recipe itself is unlikely to help much.

Normally, a DSS flow should be built from right to left. This means that if you have a long flow with a resulting dataset that you want to build at the end, then instead of building the intermediate steps one by one from left to right, you should just build the resulting dataset directly and let DSS figure out what it needs to execute in order to build it. To decide how to build the result, DSS will create a DAG in which some nodes may not require rebuilding and some can be built in parallel (with respect to the maximum parallelism parameter).

So if you build your flow from right to left, DSS takes care of organizing the computation and parallelizing it where possible.
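For example, something along these lines from a python recipe or scenario could request the build of the downstream dataset directly (an untested sketch; "MYPROJECT", "final_dataset" and the exact job-definition keys are assumptions to check against your DSS version's API documentation):

    # Untested sketch: ask DSS to recursively build the final dataset instead
    # of building intermediate datasets one by one. "MYPROJECT" and
    # "final_dataset" are hypothetical names; the job-definition format is an
    # assumption to verify against your DSS version.
    import dataiku

    client = dataiku.api_client()
    project = client.get_project("MYPROJECT")

    # DSS computes the DAG of upstream activities for this output and runs
    # independent activities in parallel, up to the configured concurrency limit.
    job = project.start_job({
        "type": "RECURSIVE_BUILD",
        "outputs": [{"id": "final_dataset"}],
    })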

Accessing project variables is a fast operation and very unlikely to consume a lot of execution time.

Before any optimizations, I'd recommend doing some profiling to find where your job spends most of its time. For example, if the job creates multiple activities, you could check which of them take the longest and start analyzing what could be improved in them. Furthermore, if you see that your python recipe itself is slow (it would help if you could share it), you could simply add some time logging to see where the time is spent.
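Something as simple as the following already tells you a lot (the step names and functions are hypothetical placeholders for your own recipe steps):

    # Simple timing around the main steps of a recipe; the step names and
    # functions below are hypothetical placeholders.
    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("recipe-timing")

    def timed(label, fn, *args, **kwargs):
        """Run fn and log how long it took."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        logger.info("%s took %.2f s", label, time.perf_counter() - start)
        return result

    # Example usage inside the recipe:
    # raw = timed("load input", load_input)
    # cleaned = timed("clean data", clean, raw)
    # timed("trigger downstream builds", run_builds, cleaned)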

 

Andrey Avtomonov
R&D Engineer @ Dataiku