Running jobs in parallel

ben_p
ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

Hi everyone,

I have a recipe that triggers many jobs, which run one after the other and take a long time to complete.

How could I get these jobs to run in parallel in DSS? Is this a simple thing to do?

Ben

Best Answer

  • Andrey
    Andrey Dataiker Alumni Posts: 119 ✭✭✭✭✭✭✭
    Answer ✓

    Yes, everything I said about activities also applies to Python recipes. However, in the case of code recipes you have full control over exactly what happens inside a single activity. This means that if you think some operations inside the recipe could be parallelized, this would need to be implemented in your code.

    To achieve that you could use Joblib, for example: https://joblib.readthedocs.io/en/latest/
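
    A minimal sketch of what that could look like inside a Python recipe - the process_item function and the list of work items are placeholders for whatever your recipe actually does:

        from joblib import Parallel, delayed

        def process_item(item):
            # Placeholder for the per-item work your recipe performs
            return item * 2

        items = range(100)  # hypothetical work items

        # Run up to 4 workers; each call to process_item is
        # independent, so the calls can safely run concurrently
        results = Parallel(n_jobs=4)(delayed(process_item)(i) for i in items)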

Answers

  • Andrey
    Andrey Dataiker Alumni Posts: 119 ✭✭✭✭✭✭✭

    Hi Ben,

    Could you tell us which recipe you are running? Is your output dataset partitioned?

    Concurrent execution (with limitations) is enabled in DSS by default. As a reminder, whenever you run a recipe a DSS job is created; this job consists of one or more activities. An activity is an execution of a recipe for one output dataset and one partition.

    By default, up to 5 concurrent activities can run in DSS, but this limit can be changed at several levels (for example, globally or per project).

    You may find more information on this page:

    https://doc.dataiku.com/dss/latest/flow/limits.html

    Regards,

  • ben_p
    ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

    Hi @Andrey

    Thanks for your reply! I am running a Python recipe and I don't have a partitioned output - in fact, no real data is written into DSS by the code.

    Given this setup and your initial response, is running in parallel possible here?

    Ben

  • ben_p
    ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

    Thanks again @Andrey, this is really helpful! One further question, if I may:

    In my case, I am using a Python recipe to trigger a rebuild of a whole flow. If I am thinking about this correctly, I cannot run anything in parallel? Will adding parallelisation to my for loop cause the jobs to be created and run in parallel too?

    Also, I am setting project variables and looping over them - can this be done with parallelisation, or will it mess things up?

    Ben

  • Andrey
    Andrey Dataiker Alumni Posts: 119 ✭✭✭✭✭✭✭

    Hi Ben,

    It would be interesting to see what your flow looks like, but in your case it sounds unlikely that parallelizing the Python recipe itself would help much.

    Normally a DSS flow should be built from right to left. This means that if you have a long flow with a final dataset that you want to build, then instead of building the intermediate steps one by one from left to right, you should just build that final dataset directly and DSS will figure out what it needs to execute in order to build it. To decide how to build the result, DSS creates a DAG in which some nodes may not require rebuilding and others can be built in parallel (within the maximum-parallelism limit).

    So if you build your flow from right to left, DSS takes care of organizing the computation and parallelizing it where possible.
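
    For illustration, here is a rough sketch of triggering such a build from a Python recipe through the DSS public API. The dataset name is a placeholder, and the exact job-definition options may differ depending on your DSS version:

        import dataiku

        client = dataiku.api_client()
        project = client.get_default_project()

        # Ask DSS to build the final dataset recursively: DSS computes the
        # DAG of upstream activities itself and runs the independent ones
        # in parallel, up to the configured concurrency limit.
        # "final_dataset" is a placeholder name.
        project.start_job_and_wait({
            "type": "RECURSIVE_BUILD",
            "outputs": [{"id": "final_dataset"}],
        })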

    Accessing project variables is a fast operation and very unlikely to consume a lot of execution time.
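
    If it helps, reading and updating project variables through the public API looks roughly like this (the variable name and value are placeholders). Note that the variables form a single shared dictionary per project, so concurrent read-modify-write cycles could overwrite each other:

        import dataiku

        client = dataiku.api_client()
        project = client.get_default_project()

        # Read-modify-write of the project variables dictionary;
        # "my_variable" is a placeholder
        variables = project.get_variables()
        variables["standard"]["my_variable"] = "some_value"
        project.set_variables(variables)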

    Before any optimization, I'd recommend doing some profiling to find where your job spends most of its time. For example, if the job creates multiple activities, you could check which of them are the longest and analyze what could be improved in them. And if you see that your Python recipe itself is slow (it would help if you could share it), you could simply add some time logging to see where the time goes.
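
    For the time logging, something as simple as this is often enough (the step names are placeholders):

        import time

        start = time.perf_counter()
        # ... first step of the recipe ...
        print("step 1 took %.1f s" % (time.perf_counter() - start))

        start = time.perf_counter()
        # ... second step of the recipe ...
        print("step 2 took %.1f s" % (time.perf_counter() - start))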
