Running jobs in parallel

ben_p
ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

Hi everyone,

I have a recipe that triggers many jobs, which run one after the other and take a long time to complete.

How could I get these jobs to run in parallel in DSS? Is this a simple thing to do?

Ben

Best Answer

  • Andrey
    Andrey Dataiker Alumni Posts: 119 ✭✭✭✭✭✭✭
    Answer ✓

    Yes, everything I said about activities also applies to Python recipes. However, in the case of code recipes you have full control over exactly what happens inside a single activity. This means that if you think some operations inside the recipe could be parallelized, this would need to be implemented in your code.

    To achieve that you could use Joblib, for example: https://joblib.readthedocs.io/en/latest/
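
    A minimal sketch of what that could look like inside a Python recipe - the process_item function and the list of work items are placeholders for whatever your recipe actually does:

        from joblib import Parallel, delayed

        def process_item(item):
            # Placeholder for the per-item work your recipe performs
            return item * 2

        items = range(100)  # hypothetical work items

        # Run up to 4 workers; each call to process_item is
        # independent, so the calls can safely run concurrently
        results = Parallel(n_jobs=4)(delayed(process_item)(i) for i in items)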

Answers

  • Andrey
    Andrey Dataiker Alumni Posts: 119 ✭✭✭✭✭✭✭

    Hi Ben,

    Could you tell us which recipe you are running? Is your output dataset partitioned?

    Concurrent execution (with limitations) is enabled in DSS by default. As a reminder, whenever you run a recipe a DSS job is created; this job consists of one or more activities. An activity is an execution of a recipe for one output dataset and one partition.

    By default, up to 5 concurrent activities can run in DSS, but this limit can be changed at several levels (for example, globally or per project).

    You may find more information on this page:

    https://doc.dataiku.com/dss/latest/flow/limits.html

    Regards,

  • ben_p
    ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

    Hi @Andrey

    Thanks for your reply! I am running a Python recipe and I don't have a partitioned output - in fact, no real data is written into DSS by the code.

    Given this setup and your initial response, is running in parallel possible here?

    Ben

  • ben_p
    ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

    Thanks again @Andrey, this is really helpful! One further question, if I may:

    In my case, I am using a Python recipe to trigger a rebuild of a whole flow. If I am thinking about this correctly, I cannot run anything in parallel? Will adding parallelisation to my for loop cause the jobs to be created and run in parallel too?

    Also, I am setting project variables and looping over them - can this be done with parallelisation, or will it mess things up?

    Ben

  • Andrey
    Andrey Dataiker Alumni Posts: 119 ✭✭✭✭✭✭✭

    Hi Ben,

    It would be interesting to see what your flow looks like, but in your case it sounds unlikely that parallelizing the Python recipe itself would help much.

    Normally a DSS flow should be built from right to left. This means that if you have a long flow with a final dataset that you want to build, then instead of building the intermediate steps one by one from left to right, you should just build that final dataset directly and DSS will figure out what it needs to execute in order to build it. To decide how to build the result, DSS creates a DAG in which some nodes may not require rebuilding and others can be built in parallel (within the maximum-parallelism limit).

    So if you build your flow from right to left, DSS takes care of organizing the computation and parallelizing it where possible.
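
    For illustration, here is a rough sketch of triggering such a build from a Python recipe through the DSS public API. The dataset name is a placeholder, and the exact job-definition options may differ depending on your DSS version:

        import dataiku

        client = dataiku.api_client()
        project = client.get_default_project()

        # Ask DSS to build the final dataset recursively: DSS computes the
        # DAG of upstream activities itself and runs the independent ones
        # in parallel, up to the configured concurrency limit.
        # "final_dataset" is a placeholder name.
        project.start_job_and_wait({
            "type": "RECURSIVE_BUILD",
            "outputs": [{"id": "final_dataset"}],
        })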

    Accessing project variables is a fast operation and very unlikely to consume a lot of execution time.
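
    If it helps, reading and updating project variables through the public API looks roughly like this (the variable name and value are placeholders). Note that the variables form a single shared dictionary per project, so concurrent read-modify-write cycles could overwrite each other:

        import dataiku

        client = dataiku.api_client()
        project = client.get_default_project()

        # Read-modify-write of the project variables dictionary;
        # "my_variable" is a placeholder
        variables = project.get_variables()
        variables["standard"]["my_variable"] = "some_value"
        project.set_variables(variables)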

    Before any optimization, I'd recommend doing some profiling to find where your job spends most of its time. For example, if the job creates multiple activities, you could check which of them are the longest and analyze what could be improved in them. And if you see that your Python recipe itself is slow (it would help if you could share it), you could simply add some time logging to see where the time goes.
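
    For the time logging, something as simple as this is often enough (the step names are placeholders):

        import time

        start = time.perf_counter()
        # ... first step of the recipe ...
        print("step 1 took %.1f s" % (time.perf_counter() - start))

        start = time.perf_counter()
        # ... second step of the recipe ...
        print("step 2 took %.1f s" % (time.perf_counter() - start))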
