Spark pipeline runs at most 8 or 9 Spark tasks, whereas a Spark recipe uses the full cluster

AshishM Registered Posts: 4 ✭✭✭✭
edited June 28 in Setup & Configuration

My question is about the number of Spark tasks in a single Spark job.

If I run a single Spark recipe, it follows the block-size rule, i.e. it creates one task per 128 MB block.

But if I run the same Spark job in a Spark pipeline, it runs only 8 or 9 tasks (never more), no matter how big a cluster I choose. This is what the Spark UI shows: we have a 20-node cluster, but the Spark pipeline uses only 2 nodes, whereas the same job run without the pipeline uses the whole cluster.

[Image: Spark UI for the Spark pipeline]

[Image: Spark UI for the single Spark recipe]

As the images above show, recipes run in a Spark pipeline execute only 8 or 9 tasks, while a normal Spark recipe uses the whole cluster, according to data size and block size.
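For reference, the block-size rule described above can be sketched as a small calculation. This is only a rough estimate, assuming one input task per 128 MB block; the function name and the 10 GiB input size are hypothetical and not taken from the job in question.

```python
import math

# Assumed block/split size of 128 MB, as described in the question.
BLOCK_SIZE_BYTES = 128 * 1024 * 1024


def estimated_input_tasks(input_bytes, block_bytes=BLOCK_SIZE_BYTES):
    """Rough number of input tasks: one per block, rounded up."""
    if input_bytes <= 0:
        return 0
    return math.ceil(input_bytes / block_bytes)


# Hypothetical 10 GiB input, for illustration only.
example_tasks = estimated_input_tasks(10 * 1024 ** 3)
print(example_tasks)
```

Note that this estimates how many tasks a stage contains, not how many of them run at the same time.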


Best Answer

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker
    Answer ✓

    From your screenshots, we can see that your Spark stage is properly parallelizable even in pipeline mode, since it has 102 tasks. Scheduling tasks within a stage is not something DSS controls; it is handled by Spark and YARN. It is quite possible that your cluster or queue had some restriction at that time.

    Please also note that in a pipeline, DSS will use the Spark configuration of the "latest" task of the pipeline. If the recipes don't use the same number of executors, that could explain the difference. You should have a look at the Executors page of your Spark UI to see whether you have the same number of executors in both cases.

    Also, if you have dynamic allocation enabled, behavior can be less predictable, especially on such extremely short jobs.

    Please note that this community Q&A is more suited to generic questions rather than support questions about particular jobs. You can use the support portal for support questions about particular jobs.
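
To make the executor comparison suggested in the answer concrete, here is a minimal sketch in plain Python. It assumes you have copied the Spark properties of both runs (for example from the Environment page of the Spark UI) into two dictionaries. The helper names and all configuration values are hypothetical; only the property keys are standard Spark settings, and the fallback values in the code are assumptions rather than values read from any cluster. With static allocation, the number of tasks that can run at once is roughly the number of executors times the task slots per executor, which is separate from how many tasks the stage contains.

```python
# Standard Spark property names that affect how many tasks run concurrently.
EXECUTOR_KEYS = (
    "spark.executor.instances",
    "spark.executor.cores",
    "spark.task.cpus",
    "spark.dynamicAllocation.enabled",
    "spark.dynamicAllocation.maxExecutors",
)


def diff_executor_settings(conf_a, conf_b):
    """Return {key: (value_a, value_b)} for executor-related keys that differ."""
    diffs = {}
    for key in EXECUTOR_KEYS:
        a, b = conf_a.get(key), conf_b.get(key)
        if a != b:
            diffs[key] = (a, b)
    return diffs


def max_concurrent_tasks(conf):
    """Task slots under static allocation: executors * (cores // cpus per task).

    The fallback values below are assumptions used only when a key is absent.
    """
    executors = int(conf.get("spark.executor.instances", 2))
    cores = int(conf.get("spark.executor.cores", 1))
    task_cpus = int(conf.get("spark.task.cpus", 1))
    return executors * (cores // task_cpus)


# Made-up configurations for illustration; not the poster's actual settings.
recipe_conf = {"spark.executor.instances": "20", "spark.executor.cores": "4"}
pipeline_conf = {"spark.executor.instances": "2", "spark.executor.cores": "4"}

print(diff_executor_settings(recipe_conf, pipeline_conf))
print(max_concurrent_tasks(recipe_conf), max_concurrent_tasks(pipeline_conf))
```

With these made-up values, the two configurations differ only in the executor count, and the second one can run far fewer tasks at once even though the stage still has the same total number of tasks. With dynamic allocation, the executor count changes during the job, so this static estimate does not apply.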