Spark on Kubernetes creates wrong node type
When running Spark jobs on Kubernetes from code recipes, visual recipes, or Jupyter notebooks, the wrong type of node is preferentially created on the Kubernetes cluster for Spark to run on.
We have separate "Containerized Execution" configurations for each node type: 2xl, 24xl, and 4gdn-12xl. The "node selector" is set for each one. For the 2xl it is:
nodeSelector:
  role: worker
Regular Python jobs follow this, and the appropriate node is created to run the Python code, whether it runs from a code/visual recipe or a notebook, as long as you choose the correct execution configuration in your recipe or Jupyter kernel.
Spark ignores this in all cases.
We do have the default Spark config set with `spark.kubernetes.executor.node.selector.role` = `worker`
However, this is just ignored.
What's strange is that there will be plenty of room to create the default node (2xl), yet Spark will preferentially get Kubernetes to create the 4gdn-12xl first, and then the 24xl.
If I force the cluster to resize to the number of 2xl nodes I need before I start the Spark job, then Spark will run on those newly created 2xl nodes.
I should not be expected to do this for every single Spark job that is ever run.
Also, only a few people have permission to use the 24xl and 4gdn-12xl containerized execution environments, but Spark will create these nodes no matter who is running the recipe or notebook that runs Spark.
How can I prevent Spark from creating whatever node it wants on Kubernetes and force it to use only the 2xl nodes?
Operating system used: CentOS
Best Answer
Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker
Hi @clayms,
As a first step, can you try adding the more general `spark.kubernetes.node.selector.role` = `worker` configuration to your Spark configuration and then retry the job?
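For illustration only, here is a minimal sketch of how that property is formed and how it would look as a `spark-submit` flag. The helper names `node_selector_conf` and `to_submit_args` are hypothetical and are not part of DSS or Spark. Only the `spark.kubernetes.node.selector.[labelKey]` property pattern comes from Spark itself. In DSS you would normally enter the key and value in the Spark configuration rather than build flags by hand.

```python
def node_selector_conf(label_key, label_value):
    """Build the general Spark-on-Kubernetes node selector property.

    spark.kubernetes.node.selector.<labelKey> applies to both the driver
    pod and the executor pods.
    """
    return {f"spark.kubernetes.node.selector.{label_key}": label_value}


def to_submit_args(conf):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical usage, with the label from the 2xl execution configuration.
conf = node_selector_conf("role", "worker")
print(" ".join(to_submit_args(conf)))
# prints: --conf spark.kubernetes.node.selector.role=worker
```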
If the selection is still not as expected, can you open a ticket with us and attach:
(1) the output of `kubectl describe nodes > node_description.txt`
(2) a job diagnostic of the Spark job that executes on the wrong nodes
This should allow us to identify if there are any issues with your existing configuration.
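Before opening a ticket, it may also be worth checking which nodes actually carry the `role` label, since a node selector can only tell node types apart if their labels differ. The sketch below is an assumption-laden example and not a DSS tool: it reads JSON in the shape produced by `kubectl get nodes -o json` (a different command from the `describe` one above), and the function name `label_values` and the sample node names are hypothetical.

```python
import json


def label_values(nodes_json, label_key):
    """Map each node name to its value for label_key (None if unset).

    nodes_json is expected to be the text output of
    `kubectl get nodes -o json`.
    """
    nodes = json.loads(nodes_json)["items"]
    return {
        node["metadata"]["name"]: node["metadata"].get("labels", {}).get(label_key)
        for node in nodes
    }


# Hypothetical sample standing in for real kubectl output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "node-a", "labels": {"role": "worker"}}},
        {"metadata": {"name": "node-b", "labels": {"role": "gpu"}}},
        {"metadata": {"name": "node-c", "labels": {}}},
    ]
})

for name, value in label_values(sample, "role").items():
    print(name, value)
```

If nodes of several types report the same value for the label, the selector alone cannot restrict Spark to one of them.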
Thanks,
Sarina