Initial job has not accepted any resources (Recipes on Kubernetes)

bashayr Registered Posts: 3 ✭✭✭✭

I use DSS to push the execution of visual recipes to containerized execution on a Kubernetes (k8s) cluster, with Spark as the execution engine. I pushed two images to the registry: dku-exec-base and dku-spark-base. However, when I run the recipe it runs forever (creating and deleting pods in k8s), and I found this line in the job logs:

Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Here are the pod logs in k8s:

++ id -u
+ myuid=998
++ id -g
+ mygid=996
+ set +e
++ getent passwd 998
+ uidentry=dataiku:x:998:996::/home/dataiku:/bin/bash
+ set -e
+ '[' -z dataiku:x:998:996::/home/dataiku:/bin/bash ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ case "$1" in
+ shift 1
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP)
+ /bin/java -Dspark.driver.port=37802 -Xms1g -Xmx1g -cp ':/opt/spark/jars/*:' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.10.0.6:37802 --executor-id 105 --cores 1 --app-id spark-application-1643539966991 --hostname 10.244.0.102
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/01/30 11:04:56 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 14@dsspredictscorewebnewcustomers-b2jyc7ki-exec-105
22/01/30 11:04:56 INFO SignalUtils: Registered signal handler for TERM
22/01/30 11:04:56 INFO SignalUtils: Registered signal handler for HUP
22/01/30 11:04:56 INFO SignalUtils: Registered signal handler for INT
22/01/30 11:04:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/01/30 11:04:56 INFO SecurityManager: Changing view acls to: dataiku
22/01/30 11:04:56 INFO SecurityManager: Changing modify acls to: dataiku
22/01/30 11:04:56 INFO SecurityManager: Changing view acls groups to: 
22/01/30 11:04:56 INFO SecurityManager: Changing modify acls groups to: 
22/01/30 11:04:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(dataiku); groups with view permissions: Set(); users  with modify permissions: Set(dataiku); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1780)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:283)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:272)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:302)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$3(CoarseGrainedExecutorBackend.scala:303)
        at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
        at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
        at scala.collection.immutable.Range.foreach(Range.scala:158)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$1(CoarseGrainedExecutorBackend.scala:301)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
        ... 4 more
Caused by: java.io.IOException: Failed to connect to /10.10.0.6:37802
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.10.0.6:37802
Caused by: java.net.NoRouteToHostException: No route to host
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:750)

Connectivity between DSS and the cluster works, but what does this error mean? What am I missing?

(k8s on DigitalOcean)


Operating system used: CentOS 7


Answers

  • fchataigner2 Dataiker Posts: 355 Dataiker

    Hi,

    Spark requires that the executors can connect to the driver, and almost all communication flows in the direction executor -> driver. If 10.10.0.6 is the IP of the DSS machine, you need to ensure that the cluster nodes (and the pods) can reach it. This error means that your cluster setup is blocking that communication, so you need to add security groups or firewall rules, depending on the cloud your cluster is running on.
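    As a quick check from inside the cluster, you can test whether a pod can actually reach that address and port. This is a sketch: the driver IP 10.10.0.6 and port 37802 are taken from the logs above, and nicolaka/netshoot is just one convenient image that ships with nc.

    # Launch a throwaway pod and probe the driver endpoint from the pod network
    kubectl run netcheck --rm -it --restart=Never --image=nicolaka/netshoot -- \
      nc -vz 10.10.0.6 37802

    # If the probe fails, open the relevant ports on the driver host. Spark picks
    # random driver/block-manager ports by default, so it is common to pin them
    # (spark.driver.port, spark.blockManager.port) and allow a narrow range.
    # On CentOS 7 with firewalld, for example:
    sudo firewall-cmd --permanent --add-port=37000-38000/tcp
    sudo firewall-cmd --reload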

  • bashayr Registered Posts: 3 ✭✭✭✭

    Hi @fchataigner2,
    thanks for your reply.

    This IP is not the IP of the DSS machine, and I had disabled the firewall. However, I added the following to bin/env-site.sh:

    export DKU_BACKEND_EXT_HOST=DSS_IP

    After also adding spark.driver.host to the Spark config, it works!
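    For reference, here is the combination that worked for me, as a sketch (DSS_IP is a placeholder for the externally reachable IP of the DSS host; where the Spark settings live may vary by DSS version):

    # In DATA_DIR/bin/env-site.sh, make the DSS backend advertise a reachable host:
    export DKU_BACKEND_EXT_HOST=DSS_IP

    # In the Spark configuration used by the recipe, point the driver host at the
    # same address so executors connect back to it:
    #   spark.driver.host = DSS_IP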
