Initial job has not accepted any resources (Recipes on Kubernetes)

bashayr Registered Posts: 3 ✭✭✭✭

I use DSS to push the execution of visual recipes to containerized execution on a Kubernetes (k8s) cluster, with Spark as the execution engine. I pushed two images to the registry: dku-exec-base and dku-spark-base. However, when I run the recipe it runs forever (creating and deleting pods in k8s), and I found this line in the job logs:

Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Here are the pod logs in k8s:

++ id -u
+ myuid=998
++ id -g
+ mygid=996
+ set +e
++ getent passwd 998
+ uidentry=dataiku:x:998:996::/home/dataiku:/bin/bash
+ set -e
+ '[' -z dataiku:x:998:996::/home/dataiku:/bin/bash ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ case "$1" in
+ shift 1
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP)
+ /bin/java -Dspark.driver.port=37802 -Xms1g -Xmx1g -cp ':/opt/spark/jars/*:' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.10.0.6:37802 --executor-id 105 --cores 1 --app-id spark-application-1643539966991 --hostname 10.244.0.102
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/01/30 11:04:56 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 14@dsspredictscorewebnewcustomers-b2jyc7ki-exec-105
22/01/30 11:04:56 INFO SignalUtils: Registered signal handler for TERM
22/01/30 11:04:56 INFO SignalUtils: Registered signal handler for HUP
22/01/30 11:04:56 INFO SignalUtils: Registered signal handler for INT
22/01/30 11:04:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/01/30 11:04:56 INFO SecurityManager: Changing view acls to: dataiku
22/01/30 11:04:56 INFO SecurityManager: Changing modify acls to: dataiku
22/01/30 11:04:56 INFO SecurityManager: Changing view acls groups to: 
22/01/30 11:04:56 INFO SecurityManager: Changing modify acls groups to: 
22/01/30 11:04:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(dataiku); groups with view permissions: Set(); users  with modify permissions: Set(dataiku); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1780)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:283)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:272)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:302)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$3(CoarseGrainedExecutorBackend.scala:303)
        at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
        at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
        at scala.collection.immutable.Range.foreach(Range.scala:158)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$1(CoarseGrainedExecutorBackend.scala:301)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
        ... 4 more
Caused by: java.io.IOException: Failed to connect to /10.10.0.6:37802
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.10.0.6:37802
Caused by: java.net.NoRouteToHostException: No route to host
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:750)

Connectivity between DSS and the cluster works, but what does this error mean? What am I missing?

(k8s on DigitalOcean)


Operating system used: CentOS 7


Answers

  • fchataigner2 Dataiker Posts: 355 Dataiker

    Hi,

    Spark requires that the executors can connect to the driver, and almost all communication flows in the direction executor -> driver. If 10.10.0.6 is the IP of the DSS machine, you need to ensure that the cluster nodes (and the pods) can reach it. This error means that your cluster setup is blocking that communication, so you need to add security groups or firewall rules, depending on the cloud your cluster is running on.
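    As a quick check from inside the cluster, you can test whether a pod can actually reach that address and port. This is a sketch: the driver IP 10.10.0.6 and port 37802 are taken from the logs above, and nicolaka/netshoot is just one convenient image that ships with nc.

    # Launch a throwaway pod and probe the driver endpoint from the pod network
    kubectl run netcheck --rm -it --restart=Never --image=nicolaka/netshoot -- \
      nc -vz 10.10.0.6 37802

    # If the probe fails, open the relevant ports on the driver host. Spark picks
    # random driver/block-manager ports by default, so it is common to pin them
    # (spark.driver.port, spark.blockManager.port) and allow a narrow range.
    # On CentOS 7 with firewalld, for example:
    sudo firewall-cmd --permanent --add-port=37000-38000/tcp
    sudo firewall-cmd --reload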

  • bashayr Registered Posts: 3 ✭✭✭✭

    Hi @fchataigner2,
    thanks for your reply.

    This IP is not the IP of the DSS machine, and I had disabled the firewall. However, I added the following to bin/env-site.sh:

    export DKU_BACKEND_EXT_HOST=DSS_IP

    After also adding spark.driver.host to the Spark config, it works!
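    For reference, here is the combination that worked for me, as a sketch (DSS_IP is a placeholder for the externally reachable IP of the DSS host; where the Spark settings live may vary by DSS version):

    # In DATA_DIR/bin/env-site.sh, make the DSS backend advertise a reachable host:
    export DKU_BACKEND_EXT_HOST=DSS_IP

    # In the Spark configuration used by the recipe, point the driver host at the
    # same address so executors connect back to it:
    #   spark.driver.host = DSS_IP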
