No connection defined to upload files/jars
I am trying to execute a PySpark recipe on a remote AWS EMR Spark cluster and I am getting:
Your Spark settings don't define a temporary storage for yarn-cluster mode in act.compute_prepdataset1_NP: No connection defined to upload files/jars
I am using this runtime configuration:
I also tried adding:
spark.yarn.stagingDir -> hdfs://ip-172-31-43-168.ec2.internal:8020/user/hadoop/.sparkStaging/
From the command line I can successfully run:
spark-submit --master yarn --deploy-mode cluster --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=1 --conf spark.num.executors=2 --conf spark.yarn.am.memory=1024m --conf spark.yarn.am.cores=1 test_job.py
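For reference, the post does not include test_job.py itself; a minimal script along these lines (just a sketch, the app name is made up) is enough to confirm that YARN accepts the job and the executors do work:

from pyspark.sql import SparkSession

# Minimal connectivity check: start a session on YARN, run a trivial
# distributed computation, and stop cleanly.
spark = SparkSession.builder.appName("emr-connectivity-check").getOrCreate()

# Force a small computation on the executors so any cluster-side failure
# surfaces here rather than at submit time.
count = spark.range(0, 1000).count()
print("row count from executors:", count)

spark.stop()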
which means the communication between the client and the AWS EMR Spark cluster is working fine. My S3 and hdfs_root connections are also working fine.
Thank you!
Operating system used: Amazon Linux 2023
Best Answer
All set! I had the wrong jars.
DSS is working perfectly with AWS EMR v6.15.0 and the very latest v7.1.0. What an amazing product! I cannot say enough good things about the people who developed and continue to develop it.
Answers
"yarnClusterSettings":{ "connectionName":"hdfs_root", "location":"/user/hadoop/.sparkStaging/" }
I overcame that problem; this one:
2024-06-03 09:15:54,513 INFO Not running pyspark-over-k8s in cluster mode, not distributing
will be more difficult to overcome. The idea was to run the job on the AWS EMR cluster, which I understand is, or soon will be, deprecated. Not a good decision, as far as I am concerned.