No connection defined to upload files/jars

azorman Registered Posts: 3
edited July 16 in Setup & Configuration

I am trying to execute a PySpark recipe on a remote AWS EMR Spark cluster and I am getting:

Your Spark settings don't define a temporary storage for yarn-cluster mode
in act.compute_prepdataset1_NP: No connection defined to upload files/jars

I am using this runtime configuration:
[screenshot of the Spark runtime configuration attached]

I also tried adding:

spark.yarn.stagingDir -> hdfs://ip-172-31-43-168.ec2.internal:8020/user/hadoop/.sparkStaging/
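
For what it's worth, spark.yarn.stagingDir is a standard Spark-on-YARN property, but it has to be known at submission time. A minimal client-mode sketch outside DSS (reusing the HDFS path above) would be:

from pyspark.sql import SparkSession

# Client-mode YARN submission: the staging directory must be set when
# the application is first submitted, i.e. when the session is created
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("staging-dir-test")
    .config("spark.yarn.stagingDir",
            "hdfs://ip-172-31-43-168.ec2.internal:8020/user/hadoop/.sparkStaging/")
    .getOrCreate()
)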

From the command line I can successfully run:

spark-submit --master yarn --deploy-mode cluster --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=1 --conf spark.executor.instances=2 --conf spark.yarn.am.memory=1024m --conf spark.yarn.am.cores=1 test_job.py

which means the communication between the client and the AWS EMR Spark cluster is working fine. The S3 and hdfs_root connections are also working fine.
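
For reference, a minimal hypothetical version of a connectivity smoke test like test_job.py (the original contents are not shown) could be:

from pyspark.sql import SparkSession

# In yarn-cluster mode the master and deploy mode come from spark-submit
spark = SparkSession.builder.appName("emr-connectivity-test").getOrCreate()

# Run a trivial distributed job so success is visible in the YARN logs
total = spark.sparkContext.parallelize(range(100)).sum()
print("Sum computed on the executors:", total)

spark.stop()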

Thank you!


Operating system used: Amazon Linux 2023

Best Answer

  • azorman Registered Posts: 3
    Answer ✓

    All set! I had the wrong jars.

    DSS is working perfectly with AWS EMR v6.15.0 and the very latest v7.1.0.

    What an amazing product! I cannot say enough good things about the people who developed and continue to develop it.

Answers

  • azorman Registered Posts: 3
    edited July 17
    "yarnClusterSettings":{
       "connectionName":"hdfs_root",
       "location":"/user/hadoop/.sparkStaging/"
    }

    I overcame that problem; this one:

    2024-06-03 09:15:54,513 INFO Not running pyspark-over-k8s in cluster mode, not distributing

    will be more difficult to overcome. The idea was to run it on the AWS EMR cluster, which I understand is, or will be, deprecated. Not a good decision, as far as I am concerned.
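
    For anyone hitting the same error: below is a rough sketch of applying that setting through the dataikuapi Python client. It assumes that named Spark execution configs live under sparkSettings.executionConfigs in the general settings and uses a hypothetical config name, so verify both against your DSS version.

    import dataikuapi

    # Connect to the DSS instance (URL and API key are placeholders)
    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")

    settings = client.get_general_settings()
    raw = settings.get_raw()

    # ASSUMPTION: named Spark execution configs live under
    # sparkSettings.executionConfigs in the general settings;
    # verify the exact key path on your DSS version
    for cfg in raw.get("sparkSettings", {}).get("executionConfigs", []):
        if cfg.get("name") == "emr-yarn-cluster":  # hypothetical config name
            cfg["yarnClusterSettings"] = {
                "connectionName": "hdfs_root",
                "location": "/user/hadoop/.sparkStaging/",
            }

    settings.save()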
