-
PYSPARK_PYTHON environment variable issue in PySpark
Hi, I am facing the issue below with a PySpark recipe: Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set. I have set the environment variables using…
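A minimal sketch of the usual fix in plain PySpark, assuming both sides should run the driver's Python 3.6 interpreter (the interpreter path below is an assumption, not the poster's actual layout):

    import os

    # Both variables must point at the same Python 3 interpreter, and must be
    # set before the SparkContext/SparkSession is created, not after.
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6"        # assumed path
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.6" # assumed path

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("version-check").getOrCreate()
    # Sanity check: the Python major.minor version Spark sees on the workers
    print(spark.sparkContext.pythonVer)

In a Dataiku recipe this is typically governed by the code environment and the Spark configuration rather than ad-hoc exports, so the sketch is mainly useful for confirming where the mismatch comes from.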
-
import packages warnings
When I import packages, I get the warnings shown in the attached pictures. What are they and how can I get rid of them?
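Without the screenshots it is hard to say which warnings these are; assuming they are ordinary Python warnings such as deprecation notices (and not Spark/log4j output), a sketch of silencing them:

    import warnings

    # Only applies to Python warnings; Spark/log4j messages are controlled
    # through the logging configuration instead.
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    warnings.filterwarnings("ignore", category=FutureWarning)

    import pandas as pd  # example of an import that previously emitted warnings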
-
No connection defined to upload files/jars
I am trying to execute a PySpark recipe on a remote AWS EMR Spark cluster and I am getting: Your Spark settings don't define a temporary storage for yarn-cluster mode in act.compute_prepdataset1_NP: No connection defined to upload files/jars. I am using this runtime configuration: I also tried adding: spark.yarn.stagingDir…
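For reference, a sketch of what the staging-directory property looks like on the Spark side; the HDFS path is a placeholder, and in Dataiku the temporary storage for yarn-cluster mode is normally defined through a connection in the Spark settings rather than in recipe code:

    from pyspark.sql import SparkSession

    # spark.yarn.stagingDir controls where Spark uploads files/jars in
    # yarn-cluster mode; the path below is an assumed example.
    spark = (
        SparkSession.builder
        .appName("emr-staging-sketch")
        .config("spark.yarn.stagingDir", "hdfs:///tmp/spark-staging")
        .getOrCreate()
    )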
-
Best Practices For Updating and Renaming Spark and Container Configurations
Hello Dataiku Community, hope all is well! Our team is looking to implement new Spark and container configuration settings on our instances, and we are curious what the best practices are for updating the existing configurations. For context, we have existing Spark configurations already being used by end users,…
-
General / Rule of Thumb Spark Configuration Settings
We are using managed Spark over Kubernetes on EKS. We have about 80 active users on our design node, and about half of them use Spark regularly. We've tried to make things easy by creating simple Spark configurations, but we find ourselves continuously changing them. With multiple Spark applications, has anyone…
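As a purely illustrative starting point (the sizes below are assumptions for a Kubernetes setup, not Dataiku recommendations), a baseline configuration often looks something like:

    from pyspark.sql import SparkSession

    # Hypothetical baseline for mixed workloads on Kubernetes; in Dataiku these
    # keys would normally live in a named Spark configuration, not recipe code.
    spark = (
        SparkSession.builder
        .config("spark.executor.memory", "8g")
        .config("spark.executor.cores", "2")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
        .config("spark.dynamicAllocation.maxExecutors", "20")
        .getOrCreate()
    )

Dynamic allocation lets small interactive jobs shrink while still allowing heavier jobs to scale out, which tends to reduce how often the base configuration needs retuning.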
-
How to save Pyspark model from notebook to managed folder
Hi, I'm trying to save a PySpark model with model.save("/opt/dataiku/design/managed_folders/PROJECT_TEST/9KeBcUKy/ML_SAVED") from a notebook to a managed folder, but I'm getting the following error: Py4JJavaError: An error occurred while calling o2981.save.: org.apache.spark.SparkException: Job aborted. at…
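One pattern sometimes used for local (non-remote) managed folders, sketched here with an assumed folder id and assuming Spark runs in local mode on the Dataiku node, is to resolve the folder path through the API instead of hard-coding it:

    import os
    import dataiku

    # "my_folder_id" is a placeholder; get_path() only works for folders
    # stored on the local filesystem of the Dataiku node.
    folder = dataiku.Folder("my_folder_id")
    target = os.path.join(folder.get_path(), "ML_SAVED")

    # 'model' is the fitted MLlib model from the notebook. Spark executors do
    # the writing, so a plain local path is only reliable when Spark runs in
    # local mode; on a cluster, save to HDFS/S3 instead.
    model.save("file://" + target)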
-
SQL Spark integration
Where is the "SQL Spark integration" > "Enable direct access" settings documented? When setting it to "For reads" I get the following error: Invalid connection configuration The driver for the connection needs to be passed manually to Spark Where do pass the driver for the connection manually? Can I pass the Maven…
-
Pyspark and python error
I was trying to execute a PySpark script and encountered a Py4J error. Can someone help me with this? I have checked all the version compatibilities as well. I am attaching a screenshot of the error. Operating system used: Ubuntu
-
PySpark Setup via Dataiku: dkuspark.getdataframe() error
Hi all, I'm just starting out on PySpark (and on Dataiku), and debugging via both the Dataiku and PySpark documentation has been quite a challenge. After a lot of searching, it seems my error may be specific to the Dataiku platform. I want to convert a table from a Redshift/SQL server that I defined in my Dataiku…
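For comparison, the usual pattern from the Dataiku PySpark documentation (the dataset name below is a placeholder) reads a Flow dataset into a Spark DataFrame with get_dataframe:

    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    dataset = dataiku.Dataset("my_redshift_table")      # placeholder name
    df = dkuspark.get_dataframe(sqlContext, dataset)    # note: get_dataframe
    df.printSchema()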
-
Accessing Spark web UI
Hello, I am a beginner in Spark and I am trying to set up Spark on our Kubernetes cluster. The cluster is now working and I can run Spark jobs; however, I want to access the Spark web UI to inspect how my job is being distributed. We usually port-forward a port (4040), but I am not able to check which pod is the driver pod…
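One way to locate the UI from inside the job itself, sketched below, is to ask the running SparkContext for its UI address; Spark on Kubernetes also labels the driver pod with spark-role=driver, which can be used to find it for port-forwarding:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # On Kubernetes this URL points at the driver pod (default port 4040)
    print(spark.sparkContext.uiWebUrl)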