PySpark Python executables
jmccartin
Hi Dataiku,
How exactly is the environment variable PYSPARK_DRIVER_PYTHON being set by DSS? No matter what I put in a Spark configuration or which Python environment I choose, it always defaults to the path of the internal Dataiku python executable (/home/dataiku/dss_data/bin/python).
My goal here is to have everything set by the Spark configuration, so if running on a YARN cluster, the executable will be /usr/bin/python3 (default python3 executable on Linux). Whilst running in local mode, I will have a different Spark configuration that points to the python executable of the kernel (in my case, /home/dataiku/dss_data/code-envs/python/python36).
Why is PYSPARK_DRIVER_PYTHON static? And why does the 'Yarn Python executable' variable under the Code Envs page change PYSPARK_PYTHON when Spark is running in local mode? I'm forced to get everyone on the team to override the environment variables at the start of each recipe or notebook (roughly the snippet below).
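For reference, this is roughly the kind of override we have to put at the top of every recipe or notebook; the interpreter path is just an example, not something DSS sets itself:

    import os
    # This has to happen before the SparkContext is created, otherwise it has no effect.
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"   # interpreter the executors should use

    from pyspark import SparkContext
    sc = SparkContext.getOrCreate()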
Answers
-
Hi,
The PYSPARK_DRIVER_PYTHON variable is automatically set to the path of the Python executable of the code environment running your recipe.
Note that if you add pyspark.python properties to your Spark configuration, they will override the environment variable, so you shouldn't set them.
PYSPARK_PYTHON for the executors can either be set manually, or it is filled in automatically if you set the "Yarn Python bin" field in the code environment. It applies regardless of the actual Spark mode, since DSS cannot know which mode will be used.
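For reference, the Spark configuration properties in question are presumably spark.pyspark.python and spark.pyspark.driver.python; in a Spark configuration they would look something like this (the paths are examples only):

    spark.pyspark.python          /usr/bin/python3
    spark.pyspark.driver.python   /path/to/code-env/bin/python

These properties take precedence over PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON respectively when both are set.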
-
Addressing your points in order:
1) I can confirm that this is NOT the case. It's always the inbuilt DSS python, no matter which kernel I choose.
2) I can see it setting PYSPARK_PYTHON, but not the driver, again due to point 1)
3) Might it be better to rename that 'Yarn Python executable' to PYSPARK_PYTHON then? If you're running in local mode, the executors are running locally, not on YARN, and if the Python versions differ you get a crash (as I have; a quick check for this is sketched below).
I am running DSS 5.1.
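For what it's worth, a rough sketch of the check that makes the mismatch obvious, since it prints the interpreter used on each side (assuming the job gets far enough to run a task at all):

    import sys
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    print("driver python:  ", sys.executable)
    # Run one trivial task so an executor reports back its own interpreter path.
    print("executor python:", sc.parallelize([0], 1)
                                .map(lambda _: __import__("sys").executable)
                                .collect()[0])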
-
Hi, should I raise a bug report about point 1 if we are in disagreement about how that environment variable is being set?