Plugin cannot find pyspark
I built the dedicated code environment for the plugin and included pyspark in the plugin's code-env/python/spec/requirements.txt file, but I still get the error below when trying to run import pyspark.
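For reference, the relevant line in code-env/python/spec/requirements.txt is just the package name (the version pin below is an example; match it to your cluster's Spark version):

    # code-env/python/spec/requirements.txt
    pyspark==2.4.5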
Installing a recent version of pandas in a plugin's code environment already requires a kludge: you have to include "corePackagesSet": "PANDAS13" in the desc.json file, or the whole code-environment build fails.
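For comparison, that pandas workaround looks roughly like this in the code environment's desc.json (only the corePackagesSet key comes from the issue above; the other fields are a sketch of a typical plugin code-env desc.json):

    {
        "acceptedPythonInterpreters": ["PYTHON36"],
        "forceConda": false,
        "installCorePackages": true,
        "corePackagesSet": "PANDAS13",
        "installJupyterSupport": false
    }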
Do you have to implement some similar kludge to get the pyspark package installed in the plugin's code environment?
Traceback (most recent call last):
  File "check_spark_run.py", line 4, in <module>
    from pyspark.sql import SparkSession
ModuleNotFoundError: No module named 'pyspark'
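For completeness, the script producing that traceback is essentially a minimal Spark check along these lines (everything except the failing import is a reconstruction):

    # check_spark_run.py (reconstructed sketch)
    from pyspark.sql import SparkSession  # this is the import that fails

    # Build a session and print the Spark version to confirm the env works.
    spark = SparkSession.builder.appName("check_spark_run").getOrCreate()
    print(spark.version)
    spark.stop()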
Operating system used: CentOS
Answers
-
CoreyS (Dataiker Alumni):
Hi @clayms,
Support on this issue will be facilitated directly through our support portal. Once a solution is provided, if relevant, we'll post it here as well for the purpose of knowledge sharing.
-
There is a similar kludge: you have to include "kind": "PYSPARK" in the plugin's recipe.json file.
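In the recipe.json, that looks roughly like this (the kind value is the part confirmed above; the meta block and role placeholders are a sketch):

    {
        "meta": {
            "label": "Spark check recipe"
        },
        "kind": "PYSPARK",
        "inputRoles": [ ... ],
        "outputRoles": [ ... ]
    }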
I was not able to find this documented anywhere.
Also, I was hoping to start Spark from the shell with spark-submit, or from a Python script file, but doing this causes a number of other errors.
Regarding the ModuleNotFoundError, I fixed that by executing the python executable in the plugin code environment's /bin/ directory. Calling python directly only runs the Design node's built-in Python, which may not have the libraries you installed in the plugin's code environment.
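Concretely, that means invoking the interpreter by its full path; the directory name below is illustrative (plugin code environments live under the DSS data directory):

    # Uses the plugin code env's interpreter, which has pyspark installed:
    /path/to/DATA_DIR/code-envs/python/plugin_my-plugin_env/bin/python check_spark_run.py

    # Plain "python" resolves to the Design node's built-in interpreter and fails:
    # python check_spark_run.py
    #   -> ModuleNotFoundError: No module named 'pyspark'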