Plugin cannot find pyspark

clayms Registered Posts: 52 ✭✭✭✭

I built the dedicated code environment for the plugin and included pyspark in the plugin's code-env/python/spec/requirements.txt file, yet I still get the error below when trying to run import pyspark.

Installing a recent version of pandas in a plugin's code environment is already a kludge: you have to include "corePackagesSet": "PANDAS13" in the desc.json file, or the whole code-environment build fails.
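For reference, the pandas workaround looks roughly like this in the plugin's code-env/python/desc.json (a sketch based on my setup; the surrounding fields are illustrative and may vary by DSS version):

```json
{
    "corePackagesSet": "PANDAS13"
}
```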

Do you have to implement some similar kludge to get the pyspark package installed in the plugin's code environment?

Traceback (most recent call last):
  File "check_spark_run.py", line 4, in <module>
    from pyspark.sql import SparkSession
ModuleNotFoundError: No module named 'pyspark'

Operating system used: centos


Answers

  • CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Hi @clayms,
    Support on this issue will be handled directly through our support portal. Once a solution is provided, and if relevant, we'll post it here as well for the purpose of knowledge sharing.

  • clayms Registered Posts: 52 ✭✭✭✭

    There is a similar kludge. You have to include "kind": "PYSPARK", in the plugin's recipe.json file.

    I was not able to find this documented anywhere.
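    The recipe.json change might look like this (a hedged sketch; the meta/label fields are illustrative placeholders, only the "kind" value is the point):

    ```json
    {
        "meta": {
            "label": "My PySpark recipe"
        },
        "kind": "PYSPARK"
    }
    ```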

    I was also hoping to start Spark from the shell with spark-submit or from a Python script file, but doing so causes a number of other errors.

    Regarding the ModuleNotFoundError, I fixed it by executing the Python executable in the plugin code environment's /bin/ directory. Calling python directly runs the Design node's Python, which may not have the libraries you built into the plugin's code environment.
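    A quick way to confirm which interpreter a recipe is actually running under is the snippet below (a generic Python sketch, not DSS-specific; the pyspark check is illustrative):

    ```python
    import importlib.util
    import sys

    # Show which Python interpreter is actually running. If a plugin recipe
    # falls back to the Design node's base Python instead of the plugin's
    # code environment, this path will not point at the code env's bin/python.
    print("Interpreter:", sys.executable)

    # Check whether pyspark is importable without raising, so the failure
    # message is clearer than a bare ModuleNotFoundError.
    if importlib.util.find_spec("pyspark") is None:
        print("pyspark is NOT importable from this interpreter")
    else:
        print("pyspark is importable")
    ```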

  • CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Thank you for sharing your feedback and solution with the rest of the community, @clayms!
