I have a scenario which is using a Python notebook to execute a step. The notebook runs perfectly by itself, but throws a "module not found" error when it's added to a scenario step: "No module named 'pyspark'". Any suggestions on how to fix this?
Operating system used: RHEL 7.9
Hi @VickeyC
I see that you have also opened a support ticket so I am copying the answer from it:
We confirm that it is not possible to run Pyspark notebooks through the Execute scenario step.
You will need to use a Pyspark recipe instead and a "Build" scenario step to build the output. Note that you can use a "dummy" output in your recipe, i.e. creating a dataset but not actually writing to it.
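As a rough illustration of the workaround above, a PySpark recipe in DSS typically follows the pattern below. This is only a sketch: the dataset names (`input_dataset`, `dummy_output`) are placeholders, it relies on Dataiku's `dataiku.spark` helper module, and it can only run inside a DSS PySpark recipe, not as a standalone script.

```python
# Sketch of a DSS PySpark recipe (runs only inside Dataiku DSS).
# Dataset names below are placeholders -- replace with your own.
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the recipe's input dataset as a Spark DataFrame
input_ds = dataiku.Dataset("input_dataset")
df = dkuspark.get_dataframe(sqlContext, input_ds)

# ... port your notebook logic here ...

# Write to the "dummy" output so a "Build" scenario step can target it
output_ds = dataiku.Dataset("dummy_output")
dkuspark.write_with_schema(output_ds, df)
```

In the scenario, you would then replace the "Execute notebook" step with a "Build" step on `dummy_output`.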
Hi @VickeyC
This could be an issue with missing spark jars. Have you run spark-integration on this DSS instance?
@sergeyd, our Dataiku environment runs on a Hadoop edge node. We don't use Docker, Elastic AI, or Kubernetes.
Hi @VickeyC
Thanks for the details. So have you run spark-integration?
@sergeyd Yes, I believe that we ran that when we installed Dataiku. We have a Spark tab in our admin settings.
Yes, thanks for your help!