Added on August 24, 2022 1:00PM
I installed Spark in a notebook environment. When I create a new PySpark notebook, I get the following starter code:
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
dataset = dataiku.Dataset("name_of_the_dataset")
df = dkuspark.get_dataframe(sqlContext, dataset)
The issue is that I have Spark version 3.2.1, and since Spark 2.0, SparkSession has been the entry point for programming with DataFrames and Datasets. So I am creating a Spark session as follows:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName("").getOrCreate() # cluster ip
As a result, running the following line gives me an error:
df = dkuspark.get_dataframe(sqlContext, dataset)
Py4JJavaError: An error occurred while calling o32.classForName.
: java.lang.ClassNotFoundException: com.dataiku.dip.spark.StdDataikuSparkContext
The Spark submit arguments aren't passing the needed jars to Spark, which means you probably haven't done the integration of Spark with DSS (see ). On a related note, make sure you don't install pyspark as a package in your code env, since that should be handled by the install-spark-integration script.
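One way to check this diagnosis from inside the notebook is to look at what the running session was actually given. The sketch below is only an assumption about where to look: `PYSPARK_SUBMIT_ARGS`, `spark.jars`, `spark.driver.extraClassPath`, and `spark.executor.extraClassPath` are standard Spark settings, but this thread does not say which of them the DSS integration populates, and `classpath_report` is a hypothetical helper name.

```python
import os


def classpath_report(spark):
    """Collect the settings that usually carry extra jars into a Spark session.

    `spark` is expected to be a pyspark.sql.SparkSession (assumption).
    """
    conf = spark.sparkContext.getConf()
    # Arguments handed to spark-submit when PySpark launches the JVM
    report = {"PYSPARK_SUBMIT_ARGS": os.environ.get("PYSPARK_SUBMIT_ARGS", "")}
    # Standard Spark properties that can add jars to the classpath
    for key in (
        "spark.jars",
        "spark.driver.extraClassPath",
        "spark.executor.extraClassPath",
    ):
        report[key] = conf.get(key, "")
    return report
```

If `classpath_report(spark)` shows no Dataiku jar in any of these values, that would be consistent with the `ClassNotFoundException` reported in the question.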
I did the spark integration with DSS. I am creating a Spark session as mentioned above. I need the updated DSS code to import data as Spark dataframe. I've read the documentation, but I can't seem to find the answer.
Once you have your Spark SQLContext object, you can simply do the following:
import dataiku
import dataiku.spark as dkuspark
# Example: Read the descriptor of a Dataiku dataset
mydataset = dataiku.Dataset("mydataset")
# And read it as a Spark dataframe
df = dkuspark.get_dataframe(sqlContext, mydataset)
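Since the question starts from a SparkSession rather than a SparkContext, a minimal sketch of bridging the two might look like this. It assumes that `dkuspark.get_dataframe` accepts an `SQLContext` built on top of an existing session (the reply only says it needs an SQLContext object) and that the session was started with the DSS jars available. `dataframe_from_session` is a hypothetical helper name, not part of the Dataiku API. In PySpark 3.x, `SQLContext` is deprecated but still present, and its constructor takes the session as an optional second argument.

```python
def dataframe_from_session(spark, dataset_name):
    """Read a DSS dataset as a Spark dataframe, starting from a SparkSession.

    Assumption: dkuspark.get_dataframe works with an SQLContext that wraps
    an existing SparkSession.
    """
    # Imported here so the helper can be defined outside a DSS notebook
    import dataiku
    import dataiku.spark as dkuspark
    from pyspark.sql import SQLContext

    # Wrap the existing session instead of creating a second SparkContext
    sql_context = SQLContext(spark.sparkContext, sparkSession=spark)
    dataset = dataiku.Dataset(dataset_name)
    return dkuspark.get_dataframe(sql_context, dataset)
```

With the session from the question, this would be called as `df = dataframe_from_session(spark, "mydataset")`. It does not address the missing-jar error by itself; that still depends on the Spark integration being in place.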