
Read as Spark dataframe

Solved!
Dawood154
Level 1

I installed Spark in a notebook environment. When I create a new PySpark notebook, I get the following starter code:

import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

dataset = dataiku.Dataset("name_of_the_dataset")
df = dkuspark.get_dataframe(sqlContext, dataset)

The issue is that I have Spark version 3.2.1, and since Spark 2.0, SparkSession has been the entry point for programming with DataFrames and Datasets. So I am creating a Spark session as follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()  # cluster ip

Therefore, running the following line gives me an error:

df = dkuspark.get_dataframe(sqlContext, dataset)

 

Error:

Py4JJavaError: An error occurred while calling o32.classForName. : java.lang.ClassNotFoundException: com.dataiku.dip.spark.StdDataikuSparkContext

 

0 Kudos
1 Solution
fchataigner2
Dataiker

Hi,

The spark-submit arguments aren't passing the needed jars to Spark, which means you probably haven't set up the integration of Spark with DSS (see https://doc.dataiku.com/dss/latest/spark/installation.html ). On a related note, make sure you don't install pyspark as a package in your code env, since that is handled by the install-spark-integration script.
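For reference, the integration step from the linked documentation is run on the DSS server from the data directory. This is a rough sketch (the data directory path is illustrative; check the doc for your DSS version):

```shell
# Illustrative DSS data directory -- substitute your own
DATADIR=/path/to/dss_data

# Stop DSS, run the Spark integration script, then restart
$DATADIR/bin/dss stop
$DATADIR/bin/dssadmin install-spark-integration
$DATADIR/bin/dss start
```

After this, new PySpark notebooks should have the Dataiku Spark jars on the classpath.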


3 Replies
Dawood154
Level 1
Author

Hi,

I did the Spark integration with DSS. I am creating a Spark session as mentioned above. I need the updated DSS code to import data as a Spark dataframe. I've read the documentation, but I can't seem to find the answer.

0 Kudos
fchataigner2
Dataiker

Once you have your Spark SQLContext object, you can simply:

import dataiku
import dataiku.spark as dkuspark
# Example: Read the descriptor of a Dataiku dataset
mydataset = dataiku.Dataset("mydataset")
# And read it as a Spark dataframe
df = dkuspark.get_dataframe(sqlContext, mydataset)
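Putting the two halves of the thread together: if you start from a SparkSession (the Spark 2.0+ entry point), you can still hand dkuspark the SQLContext it expects by wrapping the session's SparkContext. This is a sketch that only runs inside a DSS PySpark notebook with the Spark integration installed (the dataset name is the placeholder from the original post):

```python
import dataiku
import dataiku.spark as dkuspark
from pyspark.sql import SparkSession, SQLContext

# SparkSession is the modern entry point (Spark 2.0+)
spark = SparkSession.builder.getOrCreate()

# dkuspark.get_dataframe expects an SQLContext; SQLContext is deprecated
# in Spark 3.x but can still be built from the session's SparkContext
sqlContext = SQLContext(spark.sparkContext)

dataset = dataiku.Dataset("name_of_the_dataset")
df = dkuspark.get_dataframe(sqlContext, dataset)
```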
0 Kudos