Read as Spark dataframe
I installed Spark in a notebook environment. On creating a new PySpark notebook I get the following starter code:
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

dataset = dataiku.Dataset("name_of_the_dataset")
df = dkuspark.get_dataframe(sqlContext, dataset)
The issue is that I have Spark version 3.2.1, and since Spark 2.0 SparkSession has been the entry point for programming with DataFrames and Datasets. So I create a Spark session as follows:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()  # local master; replace with the cluster master URL if needed
Therefore, running the following line gives me an error:
df = dkuspark.get_dataframe(sqlContext, dataset)
Error:
Py4JJavaError: An error occurred while calling o32.classForName.
: java.lang.ClassNotFoundException: com.dataiku.dip.spark.StdDataikuSparkContext
Best Answer
-
Hi,
The spark-submit arguments aren't passing the needed Dataiku jars to Spark, which means you probably haven't completed the integration of Spark with DSS (see https://doc.dataiku.com/dss/latest/spark/installation.html ). On a related note, make sure you don't install pyspark as a package in your code env, since that is handled by the install-spark-integration script.
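As a quick sanity check, here is a minimal sketch (standard PySpark/py4j calls, using the class name from your stack trace) that asks the driver JVM whether the Dataiku class is on the classpath:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Look up the class named in the ClassNotFoundException above.
# If this raises, the Dataiku Spark integration jars were not passed to Spark
# and install-spark-integration needs to be (re)run.
try:
    sc._jvm.java.lang.Class.forName("com.dataiku.dip.spark.StdDataikuSparkContext")
    print("Dataiku Spark integration jars are on the classpath.")
except Exception as exc:
    print("Dataiku jars missing; redo the Spark integration:", exc)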
Answers
-
Hi,
I did the Spark integration with DSS. I am creating a Spark session as mentioned above. I need the updated DSS code to import data as a Spark dataframe. I've read the documentation, but I can't seem to find the answer.
-
Once you have your Spark SQLContext object, you can simply:
import dataiku
import dataiku.spark as dkuspark

# Example: Read the descriptor of a Dataiku dataset
mydataset = dataiku.Dataset("mydataset")

# And read it as a Spark dataframe
df = dkuspark.get_dataframe(sqlContext, mydataset)
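If you are creating your own SparkSession (as in your snippet), a minimal sketch of one way to bridge the two APIs, assuming Spark >= 2.0 and a completed DSS integration ("mydataset" is a placeholder name):

import dataiku
import dataiku.spark as dkuspark
from pyspark.sql import SparkSession, SQLContext

# Reuse the session configured by DSS / spark-submit instead of forcing local[1]
spark = SparkSession.builder.getOrCreate()

# Legacy wrapper around the same SparkContext; deprecated in Spark 3.x but still functional
sqlContext = SQLContext(spark.sparkContext)

mydataset = dataiku.Dataset("mydataset")  # placeholder dataset name
df = dkuspark.get_dataframe(sqlContext, mydataset)
df.printSchema()

Calling getOrCreate() without .master(...) lets the notebook pick up the master URL and jars that the DSS Spark integration configures; building a fresh local[1] session bypasses those jars, which is exactly what produces the ClassNotFoundException above.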