Read as Spark dataframe

Dawood154 Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 9 ✭✭✭

I installed Spark in a notebook environment. When I create a new PySpark notebook, I get the following starter code:

.../
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

dataset = dataiku.Dataset("name_of_the_dataset")
df = dkuspark.get_dataframe(sqlContext, dataset)
.../

The issue is that I have Spark version 3.2.1, and since Spark 2.0, SparkSession has been the entry point for programming with DataFrames and Datasets. So I am creating a Spark session as follows:

spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate() # cluster ip

Running the following line then gives me an error:

df = dkuspark.get_dataframe(sqlContext, dataset)

Error:

Py4JJavaError: An error occurred while calling o32.classForName. : java.lang.ClassNotFoundException: com.dataiku.dip.spark.StdDataikuSparkContext

Answers

  • Dawood154
    Dawood154 Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 9 ✭✭✭

    Hi,

    I have set up the Spark integration with DSS, and I am creating a Spark session as described above. I need the updated DSS code to import data as a Spark dataframe. I've read the documentation, but I can't seem to find the answer.

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    edited July 17

    Once you have your Spark SQLContext object, you can simply do:

    import dataiku
    import dataiku.spark as dkuspark
    # Example: Read the descriptor of a Dataiku dataset
    mydataset = dataiku.Dataset("mydataset")
    # And read it as a Spark dataframe
    df = dkuspark.get_dataframe(sqlContext, mydataset)
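
For a notebook that starts from a SparkSession rather than a SparkContext, one possible way to obtain the SQLContext expected by `dkuspark.get_dataframe` is to wrap the session, roughly as sketched below. This is only a sketch under assumptions: the helper name `read_dataset_as_spark_df` and its parameters are hypothetical, `SQLContext` is deprecated in Spark 3.x (it still exists but may emit a warning), and the code is assumed to run inside a DSS notebook where the `dataiku` package is available. It does not by itself address the `ClassNotFoundException` above, which indicates that the JVM could not find the DSS class on its classpath.

```python
def read_dataset_as_spark_df(dataset_name, app_name="dss-notebook"):
    """Read a Dataiku dataset as a Spark dataframe, starting from a SparkSession.

    Hypothetical helper, not part of the DSS API. The imports are kept inside
    the function so that it can be defined outside of a DSS/Spark environment.
    """
    from pyspark.sql import SparkSession, SQLContext

    import dataiku
    import dataiku.spark as dkuspark

    # Reuse the running session if there is one, otherwise create it.
    spark = SparkSession.builder.appName(app_name).getOrCreate()

    # Wrap the session in a SQLContext, since get_dataframe is called with one
    # in the thread above. SQLContext is deprecated in Spark 3.x.
    sql_context = SQLContext(spark.sparkContext, sparkSession=spark)

    # Read the dataset descriptor, then load it as a Spark dataframe.
    dataset = dataiku.Dataset(dataset_name)
    return dkuspark.get_dataframe(sql_context, dataset)
```

Calling `read_dataset_as_spark_df("name_of_the_dataset")` would then stand in for the last two lines of the starter code. No master URL is set here, on the assumption that the Spark configuration provided by the environment should be used.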