PySpark Setup via Dataiku: dkuspark.getdataframe() error

ewminner · August 2021

Hi All,

I'm just starting out with PySpark (and with Dataiku), and debugging with both the Dataiku and PySpark documentation has been quite a challenge. After a lot of searching, it seems my error may be specific to the Dataiku platform. I want to convert a table from a Redshift/SQL server, which I defined in my Dataiku workflow, into a PySpark dataframe. Very simple, right? Well...

All of this code is generated by default when you create a PySpark recipe.

First importing:

"import pyspark
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext"

Then I create the necessary context:

"sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)"

Here is where the issue arises:

"# Read recipe inputs
some_table = dataiku.Dataset("some_table")

df = dkuspark.get_dataframe(sqlContext, some_table)"

When I run this last "get_dataframe" call, Py4J throws the following error:

An error occurred while calling o27.classForName. Trace:
py4j.Py4JException: Method classForName([class java.lang.String]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

I spent a couple of hours trying to understand what could have happened, and it appears that PySpark may be having trouble reading a column type, but I'm not sure. I have no idea how this translates to the Dataiku platform, so any help would be greatly appreciated. Thank you!

Clément_Stenac · August 2021

Hi,

This indicates that your setup is incomplete or incorrect. Could you please describe your setup, notably whether you are using a Hadoop cluster or elastic Kubernetes compute with our Spark-standalone packages?

ewminner · August 2021

Thanks for the quick response! How would you recommend that I gather this info?

Clément_Stenac · August 2021

If you are using our Dataiku Online offer, please click on the chat icon in the bottom right so that we can investigate your issue further.
If you are using a self-hosted version of Dataiku and you are not the person who set up your Dataiku instance, we'd recommend talking to that person, who can look into this and, if needed, open a support ticket with us to go into more detail about your setup (https://doc.dataiku.com/dss/latest/troubleshooting/obtaining-support.html).
If you are the person who set up and manages this Dataiku instance, there are some steps needed to set up Spark. These depend on your environment, your infrastructure, how Dataiku will interact with your computation environment, and so on. If you are a Dataiku customer or prospect, please feel free to get in touch with your Dataiku account team, who can provide assistance on this.
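
As a starting point for gathering the setup details asked about above, a small sketch like the following, run from a Python notebook or recipe on the instance, could report which PySpark version and Spark-related environment variables the Python process sees. The function name collect_spark_env_info is hypothetical, and the environment variables listed are only the ones commonly used by Spark and Hadoop installations; this is an assumption about what may be useful, not a Dataiku-provided diagnostic, and it does not replace what the instance administrator knows about the cluster.

```python
import importlib
import os
import sys


def collect_spark_env_info():
    """Return a dict describing the Spark-related environment seen by Python.

    Hypothetical helper for illustration only; it inspects the local Python
    process and does not contact any cluster.
    """
    info = {
        "python_version": sys.version.split()[0],
        # Environment variables commonly used to locate Spark and Hadoop.
        # A value of None means the variable is not set for this process.
        "SPARK_HOME": os.environ.get("SPARK_HOME"),
        "HADOOP_CONF_DIR": os.environ.get("HADOOP_CONF_DIR"),
        "PYSPARK_PYTHON": os.environ.get("PYSPARK_PYTHON"),
    }
    try:
        # Report the version of the pyspark package importable from Python.
        pyspark = importlib.import_module("pyspark")
        info["pyspark_version"] = pyspark.__version__
    except ImportError:
        # pyspark is not importable from this Python environment.
        info["pyspark_version"] = None
    return info


if __name__ == "__main__":
    for key, value in collect_spark_env_info().items():
        print(f"{key}: {value}")
```

The printed values can be shared with whoever administers the instance or attached to a support ticket.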

ewminner · August 2021

Thanks a lot for the info! I'll be in touch
