I'm just starting out with PySpark (and with Dataiku), and debugging via both the Dataiku and PySpark documentation has been quite the challenge. After a lot of searching, it seems my error may be specific to the Dataiku platform. I want to convert a table from a Redshift/SQL Server source that I defined in my Dataiku flow into a PySpark dataframe. Very simple, right? Well...
All of this code is generated by default when you create the PySpark recipe:
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
Then I create the necessary context:
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
Here is where the issue arises:
# Read recipe inputs
some_table = dataiku.Dataset("some_table")
df = dkuspark.get_dataframe(sqlContext, some_table)
After running this last get_dataframe call, Py4J throws the following error:
An error occurred while calling o27.classForName. Trace:
py4j.Py4JException: Method classForName([class java.lang.String]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
I spent a couple of hours trying to understand what could have happened. Maybe PySpark is having trouble reading in a column type? I'm really not sure. I have no idea how this translates to the Dataiku platform, so any help would be tremendous. Thank you!
This indicates that your setup is incomplete or incorrect. Could you please detail your setup, notably whether you are using a Hadoop cluster or elastic Kubernetes compute with our Spark-standalone packages?
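While gathering those details, it may also help to rule out a Spark version mismatch: Py4J errors of the form "Method ... does not exist" (including classForName) often appear when the pip-installed pyspark on the Python side differs from the Spark runtime the gateway connects to. A minimal sketch of such a check; the versions_match helper is hypothetical (not a Dataiku or Spark API), and in a live session you would feed it pyspark.__version__ (Python side) and sc.version (JVM side):

```python
# Hypothetical helper: compares the major.minor components of two
# Spark version strings. A mismatch here is a common cause of
# Py4J "Method ... does not exist" errors, because the Python-side
# py4j wrappers no longer line up with the JVM-side methods.
def versions_match(python_side, jvm_side):
    major_minor = lambda v: tuple(int(part) for part in v.split(".")[:2])
    return major_minor(python_side) == major_minor(jvm_side)

# In a live session, compare pyspark.__version__ against sc.version:
print(versions_match("2.4.3", "2.4.0"))  # True: same major.minor
print(versions_match("2.4.3", "3.1.2"))  # False: mismatched runtimes
```

If the two versions disagree, aligning the code environment's pyspark with the cluster's Spark version is the first thing to try.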