Legacy API calls in Spark
jmccartin
Registered Posts: 19 ✭✭✭✭
In Dataiku's default PySpark recipes, dataiku.spark's get_dataframe takes a sqlContext and returns a Spark DataFrame. This has been a legacy call to the API since Spark 2.0, as the entry point for SQL operations is now the SparkSession, which has a few subtle but important differences. While one can create a SparkSession manually, it doesn't appear to work with Dataiku's dataframe API.
Please see here for specifics: https://spark.apache.org/docs/2.3.2/api/python/pyspark.sql.html#pyspark.sql.SQLContext
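To illustrate, here is a minimal sketch of the pattern in question (dataset and variable names are placeholders, not from an actual recipe):

```python
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession

# Legacy Spark 1.x-style entry point used by the default recipe template
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# "my_dataset" is a placeholder dataset name
ds = dataiku.Dataset("my_dataset")
df = dkuspark.get_dataframe(sqlContext, ds)  # expects a SQLContext

# The Spark 2.x entry point can be created manually...
spark = SparkSession.builder.getOrCreate()
# ...but there is no obvious way to pass it to get_dataframe instead.
```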
Answers
Hi,
It is indeed a deliberate choice, as a lot of people still use Spark 1.6 (which remains the default version in some widely used Hadoop distributions). This way, the recipes you write for one Spark version still work if you switch (via project export, automation node, or if your recipe is in a plugin or code sample) to another Dataiku instance that has a different Spark version.
Is there something specific that you can't do with the SQLContext / SparkContext and for which you'd need the SparkSession?
That's a valid answer that I had not considered (those poor folks still on 1.6!). It's not a major problem, but IIRC there are some ways of operating on SQL/Hive tables that will not be forward compatible when going through the sqlContext (it will eventually be deprecated).
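For what it's worth, on a Spark 2.x cluster the SQLContext is a thin wrapper around a SparkSession, so a sketch like the following (assuming the sqlContext created by the default recipe template) can recover the session without changing the get_dataframe call:

```python
# Works on Spark 2.x only: SQLContext.sparkSession does not exist on 1.6.
spark = sqlContext.sparkSession

# Session-based SQL/Hive access, while get_dataframe keeps using sqlContext.
spark.sql("SHOW TABLES").show()
```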