Legacy API calls in Spark

Registered Posts: 19 ✭✭✭✭
In Dataiku's default PySpark recipes, dataiku.spark's get_dataframe takes a sqlContext to return a Spark DataFrame. This has been a legacy API call since Spark 2.0, where the entry point for SQL operations is the SparkSession, which has a few subtle but important differences. While one can create a SparkSession manually, it doesn't appear to work with Dataiku's dataframe API.

Please see here for specifics: https://spark.apache.org/docs/2.3.2/api/python/pyspark.sql.html#pyspark.sql.SQLContext
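
For reference, the pattern in question is the one the default recipe template generates; a minimal sketch of it is below (the dataset name "my_input" is a placeholder):

    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    # Spark 1.x-style entry points: a SparkContext wrapped in a SQLContext
    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # get_dataframe takes the SQLContext plus a Dataset handle
    # ("my_input" is a placeholder dataset name)
    input_dataset = dataiku.Dataset("my_input")
    df = dkuspark.get_dataframe(sqlContext, input_dataset)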

Answers

  • Dataiker, Alpha Tester Posts: 196
    Hi,

    It is indeed a deliberate choice, as a lot of people still use Spark 1.6 (still the default version in some widely-used Hadoop distributions). This way, the recipes you write for one Spark version still work if you switch (via project export, an automation node, or a recipe shipped in a plugin or code sample) to another Dataiku instance that has a different Spark version.

    Is there something specific that you can't do with the SQL Context / Spark Context and for which you'd need the Spark Session?
  • Registered Posts: 19 ✭✭✭✭
    That's a valid answer that I did not consider (those poor folks still on 1.6!). It's not a major problem, but IIRC there are some ways of working with SQL/Hive tables that won't be forward compatible when going through the sqlContext (it will eventually be deprecated); see the sketch below for one way to reach the SparkSession in the meantime.
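
    For what it's worth, on a Spark 2.x instance one can still reach the SparkSession that backs the existing SparkContext without abandoning the sqlContext-based get_dataframe call; a minimal sketch, assuming Spark 2.x:

        from pyspark import SparkContext
        from pyspark.sql import SQLContext, SparkSession

        # The Spark 1.x-style entry points the recipe template builds
        sc = SparkContext.getOrCreate()
        sqlContext = SQLContext(sc)

        # On Spark 2.x, builder.getOrCreate() returns the session backing the
        # existing SparkContext, so it can coexist with the SQLContext above
        spark = SparkSession.builder.getOrCreate()

        # Session-style calls (spark.sql, spark.catalog, ...) then go through
        # `spark`, while dkuspark.get_dataframe keeps using `sqlContext`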
