How to execute SQL in PySpark
I'm using Python recipe to find max dates from a list of tables from Hive. Code below is working but runs into random memory issue (vertex failed issue).
I would like to switch to PySpark to address memory issue. I couldn't figure out how to run SQLs in PySpark. Any pointers would be helpful.
Runs in Python recipe
from dataiku.core.sql import SQLExecutor2
executor = SQLExecutor2(dataset=dataiku.Dataset("dummy_input_table"))
max_dt = executor.query_to_df("Select max(process_date) from dummy_input_table")
Tried this in PySpark recipe
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
dkuspark.get_dateframe(sc, dynamic_sql)
# getting this error <class 'AttributeError'>: module 'dataiku.spark' has no attribute 'get_dateframe'
Answers
-
Hi,
Given your issue, have you tried using a visual recipe ? A group recipe can compute the minimum and the maximum of a column, leveraging all computation engines. To do so, just create a group recipe from dummy_dataset (you'll have to enter a "Group by" column in the recipe creation screen, but you'll later be able to remove it). In the recipe settings, remove the grouping column and select min and max aggregation for the process_date column. Finally, just select your execution engine (Spark engine if you want to use Spark) and run the recipe.
When it comes to your code, you won't be able to use the SQL executor, which is meant to use python to generate a dynamic sql query, and then outsource the query execution to the SQL database where the data is stored (which corresponds to using the SQL engine in a visual recipe). If you want to use spark, then you want to use the spark engine to do your computations. To do so, you can use either a pyspark recipe to write instructions in python, or sparkSQL recipe to write instructions in SQL. Both will be translated to spark code in the end, so you don't need to use pyspark to run sql on spark.
To come back to your code, the issue you link is due to a typo in your code, the method you need to call is get_dataframe, hence why you get an error saying that the method get_dateframe doesn't exist.
Hoping to have clarified what you can do to answer your need.
Best,