Running pyspark query strings from a DSS notebook
Hi there,
I want to:
- Use pyspark (NOT SnowSQL or anything else – Spark is the best solution for the work I'm doing) to query a table in Snowflake (something like this would be perfect):
https://docs.snowflake.com/en/user-guide/spark-connector-use.html#configuring-the-pyspark-script
- From a notebook (NOT a recipe)
- Specifying tables as query strings, e.g. “select * from <table_name>” (NOT using a manually defined DSS dataset)
I was able to create the SparkSession etc. and got this far:
spark.sql("select * from my_db.my_schema.my_table limit 3")
AnalysisException: The namespace in session catalog must have exactly one name part: my_db.my_schema.my_table;
It didn’t throw a ‘Table or view not found’ error, so it can evidently see the table; beyond that I don't know what the error means, and a Google search didn't turn up anything useful.
So then I turned to the DSS docs and came across this:
https://knowledge.dataiku.com/latest/kb/data-prep/pyspark/index.html
Which has this:
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Point to the DSS datasets
users_ds = dataiku.Dataset("users")
movies_ds = dataiku.Dataset("movies")
ratings_ds = dataiku.Dataset("ratings")

# And read them as Spark dataframes
users = dkuspark.get_dataframe(sqlContext, users_ds)
movies = dkuspark.get_dataframe(sqlContext, movies_ds)
ratings = dkuspark.get_dataframe(sqlContext, ratings_ds)
Which is not a viable solution, as I don’t want to have to define a Dataiku dataset for every table in our database I might want to query (there are hundreds of them). I just want to run a query string and get the results back as a Spark dataframe, exactly as I would normally do outside DSS and as I’ve attempted above.
Is this possible?
Thanks
Operating system used: Linux
Answers
Hi
The Spark–Snowflake integration is documented on Snowflake's site at https://docs.snowflake.com/en/user-guide/spark-connector-use.html#using-the-connector-with-python . In DSS, you'll also need the Spark integration checkbox checked on the Snowflake connection.
In essence, instead of doing df = spark.sql(....) like you planned, you need to do df = spark.read.format("snowflake").option("query", ....).load()
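For anyone landing here later, here is a minimal sketch of what that looks like end to end, based on the Snowflake Spark connector docs linked above. The sfOptions values are placeholders for your own account and connection details (in DSS, the Snowflake connection may already supply some of them), and depending on the connector version you may need the fully qualified source name rather than the short "snowflake" alias:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection options -- replace with your own account details
sfOptions = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "my_db",
    "sfSchema": "my_schema",
    "sfWarehouse": "<warehouse>",
}

# Fully qualified connector name from the Snowflake docs; the short "snowflake" alias may also work
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

# Push the query string down to Snowflake and read the result back as a Spark dataframe
df = (spark.read
      .format(SNOWFLAKE_SOURCE_NAME)
      .options(**sfOptions)
      .option("query", "select * from my_db.my_schema.my_table limit 3")
      .load())

df.show()

The "query" option pushes the SQL down to Snowflake, so only the result set comes back into Spark; to pull a whole table you can use .option("dbtable", "my_table") instead.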