You now have until September 15th to submit your use case or success story to the 2022 Dataiku Frontrunner Awards!ENTER YOUR SUBMISSION

Running pyspark query strings from a DSS notebook

jective
Level 1
Running pyspark query strings from a DSS notebook

Hi there,

I want to:
- Use pyspark (NOT SnowSQL or anything else – Spark is the best solution for the work I'm doing) to query a table in Snowflake (something like this would be perfect):
https://docs.snowflake.com/en/user-guide/spark-connector-use.html#configuring-the-pyspark-script
- From a notebook (NOT a recipe)
- Specifying tables as query strings, e.g. “select * from <table_name>” (NOT using a manually defined DSS dataset)

I was able to create the SparkSession etc. and got this far:

spark.sql("select * from my_db.my_schema.my_table limit 3")
AnalysisException: The namespace in session catalog must have exactly one name part: my_db.my_schema.my_table;

It didn’t throw a ‘Table or view not found’ error so it can see the table is there; apart from that I don't know what the error means. A Google search didn't return much useful.

So then I turned to the DSS docs and came across this:

https://knowledge.dataiku.com/latest/kb/data-prep/pyspark/index.html

Which has this:

# Point to the DSS datasets
users_ds = dataiku.Dataset("users")
movies_ds = dataiku.Dataset("movies")
ratings_ds = dataiku.Dataset("ratings")

# And read them as a Spark dataframes
users = dkuspark.get_dataframe(sqlContext, users_ds)
movies = dkuspark.get_dataframe(sqlContext, movies_ds)
ratings = dkuspark.get_dataframe(sqlContext, ratings_ds)

Which is not a viable solution as I don’t want to have to define a Dataiku dataset for every table in our database I might want to query (there are 100’s of them). I just want to be able to run a query string and return the results as a Spark dataframe like I would usually do if I weren't using DSS, exactly like I’ve done above.

Is this possible?

Thanks


Operating system used: Linux

1 Reply
fchataigner2
Dataiker
Dataiker

Hi

the spark-snowflake integration is documented on Snowflake's site at https://docs.snowflake.com/en/user-guide/spark-connector-use.html#using-the-connector-with-python . In DSS you'll need to have the spark integration checkbox checked on the Snowflake connection.

In essence, instead of doing df = spark.sql(....) like you planned, you need to do df = spark.read.format("snowflake").option("query", ....).load()