
Running pyspark query strings from a DSS notebook


Hi there,

I want to:
- Use pyspark (NOT SnowSQL or anything else – Spark is the best fit for the work I'm doing) to query a table in Snowflake
- From a notebook (NOT a recipe)
- Specifying tables as query strings, e.g. “select * from <table_name>” (NOT using a manually defined DSS dataset)

I was able to create the SparkSession etc. and got this far:

spark.sql("select * from my_db.my_schema.my_table limit 3")
AnalysisException: The namespace in session catalog must have exactly one name part: my_db.my_schema.my_table;

It didn't throw a 'Table or view not found' error, so presumably it can see that the table is there; beyond that, I don't know what the error means, and a Google search didn't turn up anything useful.

So then I turned to the DSS docs and came across this example:

# Point to the DSS datasets
users_ds = dataiku.Dataset("users")
movies_ds = dataiku.Dataset("movies")
ratings_ds = dataiku.Dataset("ratings")

# And read them as Spark dataframes
users = dkuspark.get_dataframe(sqlContext, users_ds)
movies = dkuspark.get_dataframe(sqlContext, movies_ds)
ratings = dkuspark.get_dataframe(sqlContext, ratings_ds)

This is not a viable solution, as I don't want to have to define a Dataiku dataset for every table in our database I might want to query (there are hundreds of them). I just want to run a query string and get the results back as a Spark dataframe, exactly as I would do outside DSS, and exactly as in the snippet above.

Is this possible?


Operating system used: Linux

1 Reply


The spark-snowflake integration is documented on Snowflake's site at . In DSS, you'll need to have the Spark integration checkbox checked on the Snowflake connection.

In essence, instead of doing df = spark.sql(....) like you planned, you need to do df = spark.read.format("snowflake").option("query", ....).load()
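Putting that together, here's a rough sketch of what the notebook cell could look like (untested against a live account; the connection values are placeholders, and the sf_options keys follow the connector's documented option names -- adapt as needed):

```python
# Sketch: return arbitrary query results from Snowflake as a Spark
# DataFrame via the spark-snowflake connector -- no DSS dataset needed.
# All connection values below are placeholders.
SNOWFLAKE_SOURCE_NAME = "snowflake"  # short name registered by the connector

sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",  # placeholder
    "sfUser": "<user>",                           # placeholder
    "sfPassword": "<password>",                   # placeholder
    "sfDatabase": "my_db",
    "sfSchema": "my_schema",
    "sfWarehouse": "<warehouse>",                 # placeholder
}

def snowflake_query_df(spark, query, options=sf_options):
    """Run `query` on Snowflake, return the result as a Spark DataFrame."""
    return (spark.read.format(SNOWFLAKE_SOURCE_NAME)
            .options(**options)
            .option("query", query)
            .load())

# Usage, inside a notebook with an active SparkSession `spark`:
# df = snowflake_query_df(spark, "select * from my_table limit 3")
```

Note that the database and schema go into the connector options rather than into the three-part table name, which is what tripped up spark.sql() in the first place.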