Hello,
I am currently creating PySpark jobs that do not have defined input datasets in my PySpark notebooks. The tables are queried with Spark SQL inside the notebook itself. I want to find out whether there is a way to access all the tables queried via Spark SQL across multiple projects. See screen shot below. The 'SELECT * FROM DB.TABLE' is where I am trying to grab the datasets being used. As you can see in the screen shot, there are no inputs in the PySpark notebook.
Hi @zjacobs23 ,
In DSS you should ideally add the required datasets as inputs to your recipe where possible. If they are in other projects, you should use the dataset sharing feature:
https://doc.dataiku.com/dss/latest/security/shared-objects.html
Once added, you can read these as per the steps described here:
https://doc.dataiku.com/dss/latest/code_recipes/pyspark.html#anatomy-of-a-basic-pyspark-recipe
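To illustrate the approach from the docs above, here is a minimal sketch of a PySpark recipe that reads a declared input dataset and then queries it with Spark SQL, so you keep your existing SQL while DSS still tracks the dataset as an input. The dataset name `my_shared_dataset` is a placeholder; replace it with the name of the dataset you added as an input (e.g. one shared from another project).

```python
# Sketch of a PySpark recipe reading a DSS input dataset,
# assuming "my_shared_dataset" has been added as a recipe input.
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the declared input dataset as a Spark DataFrame
dataset = dataiku.Dataset("my_shared_dataset")
df = dkuspark.get_dataframe(sqlContext, dataset)

# Register a temp view so existing Spark SQL keeps working
df.createOrReplaceTempView("my_shared_dataset")
result = sqlContext.sql("SELECT * FROM my_shared_dataset")
```

Because the dataset is now a declared input, DSS can show the dependency in the Flow across projects, which is what direct `spark.sql("SELECT * FROM DB.TABLE")` calls bypass.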
If you want to execute Spark SQL, you can use a Spark SQL recipe: https://doc.dataiku.com/dss/latest/code_recipes/sparksql.html#creating-a-sparksql-recipe
Thanks