Scraping PySpark jobs without input datasets
Hello,
I am currently creating PySpark jobs that do not have defined input datasets within my PySpark notebooks. The tables are queried with Spark SQL inside the notebook itself. I want to see if there is a way to access all the tables queried through Spark SQL across multiple projects. See the screenshot below. The 'SELECT * FROM DB.TABLE' is where I am trying to grab the datasets being used. As you can see in the screenshot, there are no inputs in the PySpark notebook.
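For illustration, a minimal sketch of the pattern described, with a placeholder table name (the real notebooks would reference whatever tables the screenshot shows):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The table is referenced directly inside the Spark SQL string,
# so DSS does not see it as an input dataset of the notebook/recipe
df = spark.sql("SELECT * FROM DB.TABLE")
df.show()
```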
Answers
Alexandru (Dataiker), Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226
Hi @zjacobs23,
In DSS you should ideally add the required datasets as inputs to your recipe where possible. If they are in other projects, you should use the shared datasets feature:
https://doc.dataiku.com/dss/latest/security/shared-objects.html
Once added, you can read them following the steps described here:
https://doc.dataiku.com/dss/latest/code_recipes/pyspark.html#anatomy-of-a-basic-pyspark-recipe
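Once the shared dataset is declared as a recipe input, it can be read into a Spark DataFrame following the anatomy described on that page; a minimal sketch, where "shared_dataset" is a placeholder for the dataset shared from the other project:

```python
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the recipe input (the dataset shared from another project)
shared_dataset = dataiku.Dataset("shared_dataset")
df = dkuspark.get_dataframe(sqlContext, shared_dataset)

df.printSchema()
```

Because the dataset is now a declared input, DSS can track it in the flow and across projects instead of it being hidden inside a Spark SQL string.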
If you want to execute Spark SQL, you can use a Spark SQL recipe: https://doc.dataiku.com/dss/latest/code_recipes/sparksql.html#creating-a-sparksql-recipe
Thanks