Scraping Pyspark jobs without input data sets

zjacobs23
Level 1

Hello,

I am currently creating PySpark jobs that do not have defined input datasets in my PySpark notebooks. The tables are queried with Spark SQL inside the notebook itself. I want to see if there is a way to access all the tables referenced in the Spark SQL across multiple projects. See the screenshot below. The 'SELECT * FROM DB.TABLE' statement is where I am trying to grab the datasets being used. As you can see in the screenshot, there are no inputs defined on the PySpark notebook.

[Screenshot: PySpark notebook reading tables via Spark SQL, with no input datasets defined]
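In case the screenshot doesn't come through, the pattern in the notebook is roughly the following (the table name is just the placeholder from above):

```python
from pyspark.sql import SparkSession

# Spark session provided by the notebook environment
spark = SparkSession.builder.getOrCreate()

# Tables are pulled directly with Spark SQL, so nothing is
# declared as an input dataset on the notebook itself
df = spark.sql("SELECT * FROM DB.TABLE")
df.show()
```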

1 Reply
AlexT
Dataiker

Hi @zjacobs23,
In DSS you should ideally add the required datasets as inputs to your recipe where possible. If they live in other projects, use the dataset sharing feature:
https://doc.dataiku.com/dss/latest/security/shared-objects.html

Once added, you can read them following the steps described here:
https://doc.dataiku.com/dss/latest/code_recipes/pyspark.html#anatomy-of-a-basic-pyspark-recipe
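Following that doc, a minimal read looks roughly like this (the dataset name is a placeholder for whatever you added as the recipe input):

```python
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# "my_shared_dataset" stands in for the dataset you added as
# a recipe input (shared from the other project)
ds = dataiku.Dataset("my_shared_dataset")
df = dkuspark.get_dataframe(sqlContext, ds)
```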

If you want to execute Spark SQL, you can use the SparkSQL recipe: https://doc.dataiku.com/dss/latest/code_recipes/sparksql.html#creating-a-sparksql-recipe
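If you'd rather keep the Spark SQL inside your PySpark code, one option is to register the input DataFrame as a temporary view and query that instead (the view name here is a placeholder):

```python
# Continuing from the df read above: expose it to Spark SQL
df.createOrReplaceTempView("my_table")

# The same SELECT-style query works, but the underlying
# dataset is now a declared input, so DSS can track it
result = sqlContext.sql("SELECT * FROM my_table")
result.show()
```

That way the tables show up as inputs on the recipe, so you can see what each project is actually using.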

Thanks