Scraping PySpark jobs without input datasets
Hello,
I am currently creating PySpark jobs that do not have defined input datasets within my PySpark notebooks. The tables are queried with Spark SQL inside the notebook itself. I want to see if there is a way to access all the tables queried through Spark SQL across multiple projects. See the screenshot below. The 'SELECT * FROM DB.TABLE' is where I am trying to grab the datasets being used. As you can see in the screenshot, there are no inputs in the PySpark notebook.
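For illustration, a minimal sketch of the pattern described, with a placeholder table name (the real notebooks would reference whatever tables the screenshot shows):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The table is referenced directly inside the Spark SQL string,
# so DSS does not see it as an input dataset of the notebook/recipe
df = spark.sql("SELECT * FROM DB.TABLE")
df.show()
```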
Answers
Alexandru (Dataiker), Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226
Hi @zjacobs23,
In DSS you should ideally add the required datasets as inputs to your recipe where possible. If they are in other projects, you should use the shared datasets feature:
https://doc.dataiku.com/dss/latest/security/shared-objects.html
Once added, you can read them following the steps described here:
https://doc.dataiku.com/dss/latest/code_recipes/pyspark.html#anatomy-of-a-basic-pyspark-recipe
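Once the shared dataset is declared as a recipe input, it can be read into a Spark DataFrame following the anatomy described on that page; a minimal sketch, where "shared_dataset" is a placeholder for the dataset shared from the other project:

```python
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the recipe input (the dataset shared from another project)
shared_dataset = dataiku.Dataset("shared_dataset")
df = dkuspark.get_dataframe(sqlContext, shared_dataset)

df.printSchema()
```

Because the dataset is now a declared input, DSS can track it in the flow and across projects instead of it being hidden inside a Spark SQL string.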
If you want to execute Spark SQL, you can use a Spark SQL recipe: https://doc.dataiku.com/dss/latest/code_recipes/sparksql.html#creating-a-sparksql-recipe
Thanks