
Spark packages with DSS ?

Level 1
How can I add Spark packages so that they are available in my recipes and notebooks?
Dataiker
Hi,

By Spark packages, I assume you mean Python (or R) packages that you want to use within PySpark notebooks and recipes.

There are two cases to distinguish:

* If you only use the packages on the driver side of your Spark job (for example, to post-process the output of a Spark DataFrame), you simply need to follow the regular Python package installation procedure. See https://doc.dataiku.com/dss/latest/installation/python.html

* If you want to use these packages on the executor side (in a function used as a UDF that actually processes the Spark DataFrame), the package must be available in the Python environment used by your executors. By default, that is the system Python *of each machine of your Spark cluster*. In other words, by default, you would need to run "sudo pip install MYPKG" on every machine of your cluster.

http://spark.apache.org/docs/latest/configuration.html#environment-variables gives some more detail.

The R instructions would be similar.
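
To see whether a package is missing on the executor side before relying on it in a UDF, a quick spot check along these lines may help. This is only a sketch: the helper names are made up for illustration, `sc` is assumed to be an existing SparkContext, and the functions are assumed to be defined directly in the notebook so Spark can ship them to the executors. Tasks are not guaranteed to land on every machine of the cluster, so an empty result is a hint, not a proof.

```python
import importlib.util
import socket


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_on_executors(sc, names, num_tasks=8):
    """Run the same check inside Spark tasks and report missing packages per host.

    `sc` is an existing SparkContext (assumption). Only hosts that happen to
    receive a task are checked.
    """
    def probe(_partition):
        # Runs in the Python environment of whichever executor gets the task.
        yield socket.gethostname(), missing_packages(names)

    rdd = sc.parallelize(range(num_tasks), num_tasks)
    results = rdd.mapPartitions(probe).collect()
    return {host: missing for host, missing in results if missing}
```

Calling `missing_packages(["MYPKG"])` in the notebook checks the driver, and `missing_on_executors(sc, ["MYPKG"])` checks the hosts that received a task.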
Level 1
Author
By spark-packages I mean http://spark-packages.org/

In my case, I need the spark-csv package from Databricks.
Level 1
Author
I probably just need to add the package to PYSPARK_SUBMIT_ARGS, but it looks like the variable is overwritten when I do a simple export before running ./bin/dss start
Dataiker
Hi,

Indeed, DSS builds its own PYSPARK_SUBMIT_ARGS. Currently, there is no way to directly manipulate the spark-submit command line. All spark-submit options can also be set through configuration properties (spark.driver*) ... except --packages

At the moment, you won't be able to use the --packages option. Note, however, that this option is a shortcut to retrieve and cache jars from Maven. You could retrieve the jars manually by writing a small ivy.xml, and then add them to spark.driver.extraClassPath
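
Once the jars have been retrieved into a directory, the value for spark.driver.extraClassPath is just the jar paths joined with the platform's classpath separator. A minimal sketch for building that string, assuming all jars sit in a single directory (the helper name and the directory are hypothetical):

```python
import glob
import os


def classpath_from_dir(jar_dir):
    """Join every .jar file found in `jar_dir` into one classpath string."""
    pattern = os.path.join(os.path.abspath(jar_dir), "*.jar")
    jars = sorted(glob.glob(pattern))
    # os.pathsep is ":" on Linux/macOS, the separator Java expects there.
    return os.pathsep.join(jars)
```

For example, `print(classpath_from_dir("/path/to/retrieved/jars"))` prints a string that can be pasted as the property value. The jars have to exist at those paths on the machine running the driver.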

A note though: the DSS Spark API already includes the ability to read DSS datasets, including CSV ones, without the need for any additional packages.

Regards,
Level 1
Author
OK, I see. A temporary solution is to add $PYSPARK_SUBMIT_ARGS to scripts/linked/dss. It would be nice if you could fix this in the next release, because there are a few cool packages on spark-packages.

dataiku-dss-2.2.1/scripts/linked/dss:133: export PYSPARK_SUBMIT_ARGS="$PYSPARK_SUBMIT_ARGS $pySparkSubmitArgs"
Hi,
This workaround no longer works in DSS 4.0 (it only applies to notebooks). Is there any other way to do this? This is a really important feature for us: we often need to import Spark packages, and having to install jars on the cluster is definitely less flexible.
Thanks!
Dataiker

You can try setting the spark.jars.packages option in the relevant Spark configuration. Beware that it will apply to all Spark jobs using this configuration (including Spark notebooks), so it might make these jobs a bit slower to start. Also, for use with notebooks, you should restart DSS after setting it.

Generally speaking, to make libraries / jars available to Spark notebooks, you have a couple of options depending on your use case:

  • From a notebook, you can use the SparkContext’s addJar method

  • On a Spark configuration profile, you can set Spark configuration keys for that:

    • spark.jars to specify jars to be made available to the driver and sent to the executors

    • spark.jars.packages to instead specify Maven packages to be downloaded and made available

    • spark.driver.extraClassPath to prepend to the driver’s classpath

    • spark.executor.extraClassPath to prepend to the executors’ classpath
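
As a rough illustration of the values these keys take: spark.jars and spark.jars.packages are both comma-separated lists, and each Maven package is written as groupId:artifactId:version. The helper below is a hypothetical sketch that only builds the key/value pairs to enter in a Spark configuration profile. It is not part of DSS or Spark, and the spark-csv coordinate in the comment is just an example version.

```python
def spark_jar_conf(packages=(), jars=()):
    """Build Spark configuration entries for Maven packages and local jars."""
    for coord in packages:
        parts = coord.split(":")
        # spark.jars.packages expects groupId:artifactId:version
        if len(parts) != 3 or not all(parts):
            raise ValueError("expected groupId:artifactId:version, got %r" % coord)
    conf = {}
    if packages:
        conf["spark.jars.packages"] = ",".join(packages)
    if jars:
        conf["spark.jars"] = ",".join(jars)
    return conf


# Example (illustrative coordinate for the spark-csv package):
# spark_jar_conf(packages=["com.databricks:spark-csv_2.10:1.5.0"])
```

Outside DSS, the same pairs could be handed to PySpark with `SparkConf().setAll(list(conf.items()))` before the SparkContext is created.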


