Spark packages with DSS?

q666 Registered Posts: 11 ✭✭✭✭
How can I add Spark packages so that they will be available in my recipes and notebooks?

Best Answer

  • AdrienL Dataiker, Alpha Tester Posts: 196 Dataiker
    Answer ✓

    You can try setting the spark.jars.packages option on the desired Spark configuration. Beware that it will be applied to all Spark jobs using this configuration (including Spark notebooks), so it might make the startup of these jobs a bit slower. Also, for use with notebooks, you should restart DSS after setting this.

    Generally speaking, to make libraries/jars available to Spark notebooks, you have a couple of options depending on your use case (a small sketch follows the list below):

    • From a notebook, you can use the SparkContext’s addJar method
    • On a spark configuration profile, you can set some spark configuration keys for that:
      • spark.jars to specify jars to be made available to the driver and sent to the executors
      • spark.jars.packages to instead specify Maven packages to be downloaded and made available
      • spark.driver.extraClassPath to prepend to the driver’s classpath
      • spark.executor.extraClassPath to prepend to the executors' classpath
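
    For illustration, here is a minimal PySpark sketch of how these keys could be set when building a SparkContext by hand. In DSS you would normally put them in the Spark configuration profile instead, and the package coordinates and jar paths below are only placeholders:

        from pyspark import SparkConf, SparkContext

        conf = (SparkConf()
                # Comma-separated Maven coordinates (groupId:artifactId:version) -- example package
                .set("spark.jars.packages", "com.databricks:spark-csv_2.11:1.5.0")
                # Local jars to make available to the driver and ship to the executors -- hypothetical path
                .set("spark.jars", "/path/to/extra-lib.jar")
                # Entries prepended to the driver's / executors' classpaths -- hypothetical path
                .set("spark.driver.extraClassPath", "/path/to/extra-lib.jar")
                .set("spark.executor.extraClassPath", "/path/to/extra-lib.jar"))

        sc = SparkContext(conf=conf)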

Answers

  • Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker
    Hi,

    By Spark packages, I assume you mean Python (or R) packages that you want to use within PySpark notebooks and recipes.

    There are two cases to distinguish:

    * If you are only using the packages on the driver part of your Spark job (for example to post-process the output of a Spark DataFrame), then you simply need to follow the regular Python package install procedure: see https://doc.dataiku.com/dss/latest/installation/python.html

    * If you want to use these packages on the executor side (in a function used as a UDF to actually perform processing on the Spark DataFrame), then the package must be available in the Python environment used by your executors. By default, that means the system Python *of each machine of your Spark cluster*. In other words, by default, you would need to "sudo pip install MYPKG" on all machines of your cluster (a sketch of this case follows below).

    http://spark.apache.org/docs/latest/configuration.html#environment-variables has more details.
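
    As a minimal sketch of the executor-side case (numpy is used purely as an example package, and sqlContext is assumed to already exist, as in a PySpark notebook): the function wrapped as a UDF below runs on the executors, so the module it imports must be installed in the executors' Python environment, not only on the driver.

        from pyspark.sql.functions import udf
        from pyspark.sql.types import DoubleType

        def square_with_numpy(x):
            # Imported inside the function, which executes on the executors:
            # numpy must therefore be installed on every machine of the cluster
            import numpy as np
            return float(np.square(x))

        square_udf = udf(square_with_numpy, DoubleType())
        df = sqlContext.createDataFrame([(1.0,), (2.0,)], ["value"])
        df.withColumn("squared", square_udf(df["value"])).show()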

    The R instructions would be similar.
  • q666 Registered Posts: 11 ✭✭✭✭
    By spark-packages I mean http://spark-packages.org/

    In my case I need the spark-csv package from Databricks.
  • q666 Registered Posts: 11 ✭✭✭✭
    Probably I just need to add the package to PYSPARK_SUBMIT_ARGS, but it looks like the argument is overwritten when I do a simple export before running ./bin/dss start.
  • Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker
    Hi,

    Indeed, DSS builds its own PYSPARK_SUBMIT_ARGS. Currently, there is no way to directly manipulate the spark-submit command line. All options of spark-submit can also be set via configuration properties (spark.driver.*, etc.)... except --packages.

    At the moment, you won't be able to use the --packages option. However, note that this option is a shortcut to retrieve and cache jars from Maven. You could retrieve the jars manually (for instance by writing a small ivy.xml), and then add the jars to spark.driver.extraClassPath.

    A note though: the DSS Spark API already includes the ability to read DSS datasets, including CSV ones, without the need for any additional packages.
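
    For reference, a minimal sketch of that pattern, based on the stub DSS generates for PySpark recipes (the dataset name is a placeholder; check the generated recipe for the exact calls in your DSS version):

        import dataiku
        import dataiku.spark as dkuspark
        from pyspark import SparkContext
        from pyspark.sql import SQLContext

        # In a notebook, sc and sqlContext may already be provided
        sc = SparkContext()
        sqlContext = SQLContext(sc)

        # Read a DSS dataset (CSV or otherwise) as a Spark DataFrame -- "mydataset" is a placeholder
        df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("mydataset"))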

    Regards,
  • q666 Registered Posts: 11 ✭✭✭✭
    OK, I see. A temporary solution is to add $PYSPARK_SUBMIT_ARGS to scripts/linked/dss. It would be nice if you could fix this in the next release, because there are a few cool packages on spark-packages.

    dataiku-dss-2.2.1/scripts/linked/dss:133: export PYSPARK_SUBMIT_ARGS="$PYSPARK_SUBMIT_ARGS $pySparkSubmitArgs"
  • Unknown
    Hi,
    This workaround doesn't work anymore in DSS 4.0 (it only applies to notebooks). Do you have any other way to do this? This is really an important feature for us, as we often need to import Spark packages, and having to install jars on the cluster is definitely less flexible.
    Thanks!