Hi, I was looking to get DSS integrated with Spark, but maybe I'm not quite understanding something. We have a server with PySpark on it, and I think the idea is to get DSS to connect to that. However, looking at the DSS documentation, it seems like DSS expects Spark to be local to the DSS server. Am I missing something?
In all cases, Spark needs to be installed on the DSS server. That does not mean any Spark daemon has to run there, just that the Spark code is present.
Then, depending on the job submission parameter "spark.master", job executors can run either locally or in a Hadoop cluster. In the first mode, the whole Spark job runs in a single JVM launched by the backend, so it's not really distributed computing, but apart from that it works the same. In the second mode, the Spark job is driven from a JVM launched by the backend, but all real data processing work is done in several JVMs running on the Hadoop worker nodes.
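To make the role of "spark.master" concrete, here is a minimal sketch that maps a master value to the modes discussed in this answer. The helper name describe_master is made up for illustration and is not part of DSS or Spark. It also assumes the Hadoop cluster is reached through YARN, which is the usual case but is not stated above.

```python
import re

# Hypothetical helper, not a DSS or Spark API: classify a spark.master value.
_LOCAL_PATTERN = re.compile(r"local(\[(\*|\d+)(,\s*\d+)?\])?")


def describe_master(master: str) -> str:
    """Return 'local', 'hadoop', 'standalone' or 'unknown' for a spark.master value."""
    value = master.strip()
    # "local", "local[4]", "local[*]": everything runs in one JVM on this machine.
    if _LOCAL_PATTERN.fullmatch(value):
        return "local"
    # Assumption: the Hadoop cluster is accessed through YARN.
    # "yarn-client" and "yarn-cluster" are the older spellings of the same thing.
    if value in ("yarn", "yarn-client", "yarn-cluster"):
        return "hadoop"
    # "spark://host:port" points at a Spark standalone master.
    if value.startswith("spark://"):
        return "standalone"
    return "unknown"


if __name__ == "__main__":
    # The standalone host name below is a placeholder.
    for candidate in ("local[*]", "yarn", "spark://sparkmaster.example.com:7077"):
        print(candidate, "->", describe_master(candidate))
```

Whatever the value, the driver JVM is still launched on the DSS server; only the place where the executors run changes.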
Of course, for the second mode to be possible, the local DSS machine must have access to the Hadoop cluster, and the local Spark installation must have been configured so that it knows how to contact this Hadoop cluster (i.e. have the Hadoop client code in its classpath, as well as the Hadoop configuration files).
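A quick way to sanity-check the configuration-file part on the DSS machine is sketched below. It relies on the fact that Spark looks for the Hadoop client configuration in the directory named by HADOOP_CONF_DIR or YARN_CONF_DIR. The function name and the list of expected files are assumptions made for this example; your distribution may need more or fewer files.

```python
import os
from pathlib import Path

# Assumption: these are the files a typical Hadoop client setup provides.
EXPECTED_FILES = ("core-site.xml", "hdfs-site.xml", "yarn-site.xml")


def find_hadoop_conf(env=None):
    """Return (conf_dir, missing_files).

    conf_dir is None when neither HADOOP_CONF_DIR nor YARN_CONF_DIR is set.
    """
    env = os.environ if env is None else env
    conf_dir = env.get("HADOOP_CONF_DIR") or env.get("YARN_CONF_DIR")
    if not conf_dir:
        return None, list(EXPECTED_FILES)
    # Report every expected file that is absent from the configuration directory.
    missing = [name for name in EXPECTED_FILES
               if not (Path(conf_dir) / name).is_file()]
    return conf_dir, missing


if __name__ == "__main__":
    conf_dir, missing = find_hadoop_conf()
    if conf_dir is None:
        print("Neither HADOOP_CONF_DIR nor YARN_CONF_DIR is set")
    elif missing:
        print("Missing in", conf_dir, ":", ", ".join(missing))
    else:
        print("Hadoop client configuration found in", conf_dir)
```

This only checks that the files exist, not that their contents point at the right cluster.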
A third mode, not often used in production, would indeed be to have a Spark standalone cluster up and running somewhere, accessible to the DSS server. Then again, it should just be a matter of correctly configuring the spark.master configuration variable so that the Spark jobs launched by DSS can run tasks on this cluster.
It's not a mode that we really test, but there is no reason it should not work.
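If you try the standalone route, it may help to first confirm that the master is actually reachable from the DSS server. The sketch below parses a "spark://host:port" value and attempts a TCP connection. The function names and the host name are placeholders, and 7077 is used as a fallback only because it is the default port of a Spark standalone master. It handles a single master address, not a comma-separated high-availability list.

```python
import socket
from urllib.parse import urlparse

DEFAULT_STANDALONE_PORT = 7077  # default port of a Spark standalone master


def parse_standalone_master(master: str):
    """Split 'spark://host:port' into (host, port); raise ValueError otherwise."""
    parsed = urlparse(master)
    if parsed.scheme != "spark" or not parsed.hostname:
        raise ValueError(f"not a standalone master URL: {master!r}")
    return parsed.hostname, parsed.port or DEFAULT_STANDALONE_PORT


def master_is_reachable(master: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the standalone master can be opened."""
    host, port = parse_standalone_master(master)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


if __name__ == "__main__":
    # Placeholder address: replace it with your own standalone master.
    url = "spark://sparkmaster.example.com:7077"
    print(url, "reachable:", master_is_reachable(url))
```

A successful connection only shows that the port is open. The executors on the cluster must also be able to connect back to the driver running on the DSS server.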