DSS Engine vs Spark in Sync Recipe
We are switching all of our recipes from the DSS engine to Spark to reduce the memory impact of the jobs.
However, some jobs crash when run with Spark, so those users have gone back to the DSS engine.
Reading the documentation for the Sync recipe, I see that there are multiple engines, but how do I know when to use the DSS engine and when to use Spark?
Also, why is Spark not working for some users?
Edit: We are using S3 datasets.
Answers
-
Hi,
A first important thing to note is that Spark uses more memory than the DSS engine. When the proper conditions are met, Spark does give you parallelism, but it is much more memory-heavy, including on the DSS host. The DSS engine never loads datasets into memory and has a very small memory footprint (but it is often slower).
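To make that difference concrete, here is a rough analogy in a Python recipe (the dataset names are hypothetical, and the DSS engine is not literally implemented this way): streaming rows one at a time keeps memory flat, whereas materializing the whole dataset as a DataFrame, the way Spark or pandas does, grows with the size of the data.

```python
import dataiku

# Hypothetical dataset names, for illustration only.
src = dataiku.Dataset("customers_s3")
dst = dataiku.Dataset("customers_synced")

# Streaming style (similar in spirit to the DSS engine): rows are read,
# written, and discarded one at a time, so memory usage stays flat.
dst.write_schema(src.read_schema())
with dst.get_writer() as writer:
    for row in src.iter_tuples():
        writer.write_tuple(row)

# In-memory style (alternative, shown commented out): the whole dataset is
# materialized as a pandas DataFrame first, so memory grows with the data.
# df = src.get_dataframe()
# dst.write_with_schema(df)
```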
Generally speaking, Sync recipes will almost always be slower and more brittle with Spark than with the DSS engine, so we almost never recommend the Spark engine for Sync recipes. The reason is that you will almost never sync "from HDFS to HDFS" or "from S3 to S3" (or the like), i.e. the "good cases" for Spark (illustrated in the sketch below). Most syncs go between completely disparate storage kinds (because that's why you sync in the first place), so you lose most of the benefits of Spark while keeping its drawbacks: it is heavier, harder to tune, and generally more prone to errors.
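For contrast, this is the kind of work Spark parallelizes well: both ends are splittable files on the same kind of storage. This is only a sketch of a generic Spark job with hypothetical bucket names, not what the Sync recipe does internally.

```python
from pyspark.sql import SparkSession

# The "good case" for Spark: source and target are both splittable files on
# S3, so each executor can read and write its own partitions in parallel.
spark = SparkSession.builder.appName("s3_to_s3_sync_sketch").getOrCreate()

# Hypothetical bucket names, for illustration only.
df = spark.read.parquet("s3a://source-bucket/events/")
df.write.mode("overwrite").parquet("s3a://target-bucket/events/")

spark.stop()
```

As soon as one end is, say, an SQL database or an uploaded file, the data has to funnel through a much narrower path, and the DSS engine's streaming approach is usually the better fit.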
It is not possible to say in the general sense why your Spark recipes are "crashing" or "not working for some users". Please open support tickets and attach the job diagnosis so that we can review what kind of issues you are experiencing.