For our data-intensive recipes, we use PySpark to distribute calculations on a Kubernetes cluster. However, we also have compute-intensive models (e.g. simulation-based) that we would like to distribute across multiple machines, and my question is whether Spark is still the best way to do that in DSS. Our criteria are:
1. Startup time to begin simulations on the cluster
2. Cost of using the cluster. I'm mainly referring to the container image size, but there may be other aspects here too.
3. Usability/configuration/maintenance. Spark is very simple to use from a recipe, both in code and from the UI, and we'd really like that to be the case for any alternative as well (a rough sketch of what this looks like for us today follows below).
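
For reference, here's a minimal sketch of how we fan simulations out with PySpark today, assuming a DSS PySpark recipe; `run_one_simulation`, the parameter grid, and the `simulation_results` dataset name are just placeholders for our real code:

```python
# Simplified sketch of our current PySpark simulation recipe in DSS.
# run_one_simulation and the parameter grid stand in for the real,
# compute-intensive model; "simulation_results" is an illustrative dataset name.
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

def run_one_simulation(params):
    # Placeholder for the actual simulation (in reality this is the expensive part)
    seed, horizon = params
    return Row(seed=seed, horizon=horizon, result=float(seed) * horizon)

# Build a parameter grid and let Spark spread it over the executors on Kubernetes
param_grid = [(seed, horizon) for seed in range(1000) for horizon in (10, 50, 100)]
results_df = sqlContext.createDataFrame(
    sc.parallelize(param_grid, numSlices=200).map(run_one_simulation)
)

# Write the results back to a DSS output dataset
dkuspark.write_with_schema(dataiku.Dataset("simulation_results"), results_df)
```

This pattern works fine for embarrassingly parallel simulations, but the Spark session startup time and the size of the Spark-enabled container image are exactly the overheads behind criteria 1 and 2 above.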