For our data-intensive recipes, we use PySpark to distribute calculations on a Kubernetes cluster. However, there are compute-intensive models (e.g. simulation-based) that we would also like to distribute across multiple machines, and my question is whether Spark is still the best way to do that in DSS for this kind of workload. Our criteria are:
1. Startup time to begin simulations on the cluster
2. Costs of using the cluster. I'm mainly referring to the container image size, but maybe there are other aspects here, too.
3. Usability/configuration/maintenance. Spark is very simple to use from a recipe, both in the code and from the UI, and we'd really like that to be the case for any other technology.
4. Anything else important that I'm missing?
Thanks in advance!
There are a few Python-based frameworks for distributed computation, such as Dask or Ray, but in the realm of data science, Spark remains the industry standard, which is why it has this first-class-citizen integration in the Dataiku platform. All the Spark-related features were designed specifically for Spark itself, not with the goal of plugging arbitrary distributed computing frameworks into Dataiku. In practice, it's not impossible to make other frameworks work; however, it will require a substantial amount of additional work.
Do you have specific tools and/or use cases in mind that you may want to share?
Thanks for your reply. In the meantime, we managed to get Spark to work for our needs. It required some trickery: a typical Spark problem performs relatively simple computations on massive data, whereas our case is the opposite — little data and lengthy computations — so Spark wanted to execute everything on a single node. However, forcing some settings did the trick. We are good now.