Horizontal or vertical scaling of Dataiku on small datasets?
I am looking for a bit of advice here.
Technically, I'm handling data sources of between 5 and 20 GB. Let's say you have 10 customers, each with about 5 data pipelines that run every 10 minutes. Each job takes a dataset of around 1 GB, and we would run 10 jobs concurrently. In short, our workload is small in data volume but large in the number of jobs.
Now let's say the machine that runs Dataiku has 24 GB of RAM and 6 cores at 2.8-3.1 GHz. I imagine this would do the job just fine.
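As a rough back-of-the-envelope sketch of those numbers: the snippet below assumes one job per pipeline run and a hypothetical in-memory overhead factor of 2x the dataset size. Neither assumption is measured, so treat the results as an order-of-magnitude check only.

```python
# Back-of-the-envelope sizing for the workload described above.
# Figures taken from the post:
CUSTOMERS = 10
PIPELINES_PER_CUSTOMER = 5
SCHEDULE_SECONDS = 10 * 60      # every pipeline runs every 10 minutes
CONCURRENT_JOBS = 10
DATASET_GB = 1.0
MACHINE_RAM_GB = 24
MACHINE_CORES = 6

# Assumption (hypothetical, not measured): a dataset loaded in memory
# costs about twice its size.
MEMORY_OVERHEAD_FACTOR = 2.0

# Assumption: each pipeline run is exactly one job.
jobs_per_window = CUSTOMERS * PIPELINES_PER_CUSTOMER

# Memory needed if all concurrent jobs hold their dataset at once.
peak_memory_gb = CONCURRENT_JOBS * DATASET_GB * MEMORY_OVERHEAD_FACTOR

# Longest average job duration that still lets every job finish
# inside one schedule window with the given concurrency.
max_job_seconds = SCHEDULE_SECONDS * CONCURRENT_JOBS / jobs_per_window

# Jobs competing for each core at full concurrency.
jobs_per_core = CONCURRENT_JOBS / MACHINE_CORES

print(f"jobs per 10-minute window: {jobs_per_window}")
print(f"peak memory: {peak_memory_gb} GB of {MACHINE_RAM_GB} GB")
print(f"max average job duration: {max_job_seconds} s")
print(f"jobs per core: {jobs_per_core:.2f}")
```

Under these assumptions the machine fits the memory, but each job has to finish in about two minutes on average, with more than one job per core.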
But when the number of customers (each with the same number of data pipelines) grows, and with it the instance's parallel workload, how should we go about scaling? I can imagine that, given the dataset sizes, vertical scaling would be recommended. But then we'd have to manually scale up and down every time we connect or disconnect customers, resulting in downtime, which isn't really desired. Otherwise we'd have to overprovision in advance to avoid constantly upscaling, which wastes resources and funds.
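To make the growth concrete, here is a minimal sketch of how the required concurrency scales with the customer count. The average job duration of 120 seconds and the capacity of 10 concurrent jobs per machine are assumptions for illustration, and `required_slots` and `machines_needed` are hypothetical helper names.

```python
PIPELINES_PER_CUSTOMER = 5
SCHEDULE_SECONDS = 10 * 60   # pipelines run every 10 minutes
JOB_SECONDS = 120            # assumed average job duration (hypothetical)
SLOTS_PER_MACHINE = 10       # assumed concurrent jobs one machine can hold


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return (a + b - 1) // b


def required_slots(customers: int) -> int:
    """Concurrent job slots needed so all jobs finish within one window."""
    busy_seconds = customers * PIPELINES_PER_CUSTOMER * JOB_SECONDS
    return ceil_div(busy_seconds, SCHEDULE_SECONDS)


def machines_needed(customers: int) -> int:
    """Number of equally sized machines needed for that many slots."""
    return ceil_div(required_slots(customers), SLOTS_PER_MACHINE)


for customers in (10, 25, 50):
    print(customers, required_slots(customers), machines_needed(customers))
```

With these assumed figures the slot count grows linearly with customers, so a single machine would need to be resized at every step, whereas a pool of nodes only changes when a multiple of the per-machine capacity is crossed.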
So I was thinking of parallelizing by adding Spark nodes instead. But Spark is not really optimized for small workloads: it was designed for big data problems and brings overhead that might make it much slower than running everything in memory. Yet the option to easily add and remove nodes to reflect the number of customers is very efficient and cost-effective.
The other solution would be a similar setup where we add nodes and run Celery tasks on them. But the built-in Dataiku recipes do not work with these Celery servers. They are, however, compatible with Spark.
My question is: given that Dataiku offers Spark as an engine, and given the use case above, would you still recommend scaling vertically, or starting out with some form of horizontal scaling? And if so, what performance impact or gain should we expect?
I realize this is a somewhat opinionated question, but since it relates directly to using Dataiku, I hope I can get some direction here.
The core principle of Dataiku is to offload the computations to external databases or clusters. This is documented in detail in this article: https://www.dataiku.com/learn/guide/getting-started/dss-concepts/where-does-it-all-happen.html
So the answer really depends on which database/cluster technology you want to use. In the case of Hadoop, it is quite straightforward to scale by adding nodes. The same applies to most SQL/NoSQL databases. If you work in the cloud, there are many options for autoscaling.
Note that we added Kubernetes as a compute engine for Python, R and Machine Learning jobs, so you can also have a look at scaling your Kubernetes clusters.
I was about to ask how to scale with Python recipes, since I assume that a Python recipe always runs on the Dataiku machine. Do I need to containerize these applications on Kubernetes to scale their workloads to external resources?
In order to scale beyond the Dataiku machine, you will need an external Kubernetes cluster. This will allow you to push all Python jobs (including Dataiku visual ML with Python backend) to containers running in your Kubernetes cluster. The details of the integration are explained here: https://doc.dataiku.com/dss/latest/apinode/kubernetes/index.html