I am looking for a bit of advice here.
Technically, I'm handling data sources between 5 and 20 GB. Let's say we have 10 customers, each with about 5 data pipelines that run every 10 minutes. Each job takes a dataset of around 1 GB, and we run 10 jobs concurrently. In short, our workload isn't big on data volume, but it is big on the number of jobs.
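To make the sizing concrete, here's the back-of-envelope math I'm working from (the 2x in-memory inflation factor is my own assumption, not a measurement):

```python
# Rough sizing for the workload described above.
customers = 10
pipelines_per_customer = 5
concurrent_jobs = 10
dataset_gb = 1.0
inflation_factor = 2.0  # assumed in-memory blow-up per dataset (intermediates, copies)

peak_memory_gb = concurrent_jobs * dataset_gb * inflation_factor
jobs_per_cycle = customers * pipelines_per_customer  # jobs per 10-minute window

print(f"peak working memory: ~{peak_memory_gb:.0f} GB")  # ~20 GB
print(f"jobs per 10-minute window: {jobs_per_cycle}")    # 50
```

So at today's scale the peak working set fits in the box below, but every new customer adds 5 jobs per cycle, which is what drives the scaling question.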
Now let's say the machine running Dataiku has 24 GB of RAM and 6 cores at 2.8-3.1 GHz. I imagine this would do the job just fine.
But when the number of customers (each with the same number of data pipelines) grows, and with it the instance's parallel workload, how should we go about scaling? Given the dataset sizes, I can imagine vertical scaling would be the usual recommendation. But then we'd have to manually scale up and down every time we connect or disconnect a customer, which means downtime we'd rather avoid. The alternative is overprovisioning in advance to minimize repeated upscaling, which wastes resources and money.
So I was thinking of handling the parallelization by adding Spark nodes instead. But Spark isn't really optimized for small workloads; it was designed for big-data problems and brings overhead that might make jobs much slower than running everything in memory on a single machine. On the other hand, the ability to easily add and remove nodes to match the number of customers would be very efficient and cost-effective.
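For what it's worth, part of Spark's small-job overhead can be tuned away with configuration. Something like the following fragment is what I'd expect to start from (the property names are standard Spark settings; the values are guesses for ~1 GB datasets, not tuned numbers):

```
# spark-defaults.conf style fragment -- illustrative values only
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   0
spark.dynamicAllocation.maxExecutors   10
spark.executor.memory                  2g
spark.executor.cores                   2
spark.sql.shuffle.partitions           8    # the default of 200 is wasteful for ~1 GB inputs
```

Even so, this only reduces the per-job overhead; it doesn't remove the JVM startup and scheduling costs entirely, which is why I'm unsure Spark is the right fit.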
The other solution would be a similar setup of added nodes, but running Celery tasks on them instead. However, the built-in Dataiku recipes don't work with Celery workers; they are compatible with Spark.
My question: given that Dataiku offers Spark as an engine, and given the use case above, would you still recommend scaling vertically, or should we start out with some form of horizontal scaling? And if the latter, what performance impact or gain should we expect?
I realize this is a somewhat opinionated question, but since it relates directly to using Dataiku, I hope I can get some direction here.