EKS and Memory Usage in DATAIKU
Hi Team,
I just want to know about memory usage in Dataiku and how EKS is being used by Dataiku.
1. Let's say I have 100 users using Dataiku installed on a single EC2 node with 100 GB of memory.
One of the users is running a job to process 100 GB of data, and after a few minutes another user starts a job. In this case, will the second job be queued until the first job is finished?
2. For the above point, would EKS resolve the problem, since it has auto-scaling and multi-node processing capabilities?
3. Let's say I created multiple Dataiku user groups based on roles and responsibilities, and each user group is assigned to a project. Can I measure memory usage per user group, i.e. how much memory a particular user group is consuming out of the total memory?
4. Is there any KPI available for the above point, or do we have to create our own?
Thanks in advance
Answers
Hi,
These are fairly advanced sizing, capacity planning and architecture questions, so we would recommend getting in touch with either your Dataiku Customer Success Manager, Technical Account Manager or Partner Manager for a deeper dive into your specific needs. These are often questions for which there is no single answer, as it can depend on the specifics of your case.
1. Let's say I have 100 users using Dataiku installed on a single EC2 node with 100 GB of memory.
One of the users is running a job to process 100 GB of data, and after a few minutes another user starts a job. In this case, will the second job be queued until the first job is finished?
It depends a lot on the type of jobs. Most jobs in Dataiku do not load the entire data in memory.
For example, for visual (yellow) recipes:
- First of all, if you are running on SQL datasets, visual recipes will run in the database
- If you have Spark, visual recipes will run on Spark
- Even if you don't and they run with DSS engine (i.e. directly on the DSS machine), visual recipes do not load the entire data in memory but stream the data. So even if you have 100 GB of data to process, the job will only consume ~1 GB of memory.
The situation is more nuanced for code recipes, because it depends on what the code does. Many Python and R recipes load the entire data in memory, but of course this depends on how they are written.
However, loading 100 GB of input data in Python or R is enormous and would almost never happen. It is also important to note that "100 GB of input data" often translates to much more than 100 GB of memory (more like 1,000 to 2,000 GB), because the input data is usually compressed.
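As a rough illustration (the dataset name below is hypothetical), here is the difference in a Python recipe between loading a dataset entirely into memory and streaming it in chunks with the dataiku API:
```python
import dataiku

# "my_large_dataset" is a hypothetical dataset name, for illustration only
ds = dataiku.Dataset("my_large_dataset")

# Option A: load everything into one pandas DataFrame.
# Memory usage then grows with the full, uncompressed size of the data.
# df = ds.get_dataframe()

# Option B: process the data in fixed-size chunks instead,
# so memory usage stays roughly proportional to the chunk size.
row_count = 0
for chunk_df in ds.iter_dataframes(chunksize=100000):
    row_count += len(chunk_df)

print("Processed %d rows without holding the full dataset in memory" % row_count)
```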
There is no way for Dataiku to know beforehand whether a code recipe will take a lot of memory. This is why Dataiku provides cgroups integration (https://doc.dataiku.com/dss/latest/operations/cgroups.html), which allows you to protect against runaway user recipes by limiting the amount of memory a job or a user can take.
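To give a sense of the underlying mechanism, here is a minimal sketch of a memory cap at the plain Linux cgroup (v1) level. This is not Dataiku configuration, and the cgroup name is hypothetical; DSS manages the real cgroups for you based on the limits the administrator configures.
```python
import os

# Illustration only (requires root): cap memory with a Linux cgroup v1
CGROUP_DIR = "/sys/fs/cgroup/memory/dss_example"  # hypothetical cgroup name
os.makedirs(CGROUP_DIR, exist_ok=True)

# Cap everything placed in this cgroup at 4 GB of memory
with open(os.path.join(CGROUP_DIR, "memory.limit_in_bytes"), "w") as f:
    f.write(str(4 * 1024 ** 3))

# A process added to the cgroup's "tasks" file is reclaimed / OOM-killed
# by the kernel if it tries to exceed the limit
with open(os.path.join(CGROUP_DIR, "tasks"), "w") as f:
    f.write(str(os.getpid()))
```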
Dataiku does not "queue" jobs based on memory; it queues jobs based on how many jobs are already running.
2. For the above point, would EKS resolve the problem, since it has auto-scaling and multi-node processing capabilities?
Leveraging Kubernetes for containerized execution indeed allows you to scale out the execution of many jobs:
- Python and R recipes and notebooks
- Machine learning
- Webapps
- Visual recipes running with Spark engine
- Pyspark recipes and notebooks
It indeed allows these workloads to "tap" into a potentially infinite pool of memory, subject to the constraints set by the admin on the maximum number of nodes.
So yes, this allows multiple such workloads to run at the same time, consuming a total memory much higher than what's available on the Dataiku instance.
It's however good to remember a few things:
- This does not make each recipe able to use more memory than what's available on a single EKS node (see the quick sketch after this list)
- This does not make these recipes faster or more scalable
- We want to insist again that a Python or R recipe actually loading in memory 100 GB of input data is something very very rare.
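Here is the quick sketch mentioned above, with made-up numbers, to make the difference between the total pool and the per-node ceiling concrete:
```python
# Hypothetical cluster sizing, purely to illustrate the points above
nodes = 10
memory_per_node_gb = 64

# Total memory available to *all* containerized workloads combined
cluster_pool_gb = nodes * memory_per_node_gb          # 640 GB

# Ceiling for any *single* container (e.g. one Python recipe process):
# it must still fit on one node, minus overhead for system pods
single_container_ceiling_gb = memory_per_node_gb      # under 64 GB in practice

print(cluster_pool_gb, single_container_ceiling_gb)
```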
3. Let's say I created multiple Dataiku user groups based on roles and responsibilities, and each user group is assigned to a project. Can I measure memory usage per user group, i.e. how much memory a particular user group is consuming out of the total memory?
Yes, Dataiku includes advanced compute resource usage monitoring that allows you to collate memory and CPU usage by user, group, project, or whatever grouping you want. The concept is that we give you granular resource usage data, which you can then analyze however you want with Dataiku itself.
We have more details here: https://doc.dataiku.com/dss/latest/operations/compute-resource-usage-reporting.html
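For example, once the resource usage data has been dispatched into a dataset, a short Python recipe can break memory usage down per user group. The dataset and column names below are hypothetical; check the linked documentation for the actual schema.
```python
import dataiku

# Hypothetical dataset holding the compute resource usage (CRU) records;
# column names are illustrative only
cru_df = dataiku.Dataset("compute_resource_usage").get_dataframe()

# Total memory attributed to each user group, and its share of overall usage
per_group = cru_df.groupby("user_group")["peak_memory_bytes"].sum()
per_group_share = per_group / per_group.sum()

print(per_group_share.sort_values(ascending=False))
```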
4. Is there any KPI available for the above point, or do we have to create our own?
I am not sure I understand the question here. We do not have "average" or "expected" figures for compute resource usage, as this varies wildly from one customer to another, depending on whether they do a lot of visual recipes, code recipes, Spark, machine learning, ...
We have customers with hundreds of users on a single large node backed by SQL databases and no elastic computation capabilities. We have customers with dozens of EKS nodes. We really have quite a mix, so determining what is best for you would require us to better understand your use case, for which you should reach out to your Dataiku contacts.
Hope this helps,