DSS Memory Optimization Tips: Backend, Python/R, Spark Jobs
Users running DSS jobs and processes sometimes encounter one of the following errors:
“OutOfMemoryError: Java Heap Space” or “GC overhead limit exceeded”
This happens when a Java process (such as the DSS backend or a Spark job) exceeds its maximum memory allocation, known as the Xmx (the maximum Java heap size).
This article explains how DSS allocates memory to its various processes and offers some recommendations for optimizing it.
A) DSS Processes
First, it is important to remember the key moving parts of a Dataiku DSS system: the backend, the main Java process that serves the UI and orchestrates the instance, and one JEK (job execution kernel) Java process per running job.
- If the backend often crashes with out-of-memory errors, you can increase the backend.xmx value in the install.ini file in the DSS data directory (see the DSS documentation on install.ini), as sketched after this list.
- If you see JEK crashes due to memory errors, you may need to increase the JEK's Xmx from the default of 2g to 3g or 4g, but very rarely more.
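For example, a minimal install.ini sketch. The values are illustrative and should be sized to your instance; backend.xmx is the key mentioned above, and jek.xmx is assumed here to be the corresponding key for the JEK heap. Restart DSS after editing the file so the change takes effect:

    [javaopts]
    # Maximum Java heap of the main DSS backend process
    backend.xmx = 6g
    # Maximum Java heap of each job execution kernel (JEK)
    jek.xmx = 3g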
B) Spark Jobs
- A single Java process called the driver, which does mostly orchestration, and sometimes needs memory for collecting local results. The Xmx of the driver is controlled by the Spark configuration key spark.driver.memory and rarely needs to be changed.
- Many other Java processes called the executors which do the actual work. Their default memory allocation is 2g in Spark, 2.4g in DSS.
If you encounter "Lost task ... java.lang.OutOfMemoryError: GC overhead limit exceeded" or "Lost task ... java.lang.OutOfMemoryError: Java Heap space" errors while running Spark jobs, you need to increase spark.executor.memory.
Since YARN also takes into account the executor memory plus its overhead, if you increase spark.executor.memory a lot, don't forget to also increase spark.yarn.executor.memoryOverhead accordingly, as in the sketch below.
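As an illustrative sketch, these are the two standard Spark configuration keys involved. The values are placeholders to adapt to your cluster; note that on Spark 2.3+ the overhead key is spark.executor.memoryOverhead instead, and that by default the overhead is the larger of 384m and 10% of the executor memory:

    # Heap allocated to each Spark executor
    spark.executor.memory                4g
    # Off-heap overhead that YARN accounts for on top of the executor heap
    spark.yarn.executor.memoryOverhead   800m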
C) Python / R recipes and notebooks
When DSS runs a Python or R recipe, a corresponding Python or R process is created, and logs for these processes appear directly in the job logs.
When a user creates a notebook, a specific process is created (Python process for Python notebooks, R process for R notebooks...) that holds the actual computation state of the notebook, and is called a “Jupyter kernel”.
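If you suspect a recipe or kernel is consuming too much memory, a minimal check from inside the process, using only the Python standard library (note that ru_maxrss is reported in kilobytes on Linux but in bytes on macOS):

    import resource

    # Peak resident set size of the current Python process
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"Peak RSS: {peak_kb / 1024:.1f} MB")  # kilobytes on Linux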
Memory allocation for kernels and recipe processes can be controlled using the cgroups integration capabilities of DSS. This allows you to restrict the usage of memory or CPU by different processes, for example by process type, user or project...
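To make the mechanism concrete, here is a rough illustration of what a cgroups memory limit looks like at the OS level, assuming a Linux host with cgroups v1 and root privileges. The group path and the 4 GB limit are purely hypothetical, and DSS manages all of this automatically when cgroups integration is enabled:

    import os
    from pathlib import Path

    # Hypothetical cgroup for DSS recipe processes
    cg = Path("/sys/fs/cgroup/memory/DSS/recipes")
    cg.mkdir(parents=True, exist_ok=True)

    # Cap the memory of all processes in this group at 4 GB
    (cg / "memory.limit_in_bytes").write_text(str(4 * 1024**3))

    # Place the current process into the group
    (cg / "cgroup.procs").write_text(str(os.getpid()))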
See the detailed documentation for cgroups integration.
N.B.: This does not apply if you are using containerized execution for the recipes or notebooks.