How can I avoid a spark.driver.maxResultSize error when running a Visual Analysis?

jkonieczny · November 2018

I am attempting to train an ML model using a Visual Analysis and Spark. However, the job fails with the following message:

[10:01:27] [INFO] [dku.utils] - [2018/11/29-10:01:27.734] [task-result-getter-3] [ERROR] [org.apache.spark.scheduler.TaskSetManager] - Total size of serialized results of 714 tasks (2.7 GB) is bigger than spark.driver.maxResultSize (2.0 GB)

This must mean that the job is collecting results into the driver process, but I am not sure what exactly it is collecting. Can I configure the Visual Analysis to not collect any results? Is there a way other than increasing spark.driver.maxResultSize to resolve this issue?

Alex_Combessie · December 2018

Hi,

Spark MLlib is a distributed ML library that requires a lot of technical tuning compared to other methods. In this case, fortunately, the Spark error message tells you which parameter to tune. Because Visual Analysis needs to collect results from MLlib to analyse the model's performance, I advise following this recommendation and increasing spark.driver.maxResultSize progressively, 1 GB at a time.
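To make the "1 GB at a time" advice concrete, here is a minimal sketch that reads the sizes out of the error line quoted above and lists the values to try in order. The helper name suggest_max_result_sizes is hypothetical, and the parsing assumes the message reports both sizes in GB, as it does in this thread. Spark accepts size strings such as "3g" for spark.driver.maxResultSize. Since the reported total may only cover the tasks that had finished when the job was aborted, the last suggested value may still need another step up.

```python
import math
import re

# Matches the two sizes in the error line quoted in this thread, e.g.
# "... of 714 tasks (2.7 GB) is bigger than spark.driver.maxResultSize (2.0 GB)"
# Assumption: both sizes are reported in GB.
_PATTERN = re.compile(
    r"tasks \((?P<total>[\d.]+) GB\) is bigger than "
    r"spark\.driver\.maxResultSize \((?P<limit>[\d.]+) GB\)"
)


def suggest_max_result_sizes(log_line, step_gb=1):
    """Hypothetical helper: list spark.driver.maxResultSize values to try in order."""
    match = _PATTERN.search(log_line)
    if match is None:
        raise ValueError("not a spark.driver.maxResultSize error line")
    total = float(match.group("total"))   # size of results collected so far
    limit = float(match.group("limit"))   # current maxResultSize setting
    candidates = []
    current = math.floor(limit) + step_gb
    # Step upward until the candidate exceeds the reported total.
    while True:
        candidates.append(f"{current}g")
        if current > total:
            break
        current += step_gb
    return candidates


line = ("Total size of serialized results of 714 tasks (2.7 GB) "
        "is bigger than spark.driver.maxResultSize (2.0 GB)")
print(suggest_max_result_sizes(line))
```

Each suggested value would be set as the spark.driver.maxResultSize key in whichever Spark configuration the training job uses, then the training re-run.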

Note that if your training set fits in your server's memory, I would recommend using scikit-learn/XGBoost instead. It does not require the advanced tuning of MLlib and usually performs better (as more algorithms are available).

Hope it helps,

Alex

jkonieczny · December 2018

Thank you Alex, that helps. One follow-up question: does selecting the option "Skip expensive reports" reduce the amount of data collected by the Visual Analysis?

Alex_Combessie · December 2018

Indeed, you can try that, but note that it would disable some model performance screens.
