How can I avoid a spark.driver.maxResultSize error when running a Visual Analysis?

jkonieczny Registered Posts: 13 ✭✭✭✭

I am attempting to train an ML model using a Visual Analysis and Spark. However, the job fails with the following message:

[10:01:27] [INFO] [dku.utils] - [2018/11/29-10:01:27.734] [task-result-getter-3] [ERROR] [org.apache.spark.scheduler.TaskSetManager] - Total size of serialized results of 714 tasks (2.7 GB) is bigger than spark.driver.maxResultSize (2.0 GB)

This must mean that the job is collecting results into the driver process, but I am not sure what exactly it is collecting. Can I configure the Visual Analysis to not collect any results? Is there a way other than increasing spark.driver.maxResultSize to resolve this issue?

Answers

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Hi,

    Spark MLlib is a distributed ML library that requires a lot of technical tuning compared to other methods. In this case, fortunately, the Spark error message tells you which parameter to tune. Because Visual Analysis needs to collect results from MLlib to analyze model performance, I advise following this recommendation and increasing spark.driver.maxResultSize progressively, 1 GB at a time.

    Note that if your training set fits in your server's memory, I would recommend using scikit-learn/XGBoost instead. It does not require the advanced tuning of MLlib and usually performs better (as more algorithms are available).

    Hope it helps,

    Alex
  • jkonieczny
    jkonieczny Registered Posts: 13 ✭✭✭✭
    Thank you Alex, that helps. One follow-up question: does selecting the option "Skip expensive reports" reduce the amount of data collected by the Visual Analysis?
  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Indeed, you can try that, but note that it would disable some model performance screens.
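
To illustrate the advice in the first answer (raise spark.driver.maxResultSize 1 GB at a time), here is a minimal Python sketch that reads the limit out of the error line quoted in the question and suggests the next value to try. The helper name `next_max_result_size` and the `step_gb` parameter are hypothetical, made up for this example; they are not part of Spark or DSS. Where you actually enter the resulting value depends on how Spark is configured in your setup.

```python
import math
import re

# Matches the two sizes in Spark's TaskSetManager error message:
# the total serialized result size and the current limit.
PATTERN = re.compile(
    r"\(([\d.]+) GB\) is bigger than spark\.driver\.maxResultSize \(([\d.]+) GB\)"
)


def next_max_result_size(log_line, step_gb=1):
    """Hypothetical helper: return the next limit to try, e.g. '3g'."""
    match = PATTERN.search(log_line)
    if match is None:
        raise ValueError("no maxResultSize error found in log line")
    current_limit_gb = float(match.group(2))
    # Raise the current limit by one step (1 GB by default).
    return f"{math.ceil(current_limit_gb) + step_gb}g"


log = (
    "Total size of serialized results of 714 tasks (2.7 GB) is bigger than "
    "spark.driver.maxResultSize (2.0 GB)"
)
suggested = next_max_result_size(log)
print(f"spark.driver.maxResultSize={suggested}")
```

For the error in the question this suggests 3g. Keep in mind that the collected results live in the driver process, so the driver's own memory (spark.driver.memory) probably needs enough headroom for whatever limit you choose.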