Checkpointing in Lab Sessions During Model Training

kaa2020 · 06-24-2020

Hi,

I am using Dataiku's Lab interface to build models on some datasets and tune their hyperparameters. The models are trained in-memory, not on Spark.

Unfortunately, the system we run the models on is rather unstable: there are a few crashes a day, and each one kills the whole training session.

Is there a way to checkpoint training sessions for non-Spark model training and resume from where we left off once the system recovers from a crash?

Liev · 06-29-2020

Hi @kaa2020, DSS will reuse the existing train/test split (unless told otherwise), but the models themselves are built from scratch each time, so a crashed session does not pick up where it left off.

Regarding the instability, that is indeed something you should investigate. In the meantime, consider training one model at a time, or searching over fewer hyperparameters, to shorten each training session. This is not ideal, since you would then need to compare results across multiple training sessions, but it might be a way to get some models trained before the system crashes again.
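If moving the search out of the Lab and into your own Python code is an option, the same "small pieces" idea can be taken one step further by saving each result as soon as it finishes. The following is only a minimal sketch of that approach, not a DSS feature or API: `resumable_grid_search`, `train_and_score`, and the results file path are hypothetical names, and `train_and_score` stands in for whatever code trains one model and returns a JSON-serializable score.

```python
import itertools
import json
import os
import tempfile


def load_results(path):
    """Return previously saved results, or an empty dict if none exist yet."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_results(path, results):
    """Write to a temp file, then rename, so a crash mid-write
    cannot leave a half-written results file behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(results, f)
    os.replace(tmp_path, path)


def resumable_grid_search(grid, train_and_score, path):
    """Evaluate every combination in `grid`, skipping those already on disk.

    `train_and_score` is a hypothetical user-supplied function that takes
    a dict of hyperparameters and returns a JSON-serializable score.
    """
    results = load_results(path)
    names = sorted(grid)
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        key = json.dumps(params, sort_keys=True)
        if key in results:
            continue  # this combination finished before the last crash
        results[key] = train_and_score(params)
        save_results(path, results)  # persist after every combination
    return results
```

After a crash, calling `resumable_grid_search` again with the same grid and file path would only retrain the combination that was in progress plus the ones not yet started. The results file needs to live on storage that survives the crash.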

I hope this helps!
