An optimal flow for model evaluation and retraining

ben_p
Level 5

Hi everyone,

I am working on a model flow and looking to add some automated evaluation and retraining. I've built a functional flow, but I feel like it could be slicker. Does anyone have any tips on how I could improve it?

My flow looks like this:

[Image: screenshot of the flow (ben_p_0-1586963030364.png)]

Steps are as follows:

  1. Read training data in from Google BigQuery (Training).
  2. Send this data to Google Cloud Storage (because connecting the BigQuery dataset directly for modelling is painfully slow!).
  3. Split the training data so I have a holdout set for later evaluation.
  4. Train my model on the training data.
  5. Evaluate my model on the holdout dataset, from my earlier split.
  6. Run checks on the model evaluation AUC and kick off a model retrain if it falls below a certain threshold.
  7. Read in my prediction data (as above), and generate predictions.
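
As a rough illustration of step 6, the threshold check could look like the sketch below. It is plain Python and does not use any Dataiku API. The function names (`roc_auc`, `needs_retrain`), the 0.8 threshold, and the sample holdout values are all hypothetical, chosen only to show the logic of "evaluate on the holdout, retrain only when AUC drops".

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive is scored above a
    random negative (ties count as half). Quadratic in the number of rows,
    so only suitable for small illustrative inputs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def needs_retrain(auc: float, threshold: float = 0.8) -> bool:
    """Return True when the holdout AUC is below the (assumed) threshold."""
    return auc < threshold


# Hypothetical holdout labels and model scores, for illustration only.
holdout_labels = [0, 0, 1, 1]
holdout_scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(holdout_labels, holdout_scores)
if needs_retrain(auc):
    print(f"AUC {auc:.2f} is below threshold: trigger a retrain")
else:
    print(f"AUC {auc:.2f} is fine: keep the current model and just predict")
```

With a check like this, the daily run only scores new data, and the training step is triggered when the check fails rather than on every run.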

I want to predict every day, but I don't want to train my model every day.

In the current flow, I need to load all my training data in order to split it and evaluate my model. Is this normal? It means that if I wanted to evaluate my model every day, I would have to load and split my full training dataset each time. Again, is this normal? 🙂

Thanks for your help,

Ben

1 Solution
Alex_Combessie
Dataiker Alumni

Hi,

The logic of the flow fits the use case you describe well.

The challenge you are describing with the time to load data is linked to your specific GCS/BigQuery setup. I see that you have opened another discussion on this specific issue. 

I believe that once you solve this upstream problem, then the rest of the logic of your flow is good to go.

Hope it helps,

Alex


3 Replies
ben_p
Level 5
Author

Thanks Alex, yes indeed, improving the speed does make it much more manageable to evaluate with a higher frequency. Thanks for giving the flow a look over; it's my first evaluation loop, so good to know I'm on the right track!

Ben

Alex_Combessie
Dataiker Alumni

👍

By the way, you may also achieve some processing speed-up by pushing down the computation to BigQuery (BQ). The Prepare recipe can convert most processors to BQ as long as input and output are both in BQ.

The Split recipe cannot push directly to BQ as of today. Instead, you could use two Filter recipes, which can push to BQ. 

That should provide a nice speed-up!
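
To make the two-Filter idea concrete, here is a minimal sketch of how two complementary filter conditions can reproduce a train/holdout split. It is plain Python rather than Dataiku or BigQuery code, and the names (`bucket`, `in_train`, `in_holdout`), the 80/20 ratio, and the `user_…` row ids are assumptions made for illustration. In practice, the same condition would be written as the filter expression of each Filter recipe, using a hash function that the database supports.

```python
import hashlib


def bucket(row_id: str, buckets: int = 10) -> int:
    """Map a row id to a stable bucket in [0, buckets) via a hash."""
    digest = hashlib.md5(row_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def in_train(row_id: str, train_buckets: int = 8) -> bool:
    """Filter 1: keep rows whose bucket is below the cut-off (about 80%)."""
    return bucket(row_id) < train_buckets


def in_holdout(row_id: str, train_buckets: int = 8) -> bool:
    """Filter 2: the exact complement of filter 1 (about 20%)."""
    return not in_train(row_id, train_buckets)


# Hypothetical row ids, for illustration only.
rows = [f"user_{i}" for i in range(1000)]
train = [r for r in rows if in_train(r)]
holdout = [r for r in rows if in_holdout(r)]

print(len(train), len(holdout))
```

Because the condition depends only on the row id, a given row always lands on the same side of the split, so the holdout set stays stable between runs without materialising the split anywhere.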
