An optimal flow for model evaluation and retraining

ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

Hi everyone,

I am working on a model flow and looking to add some automated evaluation and retraining. I've built a functional flow, but I feel like it could be slicker. Does anyone have any tips on how I could improve it?

My flow looks like this:

[Image: screenshot of the flow, ben_p_0-1586963030364.png]

Steps are as follows:

  1. Read training data in from Google BigQuery (Training).
  2. Send this data to Google Cloud Storage (because connecting the BigQuery dataset directly for modelling is painfully slow!).
  3. Split the training data so I have a holdout set for later evaluation.
  4. Train my model on the training data.
  5. Evaluate my model on the holdout dataset, from my earlier split.
  6. Run checks on the model evaluation AUC and kick off a model retrain if below a certain threshold.
  7. Read in my prediction data (as above), and generate predictions.
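
As a rough illustration of step 6, the check boils down to computing AUC on the holdout predictions and comparing it with a threshold. The sketch below is plain Python and does not use any Dataiku API; the function names and the `AUC_THRESHOLD` value are hypothetical, and in DSS this logic would more likely live in the metrics, checks and scenario configuration than in hand-written code.

```python
# Minimal sketch of an "evaluate, then retrain if AUC is too low" check.
# All names and the threshold value are hypothetical, for illustration only.

AUC_THRESHOLD = 0.75  # assumed cut-off; choose one that suits your model


def auc_score(labels, scores):
    """Rank-based AUC: the share of (positive, negative) pairs in which the
    positive example gets the higher score. Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def needs_retrain(labels, scores, threshold=AUC_THRESHOLD):
    """Return True when the holdout AUC has dropped below the threshold."""
    return auc_score(labels, scores) < threshold


# Toy holdout set: true labels and predicted probabilities.
holdout_labels = [1, 1, 0, 0]
holdout_scores = [0.9, 0.4, 0.5, 0.1]

if needs_retrain(holdout_labels, holdout_scores):
    print("AUC below threshold: trigger a retrain")
else:
    print("AUC acceptable: keep the current model")
```

Because the check only needs the holdout labels and the model's scores, it can run on a schedule that is independent of the daily prediction step.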

I want to predict every day, but I don't want to train my model every day.

In this current flow I need to load all of my training data in order to split it and evaluate my model. Is this normal? It means that if I wanted to evaluate my model every day, I would have to load and split my full training dataset each time.

Thanks for your help,

Ben

Best Answer

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Answer ✓

    Hi,

    The logic of the flow fits the use case you describe well.

    The challenge you are describing with the time to load data is linked to your specific GCS/BigQuery setup. I see that you have opened another discussion on this specific issue.

    I believe that once you solve this upstream problem, then the rest of the logic of your flow is good to go.

    Hope it helps,

    Alex

Answers

  • ben_p
    ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

    Thanks Alex. Yes indeed, improving the speed does make it much more manageable to evaluate at a higher frequency. Thanks for giving the flow a look over; it's my first evaluation loop, so it's good to know I'm on the right track!

    Ben

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭

    By the way, you may also achieve some processing speed-up by pushing down the computation to BigQuery (BQ). The Prepare recipe can convert most processors to BQ as long as input and output are both in BQ.

    The Split recipe cannot push directly to BQ as of today. Instead, you could use two Filter recipes, which can push to BQ.

    That should provide a nice speed-up!
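
To illustrate the two-Filter idea: a split can be expressed as two complementary conditions on the same deterministic value, so each Filter recipe selects one side and no row lands in both. The sketch below shows that logic in plain Python, assuming each row has a stable key column (here a hypothetical `id`); the 80/20 ratio and all function names are assumptions, and inside BigQuery the same condition would be written with a SQL hash function rather than `hashlib`.

```python
import hashlib

# Minimal sketch of a train/holdout split written as two complementary
# filters. The key column "id" and the 80/20 ratio are assumptions.

TRAIN_PERCENT = 80  # hypothetical share of rows sent to training


def bucket(row_key: str) -> int:
    """Map a row key to a stable bucket in the range 0..99."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_train(row_key: str) -> bool:
    """Condition for the first filter: keep rows in the training buckets."""
    return bucket(row_key) < TRAIN_PERCENT


def is_holdout(row_key: str) -> bool:
    """Condition for the second filter: the exact complement of is_train."""
    return bucket(row_key) >= TRAIN_PERCENT


# Toy dataset standing in for the BigQuery table.
rows = [{"id": f"user_{i}"} for i in range(1000)]

train = [r for r in rows if is_train(r["id"])]
holdout = [r for r in rows if is_holdout(r["id"])]

print(len(train), len(holdout))
```

Hashing a key instead of drawing random numbers means both filters give the same answer on every run, so a row never moves between the training and holdout sets when the flow is rebuilt.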
