Tutorial | Model monitoring basics#

Get started#

Before we can put a model into production, we need to consider how to monitor it. As we get further in time from the training data, how do we ensure our model stays relevant?

Objectives#

In this tutorial, you will:

Use the Evaluate recipe and a model evaluation store (MES) to monitor model metrics in situations where you do and do not have access to ground truth data.
Conduct drift analysis to interpret how well the model is performing compared to its initial training.
Create a scenario to retrain the model based on a metric collected in the MES.
Create a model monitoring dashboard.

Prerequisites#

To reproduce the steps in this tutorial, you’ll need:

Access to an instance of Dataiku 12+.
The Reverse Geocoding plugin (version 2.1 or above) installed on your Dataiku instance. (This plugin is installed by default on Dataiku Cloud).
Broad knowledge of Dataiku (ML Practitioner + Advanced Designer level or equivalent).

Create the project#

We’ll start from a project that includes a basic classification model and a zone for scoring new, incoming data.

From the Dataiku Design homepage, click +New Project > DSS tutorials > MLOps Practitioner > Model Monitoring Basics.
From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Use case summary#

You’ll work with a simple credit card fraud use case. Using data about transactions, merchants, and cardholders, we have a Flow including a model that predicts which transactions should be authorized and which are potentially fraudulent.

A score of 1 for the target variable, authorized_flag, represents an authorized transaction.
A score of 0, on the other hand, is a transaction that failed authorization.

Putting this model into production can enable two different styles of use cases commonly found in machine learning workflows:

Scoring framework	Example
Batch	A bank employee creates a monthly fraud report.
Real-time	A bank’s internal systems authorize each transaction as it happens.

Tip

This use case is just an example to practice monitoring and deploying MLOps projects into production. Rather than thinking about the data here, consider how you’d apply the same techniques and Dataiku features to solve problems that matter to you!

Review the Score recipe#

Before we use the Evaluate recipe for model monitoring, let’s review the purpose of the Score recipe.

The classification model found in the Flow was trained on three months of transaction data between January and March 2017. The new_transactions dataset currently holds the next month of transactions (April 2017).

Verify the contents of the new_transactions dataset found in the Model Scoring Flow zone by navigating to its Settings tab.
Click List Files to find /transactions_prepared_2017_04.csv as the only included file.

Note

The new_transaction_data folder feeding the new_transactions dataset holds nine CSV files: one for each month following the model’s training data. This monthly data has already been prepared using the same transformations as the model’s training data, and so it’s ready to be scored or evaluated.

It also is already labeled. In other words, it has known values for the target authorized_flag column. However, we can ignore these known values, for example, when it comes to scoring or input drift monitoring.

For this quick review, assume new_transactions has empty values for authorized_flag. If this were the case, our next step would be to input these new unknown records and the model to the Score recipe in order to output a prediction of how likely each record is to be fraudulent.

In the Model Scoring Flow zone, select the test_scored dataset.
In the Actions tab of the right panel, select Build.
Select Build Dataset with the Build Only This option.

When finished, compare the schema of new_transactions and test_scored. The Score recipe adds three new columns (proba_0, proba_1, and prediction) to the test_scored dataset.
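As a rough illustration of how the three added columns relate to each other, here is a minimal sketch in plain Python. It is not Dataiku's implementation: the `add_score_columns` helper, the `purchase_amount` feature, the probabilities, and the 0.5 threshold are all assumptions made for the example (a real saved model applies its own threshold).

```python
from typing import Dict, List


def add_score_columns(
    rows: List[Dict], probas_1: List[float], threshold: float = 0.5
) -> List[Dict]:
    """Return copies of rows with proba_0, proba_1, and prediction appended."""
    scored = []
    for row, p1 in zip(rows, probas_1):
        new_row = dict(row)  # leave the input rows untouched
        new_row["proba_0"] = 1.0 - p1  # probability of class 0 (failed authorization)
        new_row["proba_1"] = p1  # probability of class 1 (authorized)
        new_row["prediction"] = 1 if p1 >= threshold else 0
        scored.append(new_row)
    return scored


# Hypothetical input rows and model probabilities, for illustration only.
new_rows = [{"purchase_amount": 12.5}, {"purchase_amount": 980.0}]
scored_rows = add_score_columns(new_rows, probas_1=[0.91, 0.22])

for row in scored_rows:
    print(row)
```

The two probability columns always sum to 1 for a binary target, and the prediction is derived from them, which is why the schema grows by exactly three columns.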

The Score recipe outputs predictions for new records, but how do we know if these predictions are similar to those produced during model training? That is the key question our monitoring setup will try to address.

See also

You can learn more about model scoring in the Knowledge Base.

Create a model monitoring pipeline for each approach#

There are two basic approaches to model monitoring, and we’ll need a separate pipeline for each one.

Ground truth vs. input drift monitoring#

Over time, a model’s input data may trend differently from its training data. Therefore, a key question for MLOps practitioners is whether a model is still performing well or if it has degraded after being deployed. In other words, is there model drift?

To definitively answer this question, we must know the ground truth, or the correct model output. However, in many cases, obtaining the ground truth can be slow, costly, or incomplete.

In such cases, we must instead rely on input drift evaluation. Using this approach, we compare the model’s training data against the new production data to see if there are significant differences.

See also

See this article on monitoring model performance and drift in production to learn more about ground truth vs. input drift monitoring.

For many real-life use cases, these two approaches are not mutually exclusive:

Input drift and prediction drift (to be defined below) are computable as soon as one has enough data to compare. You might calculate them daily or weekly.
Ground truth data, on the other hand, typically comes with a delay and may often be incomplete or require extra data preparation. Therefore, true performance drift monitoring is less frequent. You might only be able to calculate it monthly or quarterly.

Keeping this reality in mind, let’s set up two separate model monitoring pipelines that can run independently of each other.

A model evaluation store for ground truth monitoring#

Let’s start by creating the model monitoring pipeline for cases where the ground truth is available. For this, we’ll need the scored dataset.

From the Model Scoring Flow zone, select both the saved model and the test_scored dataset.
In the Actions tab of the right panel, select the Evaluate recipe.
For Outputs, set an evaluation store named mes_for_ground_truth.
Click Create Evaluation Store, and then Create Recipe.
For the settings of the Evaluate recipe, adjust the sampling method to Random (approx. nb. records), and keep the default of 10,000.
Click Save.
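To build intuition for why the sampling method is described as approximate, here is a generic sketch in plain Python. It assumes a simple scheme in which each record is kept independently with probability target / total. This is an illustration of the idea, not a description of Dataiku's internal sampler, and the `sample_approx_n` helper is hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_approx_n(records: Sequence[T], target: int, seed: int = 0) -> List[T]:
    """Keep each record independently with probability target / len(records)."""
    if not records:
        return []
    ratio = min(1.0, target / len(records))  # never exceed "keep everything"
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [record for record in records if rng.random() < ratio]


# Stand-in for 100,000 scored rows; the target matches the tutorial's default.
records = list(range(100_000))
sample = sample_approx_n(records, target=10_000)
print(len(sample))  # close to 10,000, but usually not exactly 10,000
```

Under this scheme the sample size varies slightly from run to run, and a dataset smaller than the target is returned in full.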

Take a moment to organize the Flow.

From the Flow, select both the Evaluate recipe and its output MES.
In the Actions tab, select Move.
Select New Zone.
Name the new zone Ground Truth Monitoring.
Click Confirm.

Tip

Once you have both pipelines in place, see the image below to check your work.

A model evaluation store for input drift monitoring#

Now let’s create a second model evaluation store for cases where the ground truth is not present following the same process. This time though, we’ll need the “new” transactions, which we can assume have an unknown target variable.

From the Model Scoring Flow zone, select both the saved model and the new_transactions dataset.
In the Actions tab of the right panel, select the Evaluate recipe.
For Outputs, set an evaluation store named mes_for_input_drift.
Click Create Evaluation Store, and then Create Recipe.
As before, adjust the sampling method to Random (approx. nb. records), and keep the default of 10,000.

Because of the unlabeled input data, there is one important difference in the configuration of the Evaluate recipe for input drift monitoring.

In the Settings tab of the Evaluate recipe, check the box Skip performance metrics computation found in the Output tile.
Save the recipe, and return to the Flow.
Following the steps above, move the second Evaluate recipe and MES into a new Flow zone called Input Drift Monitoring.

Caution

If you do not have the ground truth, you won’t be able to compute performance metrics, so the recipe returns an error unless you change this setting.

We now have one Flow zone dedicated to model monitoring using the ground truth and another Flow zone for the input drift approach.

See also

See the reference documentation to learn more about the Evaluate recipe.

Compare and contrast model monitoring pipelines#

These model evaluation stores are still empty! Let’s evaluate the April 2017 data, the first month beyond our model’s training data.

Build the MES for ground truth monitoring#

Let’s start with the model evaluation store that will have all performance metrics.

In the Ground Truth Monitoring Flow zone, select the mes_for_ground_truth.
In the Actions tab of the right panel, select Build.
Select Build Model Evaluation Store with the Build Only This option.
When the job finishes, open the mes_for_ground_truth.
For the single model evaluation at the bottom, scroll to the right, and observe a full range of performance metrics.

Important

One run of the Evaluate recipe produces one model evaluation.

A model evaluation contains metadata on the model and its input data, as well as the computed metrics (in this case on data, prediction, and performance).

Build the MES for input drift monitoring#

Now let’s compare it to the model evaluation store without performance metrics.

In the Input Drift Monitoring Flow zone, select the mes_for_input_drift.
In the Actions tab of the right panel, select Build.
Select Build Model Evaluation Store with the Build Only This option.
When the job finishes, open the mes_for_input_drift, and observe how performance metrics are not available.

Note

If you examine the job log for building either MES, you may notice an ML diagnostic warning (in particular, a dataset sanity check). As we’re not focused on the actual quality of the model, we can ignore this warning, but in a live situation, you’d want to pay close attention to such warnings.

Run more model evaluations#

Before diving into the meaning of these metrics, let’s add more data to the pipelines for more comparisons between the model’s training data and the new “production” data found in the new_transaction_data folder.

Get a new month of transactions#

In the Model Scoring Flow zone, navigate to the Settings tab of the new_transactions dataset.
In the Files subtab, confirm that the Files section field is set to Explicitly select files.
Click the trash can to remove /transactions_prepared_2017_04.csv.
On the right, click List Files to refresh.
Check the box to include /transactions_prepared_2017_05.csv.
Save and refresh the page to confirm that the dataset now only contains data from May.

Rebuild the MES for input drift monitoring#

We can immediately evaluate the new data in the Input Drift Monitoring Flow zone.

In the Input Drift Monitoring Flow zone, select the mes_for_input_drift.
In the Actions tab of the right panel, select Build.
Select Build Model Evaluation Store with the Build Only This option.

Rebuild the MES for ground truth monitoring#

For ground truth monitoring, we first need to send the new data through the Score recipe so that the scored dataset feeding the Evaluate recipe stays consistent with new_transactions.

In the Ground Truth Monitoring Flow zone, select the mes_for_ground_truth.
In the Actions tab of the right panel, select Build.
Select Build Upstream.
Click Preview to confirm that the job will first run the Score recipe and then the Evaluate recipe.
Click Run, and open the mes_for_ground_truth to find a second evaluation row when the job has finished.

Tip

At this point, both model evaluation stores should have two rows (two model evaluations). Feel free to repeat the process above for the months of June and beyond so that your model evaluation stores have more data to compare.

Conduct drift analysis#

Now that we have some evaluation data to examine, let’s dive into what information the model evaluation store contains. Recall that our main concern is the model becoming obsolete over time.

The model evaluation store enables monitoring of three different types of model drift:

Input data drift
Prediction drift
Performance drift (when ground truth is available)

See also

See the reference documentation to learn more about drift analysis in Dataiku.

Input data drift#

Input data drift analyzes the distribution of features in the evaluated data.

Open the mes_for_ground_truth.
For the most recent model evaluation at the bottom, click Open.
Navigate to the input data drift panel on the left, and explore the visualizations, clicking Compute as needed.
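To make the idea of comparing feature distributions concrete, here is a small sketch for a single numeric feature using the population stability index (PSI). PSI is a swapped-in, commonly used univariate statistic chosen for brevity. It is not the computation behind Dataiku's Data Drift metric, and the feature values below are made up.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float], current: Sequence[float], n_bins: int = 5
) -> float:
    """PSI between two samples, using equal-width bins fitted on the reference."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp values outside the reference range
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref = bin_shares(reference)
    cur = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical values of one feature in the training data and in a later month.
reference_values = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
shifted_values = [v + 40 for v in reference_values]

psi_same = population_stability_index(reference_values, reference_values)
psi_shifted = population_stability_index(reference_values, shifted_values)
print(psi_same, psi_shifted)
```

Identical distributions give a PSI of zero, and the value grows as the new data moves away from the reference. That is the same intuition the input data drift visualizations convey across all features at once.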

Note

See the reference documentation on input drift analysis to understand how these figures can provide an early warning sign of model degradation.

Prediction drift#

Prediction drift analyzes the distribution of predictions on the evaluated data.

Remaining within the mes_for_ground_truth, navigate to the Prediction drift panel.
If not already present, click Compute, and explore the output in the fugacity and predicted probability density chart.
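The core comparison behind prediction drift can be sketched as the share of rows predicted in each class, in the reference data versus the evaluated data. The snippet below is an illustrative approximation with made-up predictions. The helper names are hypothetical, and this is not Dataiku's exact calculation.

```python
from collections import Counter
from typing import Dict, Sequence


def predicted_class_shares(predictions: Sequence[int]) -> Dict[int, float]:
    """Fraction of rows predicted in each class."""
    counts = Counter(predictions)
    total = len(predictions)
    return {label: counts[label] / total for label in sorted(counts)}


def share_differences(
    reference: Sequence[int], current: Sequence[int]
) -> Dict[int, float]:
    """Current share minus reference share, per predicted class."""
    ref = predicted_class_shares(reference)
    cur = predicted_class_shares(current)
    labels = sorted(set(ref) | set(cur))
    return {label: cur.get(label, 0.0) - ref.get(label, 0.0) for label in labels}


# Hypothetical predictions: 80% authorized at training time, 60% in the new month.
reference_predictions = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
current_predictions = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

shifts = share_differences(reference_predictions, current_predictions)
print(shifts)
```

A large shift in these shares can signal a problem even when no ground truth is available, which is why prediction drift is useful in both monitoring pipelines.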

Performance drift#

Performance drift analyzes whether the actual performance of the model changes.

Lastly, navigate to the Performance drift panel of the mes_for_ground_truth.
If not already present, click Compute, and explore the table and charts comparing the performance metrics of the current test_scored dataset and reference training data.
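Performance drift boils down to computing the same metric on the reference data and on newly labeled data, and then comparing the two values. Here is a minimal sketch using accuracy only, with made-up labels and predictions. The metrics shown in the MES are richer than this, and the `accuracy` helper is written for the example.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of rows where the prediction matches the ground truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ground truth and predictions at training time (one error)...
reference_true = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
reference_pred = [1, 1, 1, 0, 0, 1, 1, 0, 1, 0]

# ...and on a later month of labeled data (three errors).
current_true = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
current_pred = [1, 1, 1, 0, 1, 1, 1, 0, 1, 0]

reference_accuracy = accuracy(reference_true, reference_pred)
current_accuracy = accuracy(current_true, current_pred)
performance_change = current_accuracy - reference_accuracy
print(reference_accuracy, current_accuracy, performance_change)
```

A negative change means the model performs worse on the new data than it did on the reference data. This comparison is only possible when the ground truth has arrived.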

Note

Thus far, we’ve only examined the drift analysis for the MES that computes performance metrics. Check the other MES to confirm that performance drift is not available. Moreover, you need to be using at least Dataiku 11.3 to have the prediction drift computed without ground truth.

Automate model monitoring#

Of course, we don’t want to manually build the model evaluation stores every time. We can automate this task with a scenario.

In addition to scheduling the computation of metrics, we can also automate actions based on the results. For example, assume our goal is to automatically retrain the model if a certain metric (such as data drift) exceeds a certain threshold. Let’s create the bare bones of a scenario to accomplish this kind of objective.

Note

In this case, we will be monitoring a MES metric. We can also monitor datasets with data quality rules.

Create a check on a MES metric#

Our first step is to choose a metric important to our use case. Since it’s one of the most common, let’s choose data drift.

From the Flow, open the mes_for_input_drift, and navigate to the Settings tab.
Switch to the Status checks subtab.
Click Metric Value is in a Numeric Range.
Name the check Data Drift < 0.4.
Choose Data Drift as the metric to check.
Set the Soft maximum to 0.3 and the Maximum to 0.4.
Click Check to confirm it returns an error.
Click Save.
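The logic of a numeric range check with a soft maximum and a maximum can be sketched as follows. This is a simplified model of the behavior configured above, assuming that a value above the soft maximum yields a warning and a value above the maximum yields an error. The function name and the sample drift values are hypothetical.

```python
from typing import Optional


def check_numeric_range(
    value: float,
    soft_maximum: Optional[float] = None,
    maximum: Optional[float] = None,
) -> str:
    """Return OK, WARNING, or ERROR for a metric value against upper bounds."""
    if maximum is not None and value > maximum:
        return "ERROR"  # hard limit exceeded
    if soft_maximum is not None and value > soft_maximum:
        return "WARNING"  # soft limit exceeded, hard limit respected
    return "OK"


# Thresholds from the tutorial; the drift values themselves are made up.
for drift in (0.25, 0.35, 0.55):
    print(drift, check_numeric_range(drift, soft_maximum=0.3, maximum=0.4))
```

With the thresholds from this tutorial, only a drift value above 0.4 produces the error state that the scenario in the next section reacts to.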

Now let’s add this check to the display of metrics for the MES.

Switch to the Status tab.
Click X/Y Metrics.
Add both the data drift metric and the new check to the display.
Click Save once more.

Tip

Here we’ve deliberately chosen a data drift threshold to throw an error. Defining an acceptable level of data drift is dependent on your use case.

Design the scenario#

Just like any other check, we now can use this MES check to control the state of a scenario run.

From the Jobs menu in the top navigation bar, open the Scenarios page.
Click + New Scenario.
Name the scenario Retrain Model.
Click Create.

First, we need the scenario to build the MES.

Navigate to the Steps tab of the new scenario.
Click Add Step.
Select Build / Train.
Name the step Build MES.
Click Add Evaluation Store to Build, select mes_for_input_drift, and click Add.

Next, the scenario should run the check we’ve created on the MES.

Still in the Steps tab, click Add Step.
Select Run checks.
Name the step Run MES checks.
Again, click Add Evaluation Store to Check, select mes_for_input_drift, and click Add.

Finally, we need to build the model, but only in cases where the checks fail.

Click Add Step.
Select Build / Train.
Name the step Build model.
Click Add Model to Build, select the saved model, and click Add.
Change the Run this step setting to If some prior step failed (that step being the Run checks step).
Check the box to Reset failure state.
Click Save when finished.
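The control flow of the three steps above can be summarized in a short plain-Python sketch. It does not call the Dataiku API. The three callables are hypothetical stand-ins for the scenario steps, and the sketch assumes that the Run checks step counts as failed only when the check returns an error.

```python
from typing import Callable, List, Tuple


def run_retrain_scenario(
    build_mes: Callable[[], None],
    run_checks: Callable[[], str],
    train_model: Callable[[], None],
) -> Tuple[List[str], bool]:
    """Run the steps in order; return the executed steps and the final failure state."""
    executed: List[str] = []

    build_mes()
    executed.append("Build MES")

    check_outcome = run_checks()
    executed.append("Run MES checks")
    failed = check_outcome == "ERROR"

    # "If some prior step failed": retrain only when the check errored.
    if failed:
        train_model()
        executed.append("Build model")
        failed = False  # mirrors the "Reset failure state" option

    return executed, failed


# Stand-in steps: one run where drift is too high, one where it is acceptable.
steps_when_drifting, failed_after_drift = run_retrain_scenario(
    lambda: None, lambda: "ERROR", lambda: None
)
steps_when_stable, failed_when_stable = run_retrain_scenario(
    lambda: None, lambda: "OK", lambda: None
)
print(steps_when_drifting)
print(steps_when_stable)
```

Resetting the failure state means a run that successfully retrains the model ends without being marked as failed, even though the check itself returned an error.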

Add a scenario trigger (optional)#

For this demonstration, we’ll trigger the scenario manually, but in real-life cases, we’d create a trigger based on how often or under what conditions we’d want to run the scenario.

Let’s imagine we have enough data to make a fair comparison every week.

Switch to the Settings tab of the scenario.
Click Add Trigger > Time-based trigger.
Name the trigger Weekly, and have it repeat every 1 week.
Click Save.

Tip

Feel free to also add a reporter to receive alerts about the scenario’s progress!

Run the scenario#

Let’s introduce another month of data to the pipeline, and then run the scenario.

Return to the new_transactions dataset in the Model Scoring Flow zone.
On the Settings tab, switch the data to the next month as done previously.
Return to the Retrain Model scenario.
Click Run to manually trigger the scenario.
On the Last Runs tab, observe its progression.
Assuming your MES check failed as intended, open the saved model to see a new active version!

Note

The goal of this tutorial is to cover the foundations of model monitoring. But you can also think about how this specific scenario would fail to meet real-world requirements.

For one, it retrained the model on the original data!
Secondly, model monitoring is a production task, and so this kind of scenario should be moved to the Automation node.

Create additional model monitoring assets#

Once you have your model monitoring setup in place, you can start building informative assets on top of it to bring more people into the model monitoring arena.

Create a model monitoring dashboard#

Initially the visualizations inside the MES may be sufficient, but you may soon want to embed these metrics inside a dashboard to more easily share results with collaborators.

From the Dashboards page (g + p), open the project’s default dashboard.
Click Edit near the top right.
Click the plus button to add a tile.
For the first tile, choose Metrics, with type as Model evaluation store, source as mes_for_input_drift, and metric as Data Drift.
Click Add, and adjust the tile to a more readable size (e.g. a 3 x 3 tile square).
Click the Copy icon near the top right of the tile, and click Copy once more to duplicate the tile in the same slide of the dashboard.
For the second tile, in the Tile tab on the right, change Metrics options to History, and adjust the tile size (e.g. a 5 x 3 tile).

Although we could add much more detail, let’s add just one more tile.

Click the plus button to add a third tile, and choose Scenario.
With the Last runs option selected, select Retrain Model as the source scenario, and click Add.
Adjust the tile size to a height of two tiles.
Click Save, and then View to see the foundation of a model monitoring dashboard.

Note

When even more customization is required, you’ll likely want to explore building a custom webapp (which can also be embedded inside a native dashboard).

Optional: Create MES metrics datasets#

Dataiku allows for dedicated datasets for metrics and checks on objects like datasets, saved models, and managed folders. We can do the same for model evaluation stores. These datasets can be particularly useful for feeding into charts, dashboards, and webapps.

Open either MES, and navigate to the Status tab.
Click the gear icon, and select Create dataset from metrics data.
Move the MES metrics dataset to its respective Flow zone.

What’s next?#

You have achieved a great deal in this tutorial! Most importantly, you:

Created pipelines to monitor a model in situations where you do and do not have access to ground truth data.
Used input drift, prediction drift, and performance drift to evaluate model degradation.
Designed a scenario to automate periodic model retraining based on the value of a MES metric.
Gave stakeholders visibility into this process with a basic dashboard.

Now that we have monitoring infrastructure set up on both data quality and the model, let’s learn how to batch deploy to a production infrastructure!