# Machine Learning: An In-Depth Guide - Model Performance and Error Analysis

## Introduction

Welcome to the fourth article in a five-part series about machine learning.

In this article, we will take a deeper dive into model evaluation and performance metrics, and potential prediction-related errors that one may encounter.

## Residuals and Classification Results

Before digging deeper into model performance and error types, we must first discuss the concept of residuals and errors for regression, positive and negative classifications for classification problems, and in-sample versus out-of-sample measurements.

Any reference to models, metrics, or errors computed with respect to the data used to train, validate, or tune a predictive model (i.e., data you have) is called in-sample. Conversely, reference to test data metrics and errors, or new data in general is called out-of-sample (i.e., data you don’t have).

Recall that regression involves predicting a continuous valued output (response) based on some set of input variables (features/predictors). The difference between the model’s predicted response value and the actual observed response value from the in-sample data is called the residual for each point, and residuals refers collectively to all of the differences between all predicted and actual values. Each out-of-sample (new/test data) difference is called a prediction error instead of residual.

For the classification case, and for simplicity, we will only discuss binary classification (two classes). Prior to performing classification on data observations, one must define what is a positive classification and what is a negative classification. In the case of spam or ham (i.e., not spam), spam may be the positive designation and ham is the negative.

If a model predicts an incoming email as being spam, and it really is spam, then that’s considered a true positive. Positive since the model predicted spam (the positive class), and true because the actual class matched the prediction. Conversely, if an incoming email is labeled spam when it’s actually not spam, it is considered a false positive.

Given this, we can see that the results of a classification model on new data can fall into four potential buckets. These include: true positives, false positives (type 1 error), true negatives, and false negatives (type 2 error). In all four cases, true or false refers to whether the actual class matched the predicted class, and positive or negative refers to which classification was assigned to an observation by the model.

Note that false is synonymous with error in this case since the model failed to predict correctly.

## Model Performance Overview

Now that we’ve covered residuals and classification result types, we will begin the discussion of model performance metrics that are based on these concepts.

Here is a non-exhaustive list of model evaluation methods, visualizations, and performance metrics that are used in machine learning and predictive analytics. They are categorized by their most common use case, but some may apply to more than one category (e.g., accuracy).

In addition to model evaluation, many of these can also be used for model comparison, selection, and tuning. Many of these are very powerful when combined with the cross-validation technique described earlier in this series.

• Regression performance

• R2 and adjusted R2 (aka explained variance)

• Mean squared error (MSE), or root mean squared error (RMSE)

• Mean error, or mean absolute error

• Median error, or median absolute error

• Classification performance

• Confusion matrix

• Precision

• Recall (aka sensitivity)

• Specificity

• Accuracy

• Lift

• Area under the ROC curve (AUC)

• F-score

• Log-loss

• Average precision

• Precision/recall break-even point

• Root mean squared error (RMSE)

• Mean cross entropy

• Probability calibration

• Bias variance tradeoff and model complexity

• Validation curve

• Learning curve

• Residual sum of squares

• Goodness-of-fit metrics

• Model validation and selection

• Mallow’s Cp

• Akaike information criterion (AIC)

• Bayesian information criterion (BIC)

Performance metrics should be chosen based on the problem domain, project goals, and the business objectives. Unfortunately there isn’t a one-size-fits-all approach, and often there are tradeoffs to consider.

While a discussion of all of these methods and metrics is out of scope for this series, we will cover a few key ones next.

## Model Performance Evaluation Metrics

Regression

There are many metrics for determining model performance for regression problems, but the most commonly used metric is known as the mean square error (MSE), or variation called the root mean square error (RMSE), which is calculated by taking the square root of the mean squared error. The root mean square error is typically preferred since taking the square root changes the units of the error measurement to be the same and proportional to the response variable’s units.

The error in this case is the difference in value between a given model prediction and its actual value for an out-of-sample observation. The mean squared error is therefore the average of all of the squared errors across all new observations, which is the same as adding all of the squared errors (sum of squares) and dividing by the number of observations.

In addition to being used as a stand-alone performance metric, mean squared error (or RMSE) can also be used for model selection, controlling model complexity, and model tuning. Often many models are created and evaluated (e.g., cross-validation), and then MSE (or similar metric) is plotted on the y-axis, with the tuning or validation parameter given on the x-axis.

The tuning or validation parameter is changed in each model creation and evaluation step, and the plot described above can help determine the ideal tuning parameter value. The number of predictors is a great example of a potential tuning parameter in this case.

Before moving on to classification, it is worth mentioning R2 briefly. R2 is often thought of as a measure of model performance, but it’s actually not. R2 is a measure of the amount of variance explained by the model, and is given as a number between 0 and 1. A value of 1 means the model explains all of the data perfectly, but when computed with training data is more of an indication of potential overfitting than high predictive performance.

As discussed earlier, the more complex the model, the more the model tends to fit the data better and potentially overfit, or contribute to additional model variance. Given this, adjusted R2 is a more robust and reliable metric in that it adjusts for any increases in model complexity (e.g., adding more predictors), so that one can better gauge underlying model improvement in lieu of the increased complexity.

Classification

Recall the different results from a binary classifier, which are true positives, true negatives, false positives, and false negatives. These are often shown in a confusion matrix. Here is a very generalized and comprehensive example of one from Wikipedia, and note that the graphic is shown with concepts and metrics, and not actual data.

And here is an example from Wikipedia with the values filled in for different classifier models evaluated against 200 observations. Note the calculation and variation of the metrics across the different models.

A confusion matrix is conceptually the basis of many classification performance metrics as shown. We will discuss a few of the more popular ones associated with machine learning here.

Accuracy is a key measure of performance, and is more specifically the rate at which the model is able to predict the correct value (classification or regression) for a given data point or observation. In other words, accuracy is the proportion of correct predictions out of all predictions made.

The other two metrics from the confusion matrix worth discussing are precision and recall. Precision (positive predictive value) is the ratio of true positives to the total amount of positive predictions made (i.e., true or false). Said another way, precision measures the proportion of accurate positive predictions out of all positive predictions made.

Recall on the other hand, or true positive rate, is the ratio of true positives to the total amount of actual positives, whether predicted correctly or not. So in other words, recall measures the proportion of accurate positive predictions out of all actual positive observations.

A metric that is associated with precision and recall is called the F-score (also called F1 score), which combines them mathematically, and somewhat like a weighted average, in order to produce a single measure of performance based on the simultaneous values of both. Its values range from 0 (worst) to 1 (best).

Another important concept to know about is the receiver operating characteristic, which when plotted, results in what’s known as an ROC curve (shown below, image courtesy of BOR at the English language Wikipedia).

An ROC curve is a two-dimensional plot of sensitivity (recall, or true positive rate) vs specificity (false positive rate). The area under the curve is referred to as the AUC, and is a numeric metric used to represent the quality and performance of the classifier (model).

An AUC of 0.5 is essentially the same as random guessing without a model, whereas an AUC of 1.0 is considered a perfect classifier. Generally, the higher the AUC value the better, and an AUC above 0.8 is considered quite good.

The higher the AUC value, the closer the curve gets to the upper left corner of the plot. One can easily see from the ROC curves then that the goal is to find and tune a model that maximizes the true positive rate, while simultaneously minimizing the false positive rate. Said another way, the goal as shown by the ROC curve is to correctly predict as many of the actual positives as possible, while also predicting as many of the actual negatives as possible, and therefore minimize errors (incorrect classifications) for both.

As mentioned previously in this series, model performance can be measured in many ways, and the method used should be chosen based on project goals, business domain considerations, and so on.

It is also worth noting that according to many experts, different performance metrics are thought to be biased for varying reasons. Given the breadth and complexity of this topic, the reader is encouraged to refer to external resources for further information on performance evaluation and the tradeoffs involved.

## Error Analysis and Tradeoffs

There are multiple types of errors associated with machine learning and predictive analytics. The primary types are in-sample and out-of-sample errors. In-sample errors (aka resubstitution errors) are the error rate found from the training data, i.e., the data used to build predictive models.

Out-of-sample errors (aka generalization errors) are the error rates found on a new data set, and are the most important since they represent the potential performance of a given predictive model on new and unseen data.

In-sample error rates may be very low and seem to be indicative of a high-performing model, but one must be careful, as this may be due to overfitting as mentioned, which would result in a model that is unable to generalize well to new data.

Training and validation data is used to build, validate, and tune a model, but test data is used to evaluate model performance and generalization capability. One very important point to note is that prediction performance and error analysis should only be done on test data, when evaluating a model for use on non-training or new data (out-of-sample).

Generally speaking, model performance on training data tends to be optimistic, and therefore data errors will be less than those involving test data. There are tradeoffs between the types of errors that a machine learning practitioner must consider and often choose to accept.

For binary classification problems, there are two primary types of errors. Type 1 errors (false positives) and Type 2 errors (false negatives). It’s often possible through model selection and tuning to increase one while decreasing the other, and often one must choose which error type is more acceptable. This can be a major tradeoff consideration depending on the situation.

A typical example of this tradeoff dilemma involves cancer diagnosis, where the positive diagnosis of having cancer is based on some test. In this case, a false positive means that someone is told that have have cancer when they do not. Conversely, the false negative case is when someone is told that they do not have cancer when they actually do.

If no model is perfect, then in the example above, which is the more acceptable error type? In other words, of which one can we accept to a greater degree?

Telling someone they have cancer when they don’t can result in tremendous emotional distress, stress, additional tests and medical costs, and so on. On the other hand, failing to detect cancer in someone that actually has it can mean the difference between life and death.

In the spam or ham case, neither error type is nearly as serious as the cancer case, but typically email vendors err slightly more on the side of letting some spam get into your inbox as opposed to you missing a very important email because the spam classifier is too aggressive.

## Summary

In this article, we have discussed many concepts and metrics associated with model evaluation, performance, and error analysis.

The fifth and final article of this series will revisit unsupervised learning in greater detail, followed by an overview of similar and highly related fields to machine learning. This series will conclude with an overview of machine learning as used in real world applications.

Stay tuned!

Alex Castrounis