- Overview, goals, learning types, and algorithms
- Data selection, preparation, and modeling
- Model evaluation, validation, complexity, and improvement
- Model performance and error analysis
- Unsupervised learning, related fields, and machine learning in practice
Welcome to the third chapter in a five-part series about machine learning.
In this chapter, we’ll continue our machine learning discussion, and focus on problems associated with overfitting data, as well as controlling model complexity, a model evaluation and errors introduction, model validation and tuning, and improving model performance.
Overfitting is one of the greatest concerns in predictive analytics and machine learning. Overfitting refers to a situation where the model chosen to fit the training data fits too well, and essentially captures all of the noise, outliers, and so on.
The consequence of this is that the model will fit the training data very well, but will not accurately predict cases not represented by the training data, and therefore will not generalize well to unseen data. This means that the model performance will be better with the training data than with the test data.
A model is said to have high variance when it leans more towards overfitting, and conversely has high bias when it doesn’t fit the data well enough. A high variance model will tend to be quite flexible and overly complex, while a high bias model will tend to be very opinionated and overly simplified. A good example of a high bias model is fitting a straight line to very nonlinear data.
In both cases, the model will not make very accurate predictions on new data. The ideal situation is to find a model that is not overly biased, nor does it have a high variance. Finding this balance is one of the key skills of a data scientist.
Overfitting can occur for many reasons. A common one is that the training data consists of many features relative to the number of observations or data points. In this case, the data is relatively wide as compared to long.
To address this problem, reducing the number of features can help, or finding more data if possible. The downside to reducing features is that you lose potentially valuable information.
Another option is to use a technique called regularization, which will be discussed later in this series.
Controlling Model Complexity
Model complexity can be characterized by many things, and is a bit subjective. In machine learning, model complexity often refers to the number of features or terms included in a given predictive model, as well as whether the chosen model is linear, nonlinear, and so on. It can also refer to the algorithmic learning complexity or computational complexity.
Overly complex models are less easily interpreted, at greater risk of overfitting, and will likely be more computationally expensive.
There are some really sophisticated and automated methods by which to control, and ultimately reduce model complexity, as well as help prevent overfitting. Some of them are able to help with feature and model selection as well.
These methods include linear model and subset selection, shrinkage methods (including regularization), and dimensionality reduction.
Regularization essentially keeps all features, but reduces (or penalizes) the effect of some features on the model’s predicted values. The reduced effect comes from shrinking the magnitude, and therefore the effect, of some of the model’s term’s coefficients.
The two most popular regularization methods are ridge regression and lasso. Both methods involve adding a tuning parameter (Greek lambda) to the model, which is designed to impose a penalty on each term’s coefficient based on its size, or effect on the model.
The larger the term’s coefficient size, the larger the penalty, which basically means the more the tuning parameter forces the coefficient to be closer to zero. Choosing the value to use for the tuning parameter is critical and can be done using a technique such as cross-validation.
The lasso technique works in a very similar way to ridge regression, but can also be used for feature selection as well. This is due to the fact that the penalty term for each predictor is calculated slightly differently, and can result in certain terms becoming zero since their coefficients can become zero. This essentially removes those terms from the model, and is therefore a form of automatic feature selection.
Ridge regression or lasso techniques may work better for a given situation. Often the lasso works better for data where the response is best modeled as a function of a small number of the predictors, but this isn’t guaranteed. Cross-validation is a great technique for evaluating one technique versus the other.
Given a certain number of predictors (features), there is a calculable number of possible models that can be created with only a subset of the total predictors. An example is when you have 10 predictors, but want to find all possible models using only 2 of the 10 predictors.
Doing this, and then selecting one of the models based on the smallest test error, is known as subset selection, or sometimes as best subset selection. Note that a very useful plot for subset selection is when plotting the residual sum of squares (discussed later) for each model against the number of predictors.
When the number of predictors gets large enough, best subset selection becomes unable to deal with the huge number of possible model combinations for a given subset of predictors. In this case, another method known as stepwise selection can be used. There are two primary versions, forward and backward stepwise selection.
In forward stepwise selection, predictors are added to the model one at a time starting at zero predictors, until all of the predictors are included. Backwards stepwise selection is the opposite, and involves starting with a model including all predictors, and then removing a single predictor at each step.
The model performance is evaluated at each step in both cases. In both subset selection and stepwise selection, the test error is used to determine the best model. There are many ways to estimate test errors, which will be discussed later in this series.
There is a concept that deals with highly dimensional data (i.e., large number of features) known as the curse of dimensionality. The curse of dimensionality refers to the fact that the computational speed and memory required increases exponentially as the number of data dimensions (features) increases.
This can manifest itself as a problem where a machine learning algorithm does not scale well to higher dimensional data11. One way to deal with this issue is to choose a different algorithm that can scale better with the data. The other is a technique known as dimensionality reduction.
Dimensionality reduction is a technique used to reduce the number of features included in the machine learning process. It can help reduce complexity, reduce computational cost, and increase machine learning algorithm computational speed. It can be thought of as a technique that transforms the original predictors to a new, smaller set of predictors, which are then used to fit a model.
Principal component analysis (PCA) was discussed previously in the context of feature selection, but is also a widely-used dimensionality reduction technique as well. It helps reduce the number of features (i.e., dimensions) by finding, separating out, and sorting the features that explain the most variance in the data in descending order. Cross-validation is a great way to determine the number of principal components to include in the model.
An example of this would be a dataset where each observation is described by ten features, but only three of the features can describe the majority of the data’s variance, and therefore are adequate enough for creating a model with, and generating accurate predictions.
Note that people sometimes use PCA to prevent overfitting since fewer features implies that the model is less likely to overfit. While PCA may work in this context, it is not a good approach and is therefore not recommended. Regularization should be used to address overfitting concerns instead8.
Model Evaluation and Performance
Assuming you are working with high quality, unbiased, and representative data, then the next most important aspects of predictive analytics and machine learning is measuring model performance, possibly improving it if needed, and understanding potential errors that are often encountered.
We will have an introductory discussion here about model performance, improvement, and errors, but will continue with much greater detail on these topics in the next chapter.
Model performance is typically used to describe how well a model is able to make predictions on unseen data (e.g., test, but NOT training data), and there are multiple methods and metrics used to assess and gauge model performance. A key measure of model performance is to estimate the model’s test error.
The test error can be estimated either indirectly or directly. It can estimated and adjusted indirectly by making changes that affect the training error, since the training error is a measure of overfitting (bias and/or variance) to some extent.
Recall that the more the model overfits the data (high variance), the less well the model will generalize to unseen data. Given that, the assumption is that reducing variance should improve the test error as well.
The test error can also be estimated directly by testing the model with the held out test data, and usually works best in conjunction with a resampling method such as cross-validation, which we’ll discuss later.
Estimating a model’s test error not only helps determine a model’s performance and accuracy, but is also a very powerful way to select a model too.
Improving Model Performance and Ensemble Learning
There are many ways to improve a model’s performance. The quality and quantity of data used has a huge, if not the biggest impact on model performance, but sometimes these two can’t easily be changed.
Other major influencers on model performance include algorithm tuning, feature engineering, cross-validation, and ensemble methods.
Algorithm tuning refers to the process of tweaking certain values that effectively initialize and control how a machine learning algorithm learns and generates predictive models. This tuning can be used to improve performance using the separate validation data set, and later performance tested with the test dataset.
Since most algorithm tuning parameters are algorithm-specific and sometimes very complex, a detailed discussion is out of scope for this article, but note that the lambda parameter described for regularization is one such tuning parameter.
Ensemble learning, as mentioned in an earlier post, deals with combining or averaging (regression) the results from multiple learning models in order to improve predictive performance. In some cases (classification), ensemble methods can be thought of as a voting process where the majority vote wins.
Two of the most common ensemble methods are bagging (aka bootstrap aggregating) and boosting. Both are helpful with improving model performance and in reducing variance (overfitting) and bias (underfitting).
Bagging is a technique by which the training data is sampled with replacement multiple times. Each time a new training data set is created and a model is fitted to the sample data. The models are then combined to produce the overall model output, which can be used to measure model performance.
Boosting is a technique designed to transform a set of so-called weak learners into a single strong learner. In plain English, think of a weak learner as a model that predicts only slightly better than random guessing, and a strong learner as a model that can predict to certain degree of accuracy better than random guessing.
While complicated, boosting basically works by iteratively creating weak models and adding them to the single strong learner. While this process happens, model accuracy is tested and then weightings are applied so that future learners focus on improving model performance for cases that were previously not well predicted.
Another very popular ensemble method is known as random forests. Random forests are essentially the combination of decision trees and bagging.
Kaggle is arguably the world’s most prestigious data science competition platform, and features competitions that are created and sponsored by most of the notable Silicon Valley tech companies, as well as by other very well-known corporations. Ensemble methods such as random forests and boosting have enjoyed very high success rates in winning these competitions.
Model Validation and Resampling Methods
Model validation is a very important part of the machine learning process. Validation methods consist of creating models and testing them on a validation dataset.
Resulting validation-set error provides an estimate of the test error and is typically assessed using mean squared error (MSE) in the case of a quantitative response, and misclassification rate in the case of a qualitative (discrete) response.
Many validation techniques are categorized as resampling methods, which involve refitting models to different samples formed from a set of training data.
Probably the most popular and noteworthy technique is called cross-validation. The key idea of cross-validation is that the model’s accuracy on the training set is optimistic, and that a better estimate comes from the model’s accuracy on the test set. The idea then is to estimate the test set accuracy while in the model training stage.
The process involves repeated splitting of the data into different training and test sets, building the model on the training set, and then evaluating it on the test set, and finally repeating and averaging the estimated errors.
In addition to model validation and helping to prevent overfitting, cross-validation can be used for feature selection, model selection, model parameter tuning, and comparing different predictors.
A popular special case of cross-validation is known as k-fold cross-validation. This technique involves selecting a number k, which represents the number of partitions of equal size that the original data is divided into. Once divided, a single partition is designated as a validation dataset (i.e., for testing the model), and the remaining k-1 data partitions are used as training data.
Note that typically the larger the chosen k, the less bias, but more variance, and vice versa. In the case of cross-validation, random sampling is done without replacement.
There is another technique that involves random sampling with replacement that is known as the bootstrap. The bootstrap technique tends to underestimate the error more than cross-validation.
Another special case is when k=n, i.e., when k equals the number of observations. In this case, the technique is known as leave-one-out cross-validation (LOOCV).
In this chapter, we have discussed many concepts and techniques associated with model evaluation, validation, complexity, and improvement.
Chapter four of this series will provide a much deeper dive into concepts and metrics related to model performance evaluation and error analysis.
- Wikipedia: Machine Learning
- Wikipedia: Supervised Learning
- Wikipedia: Unsupervised Learning
- Wikipedia: List of machine learning concepts
- Wikipedia: Feature Selection
- Wikipedia: Cross-validation
- Practical Machine Learning Online Course - Johns Hopkins University
- Machine Learning Online Course - Stanford University
- Statistical Learning Online Course - Stanford University
- Wikipedia: Regularization
- Wikipedia: Curse of dimensionality
- Wikipedia: Bagging, aka Bootstrap Aggregating
- Wikipedia: Boosting