Machine Learning: An In-Depth Guide - Data Selection, Preparation, and Modeling


  1. Overview, goals, learning types, and algorithms
  2. Data selection, preparation, and modeling
  3. Model evaluation, validation, complexity, and improvement
  4. Model performance and error analysis
  5. Unsupervised learning, related fields, and machine learning in practice


Welcome to the second article in a five-part series about machine learning.

In this article, we will briefly introduce model performance concepts, and then focus on the following parts of the machine learning process: data selection, preprocessing, feature selection, model selection, and model tradeoff considerations.

Model Performance Introduction

Model performance can be defined in many ways, but in general, it refers to how effectively the model is able to achieve the solution goals for a given problem (e.g., prediction, classification, anomaly detection, recommendation).

Since the goals can differ for each problem, the measure of performance can differ as well. Some common performance measures include accuracy, precision, recall, the receiver operating characteristic (ROC) curve, and so on. These will be discussed in much greater detail throughout the rest of this series.
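As a quick preview of how a few of these measures are computed, here is a minimal sketch for a binary classifier. The function name and the labels are made up purely for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    # Count the outcomes that the three measures are built from.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when there are no positive predictions or labels.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
```

Accuracy is the share of all predictions that were correct, precision is the share of positive predictions that were correct, and recall is the share of actual positives that were found.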

Data Selection and Preprocessing

As the saying goes, garbage in, garbage out. A predictive model is of little value if the data used to build it is non-representative, low quality, error ridden, and so on. The quality, amount, preparation, and selection of data are critical to the success of a machine learning solution.

The first step to ensure success is to avoid selection bias. Selection bias occurs when the samples used to produce the model are not fully representative of cases that the model may be used for in the future, particularly with new and unseen data.

Data is typically messy and often contains missing values, useless values (e.g., NA), outliers, and so on. Prior to modeling and analysis, raw data needs to be parsed, cleaned, transformed, and pre-processed. This is typically referred to as data munging or data wrangling.

Missing values are often imputed, meaning that they are filled in or substituted with estimated values (e.g., the mean or median of the feature). This is conceptually very similar to interpolation.
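Here is a minimal sketch of one simple form of imputation, assuming missing values are represented as None and are replaced by the mean of the observed values. The function name and the data are hypothetical, and mean imputation is only one of many possible strategies.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Made-up feature column with two missing entries.
ages = [20.0, None, 30.0, 40.0, None]
imputed_ages = impute_mean(ages)
```

The observed values are left untouched, and each gap is filled with the same estimate (30.0 here).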

In addition, sometimes feature values are scaled (feature scaling) and/or standardized (normalized). The most typical method of standardizing feature data is to subtract the mean across a given feature’s values from each individual observation value, and then divide by the standard deviation of that feature’s values.

Feature scaling is used to bring the different features' value ranges into similarity. This helps prevent certain features from dominating models and predictions, and also helps avoid computing problems when running machine learning optimization algorithms (speed, convergence, etc.).
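The standardization described above (subtract the mean, then divide by the standard deviation) can be sketched as follows. This is a minimal illustration with made-up data; it assumes the population standard deviation, and using the sample standard deviation instead would be an equally reasonable choice.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up feature with a large value range.
incomes = [30000.0, 45000.0, 60000.0, 75000.0, 90000.0]
z_scores = standardize(incomes)
```

After standardizing, the feature has a mean of zero and a standard deviation of one, regardless of its original units.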

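Another preprocessing technique is to create dummy variables, which basically means that you convert qualitative (categorical) variables to quantitative variables. An example is taking a color feature (e.g., green, red, and blue) and transforming it into separate binary indicator features, one per color, each taking the value 1 when an observation has that color and 0 otherwise. Simply mapping the colors to 1, 2, and 3 would instead imply an ordering that does not exist. Dummy variables make it possible to perform regression with qualitative features.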
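Dummy variables are commonly built as one 0/1 indicator column per category. Here is a minimal sketch of that approach; the function name is hypothetical, and the columns are simply ordered alphabetically.

```python
def one_hot(values):
    """Return (categories, rows), with one 0/1 indicator column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


colors = ["green", "red", "blue", "green"]
categories, encoded = one_hot(colors)
# categories is ['blue', 'green', 'red'], and each row of encoded contains
# exactly one 1, in the column that matches the observation's color.
```

In a regression model with an intercept, one of the indicator columns is typically dropped, since it is fully determined by the others.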

Data Splitting

Recall from the first article that the data used for machine learning should be split into training and test datasets, as well as an optional third validation dataset for model validation and tuning.

Choosing the size of each dataset can be somewhat subjective and dependent on the overall sample size, and a full discussion is out of scope for this series. As an example, however, given training and test datasets only, some people split the data into 80% training and 20% testing.

In general, more training data tends to result in a better model, and more testing data results in a more reliable estimate of model performance and overall generalization capability.
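An 80/20 split like the one mentioned above can be sketched in a few lines. The function name, the default fraction, and the fixed seed are all assumptions made for illustration; the rows are shuffled first so that any ordering in the original data does not leak into the split.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and return (train_rows, test_rows)."""
    shuffled = list(rows)  # copy, so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for 100 observations.
rows = list(range(100))
train_rows, test_rows = train_test_split(rows)
```

Every observation ends up in exactly one of the two sets, which is what allows the test set to stand in for new and unseen data.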

Feature Selection and Feature Engineering

Once you have a representative, unbiased, cleaned, and fully prepared dataset, typical next steps include feature selection and feature engineering of the training data. Note that although discussed here, both of these techniques can also be used later in the process for improving model performance.

Feature selection is the process of selecting a subset of features from which to build a predictive regression model or classifier. This is usually done for model simplification and increased interpretability, reducing training times and computational cost, and to help reduce the risk of overfitting, and thus improve model generalization.

Basic techniques for feature selection, particularly for regression problems, involve estimates of model parameters (i.e., model coefficients) and their significance, and correlation estimates amongst features. This will be discussed further in a section about parametric models.

Some advanced techniques used for feature selection are principal component analysis (PCA), singular value decomposition (SVD), and linear discriminant analysis (LDA).

Principal component analysis is a statistical technique that finds new features (principal components), each a linear combination of the original features, ordered from the most to the least variance that they capture in the data. Singular value decomposition is a lower-level linear algebra technique that is commonly used to compute PCA.

Linear discriminant analysis is closely related to PCA in that they’re both linear transformation techniques. PCA however is more general and is not concerned with class labels (unsupervised), whereas LDA is more specific and is concerned with class labels (supervised).
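To make the idea of ordering components by variance concrete, here is a small sketch of PCA for the special case of two features. It assumes the sample covariance matrix and uses the closed-form eigenvalues of a 2x2 symmetric matrix rather than SVD, which is only practical in this toy setting. The function name and the data are made up.

```python
import math
from statistics import mean


def pca_2d_explained_variance(xs, ys):
    """Return the share of total variance captured by each of the two components."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    total = a + c
    if total == 0:
        raise ValueError("both features are constant")
    # Eigenvalues of a symmetric 2x2 matrix, largest first.
    mid = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return [(mid + radius) / total, (mid - radius) / total]


# Two perfectly correlated features: ys is exactly twice xs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
ratios = pca_2d_explained_variance(xs, ys)
```

Because the two features here move in lockstep, the first component captures essentially all of the variance, and the second could be dropped with no loss of information.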

Feature engineering includes feature selection as a sub-category, but also involves other aspects such as creating new features, transforming raw data into domain-specific and interpretable features, and so on.

Parametric Models and Feature Selection

Many machine learning models are a type of parametric model. A good example is the equation describing a line (i.e., linear model), which is shown here, and includes the slope (β), intercept coefficient (α), and an error term (ε).

  y = α + βx + ε

With parametric models, the coefficients of the terms are called the parameters, and are usually designated by the Greek letter beta and a subscript (e.g., β1 … βn). In regression problems, the parameters are called regression coefficients.

Many models also include an error term, indicated by the Greek letter epsilon. Simply stated, this error term is meant to account for the difference between the model’s predicted value and the actual observed value for a given set of input values.

Understanding the concept of model parameters is very important for supervised learning because machine learning differs from other techniques, in that it learns model parameters automatically. It does this by estimating the optimal set of model parameters that best explains the relationship between the response variable and the independent feature variables through optimization techniques, as discussed in the first article.
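For the simple linear model, the parameters can be estimated directly from data. The sketch below uses the closed-form ordinary least squares solution, which is just one of many possible estimation techniques; the function name and the data points are made up for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Estimate the intercept (alpha) and slope (beta) by ordinary least squares."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = my - beta * mx
    return alpha, beta


# Made-up observations that roughly follow a line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
alpha, beta = fit_line(xs, ys)

# The residuals play the role of the error term for each observation.
residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
```

Nothing about the slope or intercept was specified by hand. Both were learned from the data, and the residuals capture what the fitted line does not explain.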

In regression problems, a p-value is computed for each of the estimated model parameters (regression coefficients). It indicates how likely it would be to see an estimate at least that far from zero if the feature truly had no effect on the response, so a small p-value suggests that the coefficient is statistically significant.

Features whose coefficients have a p-value greater than some chosen threshold, typically 0.05 or 0.10, are often not included in the model since there is little evidence that they help explain (predict) the response. This is one key way to perform feature selection with parametric models.

Another technique involves estimating the correlation among the features, and removing those that are redundant because they are highly correlated with others. The idea is that including only one of a pair of correlated features (the most significant) should be enough to explain the impact of both of the correlated features on the response.
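One simple way to act on correlation among features is sketched below. It assumes the Pearson correlation and a hypothetical cutoff of 0.95, and for brevity it keeps whichever feature of a correlated pair appears first rather than the most significant one. The feature names and values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def drop_correlated(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# The same height recorded in two units is redundant; age is not.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34.0, 51.0, 23.0, 45.0, 29.0],
}
selected = drop_correlated(features)
```

Here the second height column is dropped because it carries almost exactly the same information as the first, while age is kept.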

Model Selection

While the algorithm or model that you choose may not matter as much as other things discussed in this series (e.g., amount of data, feature selection, etc.), here is a list of things to take into account when choosing a model.

  • Interpretability
  • Simplicity (aka parsimony)
  • Accuracy
  • Speed (training, testing, and real-time processing)
  • Scalability

A good approach is to start with simple models and then increase model complexity as needed, and only when necessary. Generally, simplicity should be preferred unless you can achieve major accuracy gains through model selection.

Relatively simple models include simple and multiple linear regression for regression problems, and logistic and multinomial regression for classification problems.

A basic early model selection choice for supervised learning is whether to use a linear or nonlinear model. Nonlinear models best describe and predict situations where the effects on the response from certain feature values and their combinations are nonlinear. In practice, however, relationships are rarely truly linear.

Beyond basic linear models, variations in the response variable can also be due to interaction effects, which means that the response is dependent not only on certain individual features (main effects), but also on the combination of certain features (interaction effects). This combination of features is represented in a model by an interaction term, which multiplies the feature values together and scales the product by a term coefficient (e.g., βx1x2).

Once interaction terms are included, the significance of the interactions in explaining the response, and whether to include them, can be determined through the usual methods such as p-value estimation. Note that there is a concept known as the hierarchy principle, which basically says that if an interaction is included in a model, the associated main effects should also be included.
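As a small illustration, here is a sketch of a model with two main effects and their interaction. The coefficients are hypothetical values chosen for the example, not estimates from real data. In keeping with the hierarchy principle, both main effects stay in the model alongside the interaction term.

```python
def add_interaction(rows):
    """Append the product x1 * x2 to each (x1, x2) row."""
    return [(x1, x2, x1 * x2) for x1, x2 in rows]


def predict(row, intercept, b1, b2, b12):
    """Linear model with main effects (b1, b2) and an interaction effect (b12)."""
    x1, x2, x1x2 = row
    return intercept + b1 * x1 + b2 * x2 + b12 * x1x2


# Made-up feature values.
rows = [(1.0, 2.0), (3.0, 4.0)]
design = add_interaction(rows)

# Hypothetical coefficients, for illustration only.
prediction = predict(design[1], intercept=1.0, b1=0.5, b2=2.0, b12=0.1)
```

Note that the model is still linear in its parameters. The interaction is handled by adding a new column to the data, not by changing the form of the model.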

While linear assumptions are often good enough and can produce adequate results, most real life feature/response relationships are nonlinear, and sometimes nonlinear models are required to get an acceptable level of accuracy. In this case, there are a wide variety of models to choose from.

Nonlinear models can include different degree polynomials, step functions, piecewise polynomials, splines, local regression (aka LOESS models), and generalized additive models (GAM). Due to the technical nature of nonlinear modeling, familiarity with the above model approaches by name should suffice for the purpose of this series.

Other notable model choices include decision trees, support vector machines (SVM), and artificial neural networks (modeled after biological neural networks, an interconnected system of neurons). Decision trees can be highly interpretable, while the latter two are black box and very complex technical methods. Decision trees involve creating a series of splits based on logical decisions, starting from the most important top-level node. Decision trees visually look like an upside down tree.

Here is an example of a decision tree created by Stephen Milborrow, which shows survival of passengers on board the Titanic. The term ‘sibsp’ is the number of spouses or siblings aboard, and the numbers under each leaf refer to the probability of survival and the percentage of the total observations (i.e., people on board). So the upper right leaf indicates that females had a 73% chance of survival and represented 36% of those on board.

By Stephen Milborrow (Own work), CC BY-SA 3.0 or GFDL, via Wikimedia Commons
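A series of splits like this maps naturally onto nested conditionals, which is a large part of why decision trees are easy to interpret. The sketch below is a toy tree in that spirit. Its structure and thresholds are illustrative assumptions only; they are not taken from the figure, and they are not fitted to any data.

```python
def predict_survival(sex, age, sibsp):
    """Toy decision tree; splits and thresholds are illustrative, not fitted."""
    # Top-level split: the most important feature comes first.
    if sex == "female":
        return "survived"
    # Second split, reached only by males.
    if age > 10:
        return "died"
    # Third split, reached only by young males.
    if sibsp > 2:
        return "died"
    return "survived"
```

Each prediction can be explained by simply reading off the path taken through the conditions.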

The final model selection decision discussed here is whether to leverage ensemble methods for additional performance gains. These methods combine models to produce a single consensus prediction or classification, and do so through averaging or voting techniques.

Some very common ensemble methods are bagging, boosting, and random forests. Random forests are essentially bagging applied to decision trees, with the additional element of random feature subset selection.
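The two basic ways of combining model outputs, voting for classification and averaging for regression, can be sketched as follows. The predictions are made up and stand in for the outputs of three separately trained models.

```python
from collections import Counter
from statistics import mean


def majority_vote(predictions):
    """Return the most common class label among the models' predictions."""
    # On a tie, Counter returns the label that was encountered first.
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Return the mean of the models' numeric predictions."""
    return mean(predictions)


# Hypothetical outputs from three classifiers for one observation.
class_votes = ["approve", "reject", "approve"]
consensus_label = majority_vote(class_votes)

# Hypothetical outputs from three regression models for one observation.
regression_outputs = [10.0, 12.0, 14.0]
consensus_value = average(regression_outputs)
```

Bagging, boosting, and random forests differ mainly in how the individual models are trained and weighted, but they all end with a combination step along these lines.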

Model Tradeoffs

Model accuracy is determined in many ways, and will be discussed in detail later in this series. The primary measure of model accuracy comes from estimating the test error for a given model. The accuracy improvement goal of model selection is therefore to reduce the estimated test error.

It is important to note that the goal isn’t to find the absolute minimal error, but rather to find the simplest model that performs well enough. There are usually diminishing returns in trying to squeeze out the very last bit of performance. Given this, your choice of modeling approach won’t always be based on the one that results in the greatest degree of accuracy. Sometimes there are other important factors that must be taken into account as well, including interpretability, simplicity, speed, and scalability.

Often, there is a tradeoff between prediction accuracy and model interpretability, and you must choose which is more important for a given application. Artificial neural networks, support vector machines, and some ensemble methods can be used to create very accurate predictive models, but are very much a black box except to highly specialized and technical individuals.

Black box algorithms may be preferred when predictive performance is the most important goal, and it’s not necessary to explain how the model works and makes predictions. In some cases however, model interpretability is preferred, and sometimes legally mandatory.

Here is an interpretability-driven example often seen in the financial industry. Suppose a machine learning algorithm is used to accept or reject an individual’s credit card application. If the applicant is rejected and decides to file a complaint or take legal action, the financial institution will need to explain how that decision was made. While that can be nearly impossible for a neural network or SVM system, it’s relatively straightforward for decision tree-based algorithms.

In terms of training, testing, processing, and prediction speed, some algorithms and model types take more time, and require greater computing power and memory than others. In some applications, speed and scalability are critical factors, particularly in any widely used, near real-time application (e.g., eCommerce site) where a model needs to be updated fairly regularly, and that performs predictions and/or classifications at scale on the fly.

Lastly, and as previously mentioned, model simplicity (or parsimony) should always be preferred unless there is a significant and justifiable gain in performance accuracy. Simplicity usually results in quicker, more scalable, and easier to interpret models and results.


We’ve now had a solid overview of the machine learning process from selecting data and features, through selecting appropriate models for a given problem type.

Part three of this series will continue with the machine learning process, and in particular will focus on model evaluation, performance, improvement, complexity, validation, and more.

Stay tuned!
