## Machine Learning for hackers: model comparison and selection

As technological entrepreneurs, machine learning is all over us. The science of training machines to learn and produce models for future predictions is widely used, and not for nothing. Less complicated code and more advanced learning algorithms and statistical methods are introduced for better solutions of our problems.

As broadly discussed in my post about machine learning 101 and linear regression, the problems that we try to solve using machine learning can be divided into two main types: supervised machine learning vs. unsupervised machine learning. Supervised learners learn from labeled data, that is, for example, data about house characteristics which contains also house price, for house price predictions. In other words, supervised machine learning learns labeled data-points and predicts labels for future ones.

On the other hand, unsupervised learning learns from unlabeled data and cannot predict labels for future data-points. It is commonly used for dimension reduction of the data, clustering the data, and more.

In this post, we will discuss supervised learning related problems, models and methods. I assume that you’re already familiar with some machine learning methods like linear regression, Ridge regression and Lasso by knowing how to train models using some of these methods.

This post is called machine learning for hackers in order to emphasize that developers can train models and use machine learning and make the most out of it without being professional data scientists. Although there are tons of tools and libraries out there to train machine learning models with under 10 lines of code, as a data hacker you need to be familiar with more than just training models. You need to know how to evaluate, compare and choose the best one that fits your specific dataset.

Usually, when working on a machine learning problem with a given dataset, we try different models and techniques to solve an optimization problem and fit the most accurate model, that will neither overfit nor underfit.

When dealing with real world problems, we usually have dozens of features in our dataset. Some of them might be very descriptive, some may overlap and some might even add more noise than signal to our data.

Using prior knowledge about the industry we work in for choosing the features is great, but sometimes we need a hand from analytical tools to better choose our features and compare between the models trained using different algorithms.

My goal here is to introduce you to the most common techniques and criteria for comparing between the models you trained in order to choose the most accurate one for your problem.

In particular, we are going to see how to choose between different models that were trained with the same algorithm. Assuming we have a dataset of 1 feature per data-point that we would like to fit using linear regression. Our goal is to choose the best polynomial degree for fitting the model out of 8 different assumptions.

**The Problem (and the dataset)**

We have been asked to predict house prices based on their size only. The dataset that was provided us contains house sizes as well as prices of 1,200 houses in NYC. We would like to try and use linear regression to fit a model for predicting future house prices when prior knowledge has been given to us about a few assumptions for model alternatives:

Ŷ_{1}= β_{0}+β_{1}X Ŷ_{2}= β_{0}+β_{1}X+β_{1}X^{2}Ŷ_{3}= β_{0}+β_{1}X+β_{2}X^{2}+β_{3}X^{3}Ŷ_{4}= β_{0}+β_{1}X+β_{2}X^{2}+β_{3}X^{3}+β_{4}X^{4}Ŷ_{5}= β_{0}+β_{1}X+β_{2}X^{2}+β_{3}X^{3}+β_{4}X^{4}+β_{5}X^{5}Ŷ_{6}= β_{0}+β_{1}X+β_{2}X^{2}+β_{3}X^{3}+β_{4}X^{4}+β_{5}X^{5}+β_{6}X^{6}Ŷ_{7}= β_{0}+β_{1}X+β_{2}X^{2}+β_{3}X^{3}+β_{4}X^{4}+β_{5}X^{5}+β_{6}X^{6}+β_{7}X^{7}Ŷ_{8}= β_{0}+β_{1}X+β_{2}X^{2}+β_{3}X^{3}+β_{4}X^{4}+β_{5}X^{5}+β_{6}X^{6}+β_{7}X^{7}+β_{8}X^{8}

Where X represents the house size.

Given the 8 model alternatives, we have been asked to compare between the models using some criteria and choose the polynomial degree that best suits our dataset to predict future house prices.

As described in my previous post, complex models tend to overfit. Therefore, we need to be careful when choosing our model so it’ll provide us with good predictions not only for the current dataset but also for future data-points.

**What is a train/test split and why we need it**

When dealing with real world machine learning problems, our dataset is limited in its size. With a relatively small dataset, we want to train our model as well as to evaluate the accuracy of it. This is where train/test split comes handy.

A train/test split is a method for splitting our dataset into two groups, a training group of data-points that will be used to train the model, and a testing group that will be used to test it. We usually tend to split it inequality because training the model usually requires as much data-points as possible.

The common splits are 70/30 or 80/20 for train/test.

**How NOT to compare the models**

The most basic metric for evaluating trained machine learning models is the MSE. MSE stands for mean squared error and is given by the average of the squares of the errors. In other words, the MSE is the difference between the predicted value and the real value so we would like to minimize it when training models:

MSE = ∑( Ŷ_{i}- Y_{i})^{2}/n

where n is the number of data-points.

The MSE **should be used with caution.** The reason for that is that the MSE can be calculated either on the training data-points or on the testing data-points. If you haven’t guessed by now, the correct way for using the MSE to evaluate your model is training our model using our training dataset and calculating the MSE using our **testing** dataset.

Without a train/test split of our data, we will be forced to calculate the MSE on the same dataset we trained the model with. This scenario will cause an immediate overfit. Why?

Assuming we haven’t split the dataset into train and test, trained 8 models (as described above) and calculated the MSE_{train} for each of the models. Which model will provide us with the lowest MSE_{train}? Most likely that model #8 because it’s the most complex one the overfit the data rather than learn the data.

Because we train and test the model with the exact same dataset, the MSE_{train} will be lower as we use more complex models that will fit the training data better (Don’t forget that the optimization problem we try to solve is to minimize the errors between the predictions and the ground truth).

So we learned that we better use the MSE for testing our dataset after we split it. But there are more advanced criteria for evaluating our models (that are based on the MSE) which we usually use instead of just the MSE alone.

**Which criteria we can use**

After realizing why we need to split our data into train and test, and what the MSE means, we will cover 3 main criteria for comparing our 8 different models. These criteria know how to handle overfit and how to choose the best model for our dataset.

**#1: Mallows’s Cp**

Cp is a statistical method suggested by Mallows in 1973 to compute the expectation of the bias. Assuming we have a very small dataset such that splitting it into train and test does not make sense, we can use Cp in order to estimate the MSE_{test} using the MSE_{t}_{rain} calculated on the training dataset.

The Cp criteria or the estimator for MSE_{test} is given by:

Cp = MSE_{train}+ 2σ^{2}P/n

Where σ^{2} is the error variance based on the full model (model number #8), and P is the number of predictors.

In order to use Mallows’s Cp to compare between our models, we need to train each model on the full dataset, calculate Mallows’s Cp estimator for each of the trained models, and choose the model with the lowest Cp result.

As we can see, while MSE_{train} decreases as the polynomial degree increases (more complex model) which cannot indicate on the model we should choose, MSE_{test} and Mallows’s Cp criteria both choose model #3 to be the best model based on our dataset.

**Note:** Mallows’s Cp wasn’t developed to evaluate models that are not trained using linear regression. Therefore, you must not use it with models trained using other machine learning algorithms.

**#2: BIC**

We’ve already mentioned that when fitting models, adding parameters and making the model more complex can result in overfitting. BIC is a statistical criterion attempt to resolve this problem by introducing a penalty term for the number of parameters in the model.

BIC, which stands for Bayesian Information Criterion, assumes that there is a correct model among the suggested models and its goal is to choose it.

The mathematical form is very similar to Mallows’s Cp and is given by:

BIC = MSE_{train}+ log(n)σ^{2}P/n

The model with the lowest BIC is preferred.

Leaving the values of MSE_{train} aside, all other criteria choose model #3 to best fit out data unanimously.

**Note:** when there we are not sure about whether one of the suggested models is correct, BIC can behave in an unexpected way. Therefore, in real world problem, we should use it with caution.

**#3: Cross Validation – probably the most popular criteria **

Dealing with machine learning problems requires a good understanding of cross validation (or CV). Cross validation is used in many different ways in machine learning when they are all related to comparison and selection of parameters and models.

Cross validation is basically an extension of a train/test split methodology. The advantage of it though, is that it randomly divides the dataset multiple times, and in each time it trains the tests the model on a slightly different dataset.

By doing that, we make sure that we don’t evaluate the model’s error based on outliers or data that doesn’t represent the signal properly. We then average the MSE_{test} for each split to evaluate the model based on multiple train/test splits:

CV_{(n)}= ∑MSE_{i,test}/n

The preferred model will be the one with the lowest CV_{(n)}. There is a critical point to understand here – there is a nested iteration when comparing models using cross validation – For each model, we split the dataset randomly, calculate the MSE_{i,test} and then average them into a CV indicator. Therefore, we come up with a CV indicator for each model, in which based on it we choose the preferred model.

There are two main implementations for cross validation splits:

- Leave one out cross validation
- K-fold cross validation (the most popular)

**Leave one out CV** iterates over the dataset and takes out one data-point per iteration that will not be included in the training set but rather will be used to test the model’s performance.

**K-fold CV** gets a K parameter as an input, splits the dataset into K parts, iterates over the parts and for each iteration leaves the kth part out of training and use it as a testing set.

Choosing the K parameter which is the number of folds can sometimes be tricky because it affects the bias-variance tradeoff on our data. A rule of thumb will be to choose either 5 or 10 (depends on the size of the dataset).

Cross validation is a wonderful tool used all over in machine learning and statistics. In the following chart, we can see how the different indicators estimate the models’ errors.

Cross validation, unlike Cp or BIC, works well for most of the machine learning algorithms and doesn’t assume that there is one correct model.

I encourage you to try and plot each of these estimators in your next machine learning project when you’re asked to compare between models to figure out what works best for your specific situation.

In real world problems, we are usually asked to compare between models with more than one feature and the comparison can be done exactly the same way we covered in this post.

I encourage you also to check out Lasso method. Lasso has a built in mechanism of regularization that removes unnecessary features from our models, which is what we usually want to achieve when comparing different models.