Machine Learning for hackers: model comparison and selection

As technology entrepreneurs, we see machine learning everywhere. The science of training machines to learn from data and produce models for future predictions is widely used, and not for nothing. Simpler code combined with more advanced learning algorithms and statistical methods keeps producing better solutions to our problems.

As discussed in my post about machine learning 101 and linear regression, the problems we try to solve using machine learning fall into two main types: supervised machine learning and unsupervised machine learning. Supervised learners learn from labeled data: for example, data about house characteristics that also contains each house's price, used for house price predictions. In other words, supervised machine learning learns from labeled data-points and predicts labels for future ones.

On the other hand, unsupervised learning learns from unlabeled data and cannot predict labels for future data-points. It is commonly used for dimension reduction of the data, clustering the data, and more.

In this post, we will discuss supervised learning problems, models and methods. I assume that you're already familiar with machine learning methods like linear regression, Ridge regression and Lasso, and that you know how to train models using some of them.

This post is called machine learning for hackers in order to emphasize that developers can train models and use machine learning and make the most out of it without being professional data scientists. Although there are tons of tools and libraries out there to train machine learning models with under 10 lines of code, as a data hacker you need to be familiar with more than just training models. You need to know how to evaluate, compare and choose the best one that fits your specific dataset.

Usually, when working on a machine learning problem with a given dataset, we try different models and techniques to solve an optimization problem and fit the most accurate model, one that will neither overfit nor underfit.

When dealing with real world problems, we usually have dozens of features in our dataset. Some of them might be very descriptive, some may overlap and some might even add more noise than signal to our data.

Using prior knowledge about the industry we work in for choosing the features is great, but sometimes we need a hand from analytical tools to better choose our features and compare between the models trained using different algorithms.

My goal here is to introduce you to the most common techniques and criteria for comparing between the models you trained in order to choose the most accurate one for your problem.

In particular, we are going to see how to choose between different models that were trained with the same algorithm. Assume we have a dataset with one feature per data-point that we would like to fit using linear regression. Our goal is to choose the best polynomial degree for the model out of 8 candidate models.

The Problem (and the dataset)

We have been asked to predict house prices based on their size only. The dataset provided to us contains house sizes as well as prices of 1,200 houses in NYC. We would like to use linear regression to fit a model for predicting future house prices, where prior knowledge gives us a few assumptions about the model alternatives:

Ŷ1 = β0 + β1X
Ŷ2 = β0 + β1X + β2X²
Ŷ3 = β0 + β1X + β2X² + β3X³
Ŷ4 = β0 + β1X + β2X² + β3X³ + β4X⁴
Ŷ5 = β0 + β1X + β2X² + β3X³ + β4X⁴ + β5X⁵
Ŷ6 = β0 + β1X + β2X² + β3X³ + β4X⁴ + β5X⁵ + β6X⁶
Ŷ7 = β0 + β1X + β2X² + β3X³ + β4X⁴ + β5X⁵ + β6X⁶ + β7X⁷
Ŷ8 = β0 + β1X + β2X² + β3X³ + β4X⁴ + β5X⁵ + β6X⁶ + β7X⁷ + β8X⁸

Where X represents the house size.

Given the 8 model alternatives, we have been asked to compare between the models using some criteria and choose the polynomial degree that best suits our dataset to predict future house prices.

As described in my previous post, complex models tend to overfit. Therefore, we need to be careful when choosing our model so it’ll provide us with good predictions not only for the current dataset but also for future data-points.

What is a train/test split and why we need it

When dealing with real world machine learning problems, our dataset is limited in its size. With a relatively small dataset, we want to train our model as well as evaluate its accuracy. This is where a train/test split comes in handy.

A train/test split is a method for splitting our dataset into two groups: a training group of data-points that will be used to train the model, and a testing group that will be used to test it. We usually split unequally, because training the model usually requires as many data-points as possible.

The common splits are 70/30 or 80/20 for train/test.
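A minimal sketch of such a split in plain Python (the 80/20 ratio, the fixed seed, and the stand-in data are just for illustration):

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Randomly split rows into a training group and a testing group."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = rows[:]          # copy, so the original order is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]  # train, test

# 1,200 stand-in rows, like our (house size, house price) dataset
houses = list(range(1200))
train, test = train_test_split(houses, test_ratio=0.2)
print(len(train), len(test))  # 960 240
```

In practice a library helper (such as scikit-learn's `train_test_split`) does the same bookkeeping for you.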

How NOT to compare the models

The most basic metric for evaluating trained machine learning models is the MSE. MSE stands for mean squared error and is given by the average of the squares of the errors. In other words, the MSE is the average squared difference between the predicted values and the real values, so we would like to minimize it when training models:

MSE = Σ(Ŷi − Yi)² / n

where n is the number of data-points.
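The definition translates directly into code; here is a quick sanity check with made-up numbers:

```python
def mse(y_pred, y_true):
    """Mean squared error: average of the squared prediction errors."""
    n = len(y_true)
    return sum((yp - yt) ** 2 for yp, yt in zip(y_pred, y_true)) / n

# Toy example: two predictions, each off by 1 -> MSE = (1 + 1) / 2 = 1.0
print(mse([3.0, 1.0], [2.0, 0.0]))  # 1.0
```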

The MSE should be used with caution, because it can be calculated either on the training data-points or on the testing data-points. If you haven't guessed by now, the correct way to use the MSE to evaluate your model is to train the model on the training dataset and calculate the MSE on the testing dataset.

Without a train/test split of our data, we will be forced to calculate the MSE on the same dataset we trained the model on. This scenario leads straight to overfitting. Why?

Assuming we haven’t split the dataset into train and test, trained 8 models (as described above) and calculated the MSEtrain for each of the models. Which model will provide us with the lowest MSEtrain? Most likely that model #8 because it’s the most complex one the overfit the data rather than learn the data.

Because we train and test the model with the exact same dataset, the MSEtrain will be lower as we use more complex models that will fit the training data better (Don’t forget that the optimization problem we try to solve is to minimize the errors between the predictions and the ground truth).
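We can see this effect in a quick simulation: fitting polynomials of increasing degree to the same noisy linear data (synthetic data, assumed only for illustration), the training MSE can only go down as the degree grows, because each higher-degree model contains the lower-degree one as a special case:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 2 * x + 1 + rng.normal(0, 0.1, 50)   # linear signal plus noise

train_mse = {}
for degree in (1, 3, 8):
    coeffs = np.polyfit(x, y, degree)     # least-squares polynomial fit
    preds = np.polyval(coeffs, x)
    train_mse[degree] = np.mean((preds - y) ** 2)

# Nested models: a higher degree can always match a lower one,
# so the training error never increases with complexity.
print(train_mse[1] >= train_mse[3] >= train_mse[8])  # True
```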


So we've learned that we should calculate the MSE on the test set after splitting our data. But there are more advanced criteria for evaluating our models (based on the MSE) which we usually use instead of the MSE alone.

Which criteria we can use

After realizing why we need to split our data into train and test, and what the MSE means, we will cover 3 main criteria for comparing our 8 different models. These criteria know how to handle overfit and how to choose the best model for our dataset.

#1: Mallows’s Cp

Cp is a statistical criterion suggested by Mallows in 1973 to correct the bias of the training error as an estimate of the test error. Assuming we have a very small dataset, such that splitting it into train and test does not make sense, we can use Cp to estimate MSEtest from the MSEtrain calculated on the training dataset.

The Cp criterion, an estimator for MSEtest, is given by:

Cp = MSEtrain + 2σ²P/n

Where σ² is the error variance based on the full model (model #8), and P is the number of predictors.

In order to use Mallows’s Cp to compare between our models, we need to train each model on the full dataset, calculate Mallows’s Cp estimator for each of the trained models, and choose the model with the lowest Cp result.


As we can see, MSEtrain decreases as the polynomial degree increases (more complex models), so it cannot indicate which model we should choose. MSEtest and Mallows's Cp, however, both choose model #3 as the best model for our dataset.

Note: Mallows’s Cp wasn’t developed to evaluate models that are not trained using linear regression. Therefore, you must not use it with models trained using other machine learning algorithms.

#2: BIC

We’ve already mentioned that when fitting models, adding parameters and making the model more complex can result in overfitting. BIC is a statistical criterion attempt to resolve this problem by introducing a penalty term for the number of parameters in the model.

BIC, which stands for Bayesian Information Criterion, assumes that there is a correct model among the suggested models and its goal is to choose it.

The mathematical form is very similar to Mallows’s Cp and is given by:

BIC = MSEtrain + log(n)σ²P/n

The model with the lowest BIC is preferred.
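A sketch with the same made-up quantities as before; note that with n = 1200 the penalty factor log(n) ≈ 7.09 is larger than Cp's factor of 2, so BIC punishes extra parameters harder:

```python
import math

def bic(mse_train, sigma2, p, n):
    """BIC = MSE_train + log(n) * sigma^2 * P / n."""
    return mse_train + math.log(n) * sigma2 * p / n

# Made-up training MSE per polynomial degree, sigma^2 = 3.0, n = 1200
mse_train = {1: 9.0, 2: 5.0, 3: 3.0, 4: 2.999, 8: 2.995}
scores = {p: bic(m, 3.0, p, 1200) for p, m in mse_train.items()}
best = min(scores, key=scores.get)
print(best)  # 3
```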


Leaving the values of MSEtrain aside, all other criteria unanimously choose model #3 as the best fit for our data.

Note: when we are not sure whether one of the suggested models is correct, BIC can behave in unexpected ways. Therefore, in real world problems, we should use it with caution.

#3: Cross Validation – probably the most popular criteria

Dealing with machine learning problems requires a good understanding of cross validation (or CV). Cross validation is used in many different ways in machine learning, all of them related to the comparison and selection of parameters and models.

Cross validation is basically an extension of the train/test split methodology. Its advantage, though, is that it randomly divides the dataset multiple times, and each time it trains and tests the model on a slightly different dataset.

By doing that, we make sure that we don’t evaluate the model’s error based on outliers or data that doesn’t represent the signal properly. We then average the MSEtest for each split to evaluate the model based on multiple train/test splits:

CV(n) = Σ MSEi,test / n

The preferred model will be the one with the lowest CV(n). There is a critical point to understand here: comparing models using cross validation involves a nested iteration. For each model, we split the dataset randomly multiple times, calculate MSEi,test for each split and then average them into a CV indicator. We thus end up with one CV indicator per model, and based on it we choose the preferred model.

There are two main implementations for cross validation splits:

  1. Leave one out cross validation
  2. K-fold cross validation (the most popular)

Leave one out CV iterates over the dataset and takes out one data-point per iteration that will not be included in the training set but rather will be used to test the model’s performance.


K-fold CV gets a K parameter as an input, splits the dataset into K parts, and iterates over the parts; each iteration leaves the kth part out of training and uses it as the testing set.

K-fold CV

Choosing the K parameter, which is the number of folds, can sometimes be tricky because it affects the bias-variance tradeoff on our data. A rule of thumb is to choose either 5 or 10 (depending on the size of the dataset).
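The fold bookkeeping behind K-fold CV can be sketched in pure Python (the model-fitting step is left abstract; in practice you would also shuffle the indices before slicing):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n))
    fold_size, remainder = divmod(n, k)
    start = 0
    for fold in range(k):
        # Spread any leftover points across the first folds
        size = fold_size + (1 if fold < remainder else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test

# 10 data-points, 5 folds: each fold holds out 2 points for testing
splits = list(k_fold_splits(10, 5))
print(len(splits), len(splits[0][1]))  # 5 2
```

The CV(n) indicator is then the average of MSEi,test over these folds.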

Cross validation is a wonderful tool used all over in machine learning and statistics. In the following chart, we can see how the different indicators estimate the models’ errors.


Cross validation, unlike Cp or BIC, works well for most of the machine learning algorithms and doesn’t assume that there is one correct model.

I encourage you to try and plot each of these estimators in your next machine learning project when you’re asked to compare between models to figure out what works best for your specific situation.

In real world problems, we are usually asked to compare between models with more than one feature, and the comparison can be done in exactly the same way we covered in this post.

I also encourage you to check out the Lasso method. Lasso has a built-in regularization mechanism that removes unnecessary features from our models, which is what we usually want to achieve when comparing different models.

Practical machine learning: Ridge regression vs. Lasso

For many years, programmers have tried to solve extremely complex computer science problems using traditional algorithms which are based on the most basic condition statement: if this then that. For example, if the email contains the word “free!” it should be classified as spam.

In recent years, with the rise of exceptional cloud computing technologies, the machine learning approach for solving complex problems has been magnificently accelerated. Machine learning is the science of providing computers the ability to learn and solve problems without being explicitly programmed. Sounds like black magic? Maybe. In this post, I will introduce you to problems which can be solved using machine learning, as well as practical machine learning solutions for solving them.

Exactly like humans learn on a daily basis, in order to let a machine learn, you need to provide it with enough data. Once it has processed the data, it can make predictions about the future. Suppose you want to classify emails by whether they are spam or not. In order to solve this problem using machine learning, you need to provide the machine with many labeled emails – emails already classified into the correct classes of spam vs. not spam. The classifier will iterate over the samples and learn which features define a spam email. Assuming you trained the machine learning model right, it will be able to predict whether a future email should be classified as spam or not with high accuracy. In many cases, you won't be able to completely understand how the model predicts the class.

Machine learning hierarchy

The world of machine learning can be divided into two types of problems: supervised learning and unsupervised learning. In this post, we will focus only on supervised learning, the subset of problems that contain labeled data (that is, every email is labeled as spam or not spam). For cases where you have unlabeled data, unsupervised learning might be a proper solution.

Underneath the supervised learning problems, there is another division of regression problems vs. classification problems. In regression problems, the value you wish to predict is continuous. For example, house price. In classification problems, on the other hand, the value you are about to predict is discrete, like spam vs. not spam.

The data you need to provide in order to train your model depends on the problem and the value you wish to predict. Let’s assume you want to predict a house price based on different properties. So in this case, each row in your dataset should (for example) consist of:

  1. features: house size, the number of rooms, floor, whether an elevator exists, etc.
  2. label: house price.

Choosing and collecting the features that best describe a house for predicting its price can be challenging. It requires market knowledge as well as access to big data sources. The features are the keys upon which the prediction of the house price will be based.

Machine learning as an optimization problem

Every machine learning problem is basically an optimization problem. That is, you wish to find either a maximum or a minimum of a specific function. The function that you want to optimize is usually called the loss function (or cost function). The loss function is defined for each machine learning algorithm you use, and this is the main metric for evaluating the accuracy of your trained model.

For the house price prediction example, after the model is trained, we are able to predict new house prices based on their features. For each house price we predict, denoted Ŷi, and the actual house price Yi, we can calculate the loss:

li = (Ŷi − Yi)²

This is the most basic form of a loss for a specific data-point, and it is mostly used in linear regression algorithms. The loss function as a whole can be denoted as:

L = Σ(Ŷi − Yi)²

This simply says that our model's loss is the sum of squared distances between the house prices we've predicted and the ground truth. This loss function, in particular, is called quadratic loss or least squares. We wish to minimize the loss function (L) as much as possible so the predictions will be as close as possible to the ground truth.
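The quadratic loss, computed directly from the definition (the prediction and price numbers are made up):

```python
def quadratic_loss(y_pred, y_true):
    """Least-squares loss: sum of squared prediction errors."""
    return sum((yp - yt) ** 2 for yp, yt in zip(y_pred, y_true))

# Made-up predictions vs. actual house prices (in thousands):
# errors are -5 and +10, so the loss is 25 + 100 = 125
print(quadratic_loss([100.0, 210.0], [105.0, 200.0]))  # 125.0
```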

If you followed me up until now, you are familiar with the basic concept of every practical machine learning problem. Remember, every machine learning algorithm defines its own loss function according to its goal in life.

Linear regression

Linear regression is a basic yet super powerful machine learning algorithm. As you gain more and more experience with machine learning, you'll notice how simple is better than complex most of the time. Linear regression is widely used in different supervised machine learning problems, and as you may have guessed already, it focuses on regression problems (where the value we wish to predict is continuous). It is extremely important to have a good understanding of linear regression before studying more complex learning methods. Many extensions have been developed for linear regression, which I will introduce later in this post.

The most basic form of linear regression deals with a dataset of a single feature per data-point (think of it as the house size). Because we are dealing with supervised learning, each row (house) in the dataset should include the price of the house (which is the value we wish to predict).

An example of our dataset:

House size (X) | House price (Y)

In a visual representation:

In linear regression we wish to fit a function (model) in this form:

Ŷ = β0+β1X

Where X is the vector of features (the first column in the table above), and β0, β1 are the coefficients we wish to learn.

By learning the parameters I mean executing an iterative process that updates β at every step by reducing the loss function as much as possible. Once we reach the minimum point of the loss function we can say that we completed the iterative process and learned the parameters.

Just to make it even more clear: the combination of the β coefficients is our trained model – which means that we have a solution to the problem!

After executing the iterative process, we can visualize the solution on the same graph:


Where the trained model is:

Ŷ = -0.5243+1.987X

Now let’s assume we want to predict based on our trained model, what will be the price of a house of size 85. In order to predict the price, we will substitute the β values we found into the model function, including the house size, and get the predicted house price:

Ŷ(X=85) = −0.5243 + 1.987·85 ≈ 168.37

To recap what we’ve covered so far:

  1. Every machine learning problem is basically an optimization problem. That is, we want to minimize (or maximize) some function.
  2. Our dataset consists of features (X) and a label (Y). In our case – house size is the single feature, house price is the label.
  3. In linear regression problems, we want to minimize the quadratic loss, which is the sum of squared distances between the predictions and the actual values (ground truth).
  4. In order to minimize the loss function and find the optimal β coefficients, we will execute an iterative process.
  5. To predict the label (house price) of a new house based on its size, we will use the trained model.

The iterative process for minimizing the loss function (a.k.a learning the coefficients β), will be discussed in another post. Although it can be done with one line of code, I highly recommend reading more about iterative algorithms for minimizing loss functions like Gradient Descent.
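As a first taste of that iterative process, here is a bare-bones gradient descent for the single-feature case. The learning rate and step count are arbitrary choices, and the toy data follows Y = 2X + 1 exactly, so we know the coefficients the process should recover:

```python
def fit_linear_gd(xs, ys, lr=0.01, steps=5000):
    """Learn beta0, beta1 by gradient descent on the mean squared loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of (1/n) * sum((b0 + b1*x - y)^2) w.r.t. b0 and b1
        g0 = sum(2 * (b0 + b1 * x - y) for x, y in zip(xs, ys)) / n
        g1 = sum(2 * (b0 + b1 * x - y) * x for x, y in zip(xs, ys)) / n
        b0 -= lr * g0    # step against the gradient, scaled by the
        b1 -= lr * g1    # learning rate, to reduce the loss
    return b0, b1

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2 * x + 1 for x in xs]        # noiseless line: beta0=1, beta1=2
b0, b1 = fit_linear_gd(xs, ys)
print(round(b0, 2), round(b1, 2))  # 1.0 2.0
```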

Linear regression with multiple features

In real world problems, you usually have more than one feature per row (house). Let’s see how linear regression can help us with multi-feature problems.

Considering this dataset:

House size (X1) | rooms (X2) | floor (X3) | House price (Y)

So currently we have 3 features:

  1. house size
  2. number of rooms
  3. floor

Therefore, we need to adapt our basic linear model to an extended one that can take into account the additional features for each house:

Ŷ = β0+β1X1+β2X2+β3X3

In order to solve the multi-feature linear regression problem, we will use the same iterative algorithm and minimize the loss function. The main difference is that we will end up with four β coefficients instead of only two.
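A sketch of the multi-feature fit using numpy's least-squares solver rather than the iterative process. The house rows and coefficients below are invented: the labels are generated from known β values so we can check that the fit recovers them:

```python
import numpy as np

# Invented rows: [house size, rooms, floor]
X = np.array([
    [90.0, 3.0, 1.0],
    [60.0, 2.0, 4.0],
    [120.0, 4.0, 2.0],
    [45.0, 1.0, 5.0],
    [150.0, 5.0, 3.0],
    [75.0, 3.0, 2.0],
])
true_beta = np.array([10.0, 2.0, 15.0, -1.0])  # beta0..beta3
A = np.column_stack([np.ones(len(X)), X])      # prepend intercept column
y = A @ true_beta                              # noiseless labels

beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes the quadratic loss
print(np.allclose(beta, true_beta))  # True
```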

Overfit in machine learning algorithms

Having more features may seem like a perfect way of improving the accuracy of our trained model (reducing the loss) – because the trained model will be more flexible and will take into account more parameters. On the other hand, we need to be extremely careful about overfitting the data. As we know, every dataset has noisy samples. For example, the house size wasn't measured accurately or the price is not up to date. These inaccuracies can lead to a low-quality model if not trained carefully. The model might end up memorizing the noise instead of learning the trend of the data.

A visual example of a nonlinear overfitted model:

Overfit can happen in linear models as well when dealing with multiple features. If not filtered and explored up front, some features can be more destructive than helpful, repeating information that is already expressed by other features and adding high noise to the dataset.

Overcoming overfit using regularization

Because overfit is an extremely common issue in many machine learning problems, there are different approaches to solving it. The main concept behind avoiding overfit is simplifying the models as much as possible. Simple models do not (usually) overfit. On the other hand, we need to pay attention to the gentle trade-off between overfitting and underfitting a model.

One of the most common mechanisms for avoiding overfit is called regularization. A regularized machine learning model is a model whose loss function contains an additional element that should be minimized as well. Let's see an example:

L = Σ(Ŷi − Yi)² + λΣβj²

This loss function includes two elements. The first one is the one you've seen before – the sum of squared distances between each prediction and its ground truth. The second element, a.k.a the regularization term, might seem a bit bizarre: it sums the squared β values and multiplies that by another parameter, λ. The reason for doing this is to "punish" the loss function for high values of the coefficients β. As mentioned, simple models are better than complex models and usually do not overfit. Therefore, we need to try and simplify the model as much as possible. Remember that the goal of the iterative process is to minimize the loss function. By punishing the β values we add a constraint to minimize them as much as possible.

There is a gentle trade-off between fitting the model, but not overfitting it. This approach is called Ridge regression.

Ridge regression

Ridge regression is an extension for linear regression. It’s basically a regularized linear regression model. The λ parameter is a scalar that should be learned as well, using a method called cross validation that will be discussed in another post.

A super important fact to notice about ridge regression is that it pushes the β coefficients toward lower values, but it does not force them to be zero. That is, it will not get rid of irrelevant features but rather minimize their impact on the trained model.
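Although λ is usually tuned with cross validation and the fitting done by a library, the ridge solution can be sketched in closed form. A common convention, assumed here, is not to penalize the intercept; the synthetic data below exists only to show the shrinkage effect:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (A^T A + lam * I) beta = A^T y, with the intercept unpenalized."""
    A = np.column_stack([np.ones(len(X)), X])
    penalty = lam * np.eye(A.shape[1])
    penalty[0, 0] = 0.0                 # leave beta0 alone
    return np.linalg.solve(A.T @ A + penalty, A.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)

small = ridge_fit(X, y, lam=0.01)
large = ridge_fit(X, y, lam=100.0)
# A bigger penalty shrinks the slope coefficients toward zero,
# but never exactly to zero.
print(np.linalg.norm(large[1:]) < np.linalg.norm(small[1:]))  # True
```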

Lasso method

Lasso is another extension built on regularized linear regression, but with a small twist. The loss function of Lasso is in the form:

L = Σ(Ŷi − Yi)² + λΣ|βj|

The only difference from Ridge regression is that the regularization term uses absolute values instead of squares. But this difference has a huge impact on the trade-off we've discussed before. The Lasso method overcomes the disadvantage of Ridge regression by not only punishing high values of the coefficients β but actually setting them to zero if they are not relevant. Therefore, you might end up with fewer features in the model than you started with, which is a huge advantage.
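One way to see why the absolute-value penalty produces exact zeros: for a single standardized feature, the lasso coefficient is the least-squares coefficient passed through a soft-thresholding function (a standard textbook reduction, sketched here rather than a full lasso solver):

```python
def soft_threshold(b_ols, lam):
    """Lasso coefficient for one standardized feature, given its OLS value."""
    if b_ols > lam:
        return b_ols - lam    # large effect survives, shrunk by lam
    if b_ols < -lam:
        return b_ols + lam
    return 0.0                # small coefficients are zeroed out entirely

print(soft_threshold(2.0, 0.5))  # 1.5
print(soft_threshold(0.3, 0.5))  # 0.0  (weak feature removed from the model)
```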


Machine learning is getting more and more practical and powerful. With zero knowledge in programming, you can train a model to predict house prices in no time.

We’ve covered the basics of machine learning, loss function, linear regression, ridge and lasso extensions.

There is more math involved than what I've covered in this post; I tried to keep it as practical and, at the same time, as high-level as possible (someone said trade-off?).

I encourage you to take a deep dive into this amazing world.