I analyzed Facebook data to decide when to stream on Facebook Live. Here’s how.

Streaming on Facebook Live can be a powerful marketing strategy for startups and businesses to share knowledge, provide value, get exposure and collect high-quality leads. If you prepare your Facebook Live session upfront by researching your target audience and building a detailed agenda, the session can boost your business dramatically.

As chief of product and technology at my previous startup, which dealt with fraud detection, I decided to try Facebook Live as a new marketing strategy while it was still fairly new. Back then, once a new Facebook Live session went up, relevant people got Facebook notifications inviting them to join the session, which increased the exposure even more.

There are many posts about how to run a better Facebook Live session in terms of topics to cover, building an agenda, camera angles, session duration, etc. But there is one piece of the puzzle that business owners and marketers often forget or overlook – when is the best time to stream your Facebook Live session?

Facebook live

You can answer this question with an educated guess based on your familiarity with the target audience. For example:

  • Pregnant moms are ready to consume your Live session on Monday afternoon.
  • Teenagers at the ages of 18-22 are in the right mindset on Saturday morning.

But nowadays, when there is so much data around us that we can use with a few clicks of a button, you actually fall behind if you don’t make proper use of the data available.

Almost every marketing platform or social network exposes API services that you, as a technological entrepreneur, can easily consume. This data, when analyzed properly, can yield valuable conclusions that drive your business objectives well beyond your competitors’.

This approach is often called making data-driven decisions.

Once you start justifying any (or at least most) of your business decisions using data you own or data you can collect from different resources, you actually stop guessing and start making data-driven decisions.

I like to think of data-driven decisions as crowd-sourcing. If you’ve had a chance to watch this TED talk by Lior Zoref, where Lior invited an ox on stage and asked the audience to guess its weight, you were probably overwhelmed by how close the crowd’s average was to the real weight of the ox: 1,792 lbs vs. 1,795 lbs!

Ox weight

When you’re making guesses regarding your business objectives as an individual, you’re no different from any individual sitting in the crowd and trying to estimate the ox’s weight. You could even be the one who guessed 300 lbs or 8,000 lbs, which will probably cost your business a lot of unnecessary expenses.

But, if you’re using the wisdom of the crowd to make data-driven decisions, which you can do in almost any decision you make online, you’ll most likely be ahead of every other individual, or in business terms, ahead of your competitors.

Although I’m not a pure marketer, with basic data analysis skills, I can push my business forward dramatically in all aspects, including marketing.

In this post, I’m going to walk you through a practical, step-by-step guide on how to access Facebook data, analyze it, and derive conclusions about the optimal time to broadcast on Facebook Live.

In order to follow this guide you need:

  • A Facebook account
  • A Facebook group you would like to analyze (if it’s a private one, you need to be a group member)
  • Python 2.7 installed
  • Jupyter notebook installed
  • Facebook graph API Python library installed

A Jupyter notebook is a highly recommended tool for data analysis in Python. It has many advantages, but above all, it enables you to run snippets of code and keep the results in memory, instead of running all of your scripts over and over again every time you make a minor change. This is crucial when doing data analysis because some tasks can take a long time to execute.

Although it’s not essential, I always recommend working inside a Python virtual environment. Here is a post I wrote about the advantages of a virtual environment when using Python.

Finally, I highly recommend working in an Ubuntu environment when doing data analysis with Jupyter notebooks.

Step 1: Getting the Facebook group ID

In order to get data from the Facebook API, we need to specify the ID of the entity we want to get data from; in our case, a Facebook group.

Lookup-id.com is a nice tool you can use to easily find the ID of a group based on its URL. Copy the URL of your group and paste it in the search bar.

lookup-id

In this post, we will use the group: Web Design and Development.

ID: 319479604815804

Step 2: Getting to know the Graph API Explorer

Besides the documentation, Facebook has developed a playground for developers called the Graph API Explorer to help you get the most out of the Facebook API.

The Graph API Explorer enables us to easily get a temporary access token and start examining the capabilities that Facebook API has to offer.

Click on Get Token, and then on Get User Access Token. Don’t select any permissions; just click Get Access Token.

Get access token

The Facebook API has many endpoints you can use, but in this guide we are going to use two main ones:

  • The group feed endpoint ({group-id}/feed), which returns the posts published in the group
  • The reactions endpoint ({post-id}/reactions), which returns the reactions a specific post received

In order to figure out the structure of the response you can expect from a specific endpoint, you just need to type the endpoint URL and click Submit.

Let’s examine the URL endpoint for grabbing the last posts from the group’s feed. Type this URL in the Graph API Explorer:

319479604815804/feed

and hit Submit.

feed_endpoint

You should now see the last posts from the group’s feed in a JSON structure, containing each post’s content, its id and its updated time. Take one of the ids and append /reactions?summary=total_count to it, for example:

319479604815804_1468216989942054/reactions?summary=total_count

You should see a list of the reactions for the specific post, and a summary of the total count of reactions.

This way you can play around with all of the features the Facebook API has to offer.

Another wonderful tool for examining API endpoints of APIs which don’t offer a playground is Postman. You can read more about this tool, as well as essential tools for web developers here.

Step 3: Our plan and assumptions

Our goal is to find the optimal time interval for a Facebook Live session in the chosen group that contains our target audience. In order to do that, we assume that the more activity there is in the group at a specific time, the more likely our Facebook Live session is to gain traction.

So our goal now is to figure out when there is a peak in the group’s activity over time. And by when I mean a specific weekday and time.

In order to do that, we are going to grab the last 5000 posts from the group’s feed and plot the distribution of the times they were updated on.

We assume that longer posts indicate more activity in the group, because members spend more time writing them. Therefore, our next step will be to take the length of each post into consideration in the distribution.

A reaction on Facebook is probably a great indication of people engaging with a specific post. Therefore, our last step will be to collect the total number of reactions for each post and take it into account in the distribution of activity over weekdays and hours.

Because reactions usually arrive some time after the post is published, we should be cautious with this part of the analysis.

Step 4: Let’s analyze some data!

In order to start a Jupyter notebook, execute:

ipython notebook

and then choose New → Python 2.

new_notebook

In order to analyze and plot the data, we are going to use numpy and matplotlib libraries. These are very popular Python libraries you should use in order to better analyze your data.

Let’s import all the libraries we need:

import matplotlib.pyplot as plt
import numpy as np
import facebook
import urlparse
import datetime
import requests

and specify our access token and group id:

ACCESS_TOKEN = 'INSERT_ACCESS_TOKEN_HERE'
GROUP_ID = '319479604815804' # Web Design and Development group

Then, let’s initialize the API object with our access token:

graph = facebook.GraphAPI(ACCESS_TOKEN)

Now we want to grab the posts from the group’s feed. In order to avoid errors during the API calls, we will limit each API call to 50 posts and iterate over 100 API calls:

posts = []
base_url = "{}/feed?limit=50".format(GROUP_ID)
until = None
for i in xrange(100):
    # Rebuild the URL on every iteration so we don't append multiple "until" parameters
    url = base_url if until is None else "{}&until={}".format(base_url, until)
    response = graph.request(url)
    data = response.get('data')
    if not data:
        break
    posts += data
    # Extract the "until" cursor from the next-page URL for the following call
    paging = response.get('paging', {})
    next_url = paging.get('next')
    if not next_url:
        break
    parsed_url = urlparse.urlparse(next_url)
    until = urlparse.parse_qs(parsed_url.query)["until"][0]

In each API call, we specify the until parameter to get older posts.
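To make the paging mechanics concrete, here is a small illustration of how the until cursor is extracted. The next-page URL below is made up for illustration; the real one is returned by the API in the paging section of the response:

# A hypothetical "next" URL, similar in structure to what the API returns
next_url = "https://graph.facebook.com/v2.10/319479604815804/feed?limit=50&until=1504514663"
parsed_url = urlparse.urlparse(next_url)
print urlparse.parse_qs(parsed_url.query)["until"][0]  # prints '1504514663'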

Now, let’s organize the posts into weekdays and hours of the day:

weekdays = {i: 0 for i in xrange(7)}
hours_of_day = {i: 0 for i in xrange(24)}
hours_of_week = np.zeros((7,24), dtype=np.int)
for post in posts:
    updated = datetime.datetime.strptime(post.get("updated_time"), "%Y-%m-%dT%H:%M:%S+0000")
    weekday = updated.weekday()
    hour_of_day = updated.hour
    weekdays[weekday] += 1
    hours_of_day[hour_of_day] += 1
    hours_of_week[weekday][hour_of_day] += 1

and then, plot the results using matplotlib bar charts:

plt.bar(weekdays.keys(), weekdays.values(), color='g')
plt.show()

weekdays_1

(0 represents Monday)

plt.bar(hours_of_day.keys(), hours_of_day.values(), color='r')
plt.show()

hours_1

All times specified in IST.

Even with this very basic analysis, we can already learn a lot about better and worse time slots for broadcasting to this group. But it doesn’t seem informative enough, perhaps because the data is split across two graphs and is missing some critical information.

Let’s try presenting a heat map of the data, which lets us see three dimensions of information at once:

plt.imshow(hours_of_week, cmap='hot')
plt.show()

heatmap_1

Well, this is much better! We can clearly see that the group is very active on Mondays to Fridays between 6 am and 10 am.

Now let’s take the post length into consideration and see how it affects the results:

weekdays_content = {i: 0 for i in xrange(7)}
hours_of_day_content = {i: 0 for i in xrange(24)}
hours_of_week_content = np.zeros((7,24), dtype=np.int)
for post in posts:
    updated = datetime.datetime.strptime(post.get("updated_time"), "%Y-%m-%dT%H:%M:%S+0000")
    weekday = updated.weekday()
    hour_of_day = updated.hour
    content_length = len(post["message"]) if "message" in post else 1
    weekdays_content[weekday] += content_length
    hours_of_day_content[hour_of_day] += content_length
    hours_of_week_content[weekday][hour_of_day] += content_length
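Plotting this matrix works exactly like before; the same two lines can be reused later for the reactions matrix as well:

plt.imshow(hours_of_week_content, cmap='hot')
plt.show()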

The heatmap we get:

heatmap2

This is nice, but it should be treated with caution. On one hand, we can clearly see a very specific point in time that looks like the optimal slot for our Facebook Live session. On the other hand, it might just be an outlier caused by one very long post.

I’ll leave it to you to figure that out in your next data analysis project. What I suggest is to take a larger number of posts, or to grab an older batch of 5,000 posts from the group’s feed.

In order to take reactions into account when analyzing the data, we need to make another API call for each post, because it’s a different API endpoint:

weekdays_reactions = {i: 0 for i in xrange(7)}
hours_of_day_reactions = {i: 0 for i in xrange(24)}
hours_of_week_reactions = np.zeros((7,24), dtype=np.int)
for i, post in enumerate(posts):
    url = "https://graph.facebook.com/v2.10/{id}/reactions?access_token={token}&summary=total_count".format(
        id=post["id"],
        token=ACCESS_TOKEN
    )

    headers = {
        "Host": "graph.facebook.com"
    }

    response = requests.get(url, headers=headers)

    try:
        # Add 1 so posts with zero reactions still count as activity
        total_reactions = 1 + response.json().get("summary").get("total_count")
    except Exception:
        total_reactions = 1

    updated = datetime.datetime.strptime(post.get("updated_time"), "%Y-%m-%dT%H:%M:%S+0000")
    weekday = updated.weekday()
    hour_of_day = updated.hour
    weekdays_reactions[weekday] += total_reactions
    hours_of_day_reactions[hour_of_day] += total_reactions
    hours_of_week_reactions[weekday][hour_of_day] += total_reactions

The reason we used a low-level approach here, issuing the HTTP request directly instead of going through the Facebook Python library, is that the library doesn’t support the latest version of the Facebook API, which is required when querying the Reactions endpoint.

The heat map generated from this data:

heatmap_3

We can conclude that the three approaches we used all agree on Mondays and Wednesdays, 6-7 am.

Conclusions

Data analysis can be challenging and often requires creativity. But it is also exciting and very rewarding.

After choosing our time to broadcast on Facebook Live based on the analysis presented here, we had a huge success and a lot of traction during our Live session.

I encourage you to try and use data analysis to make data-driven decisions in your next business move. And on top of that, start thinking in terms of data-driven decisions.

You can find the GitHub repository here.

 

Mastering Python Web Scraping: Get Your Data Back

Do you ever find yourself in a situation where you need to get information out of a website that conveniently doesn’t have an export option?

This happened to a client of mine who desperately needed lists of email addresses from a platform that did not allow you to export your own data and hid it behind a series of UI hurdles. She was about to pay through the nose for a data-entry worker to copy each email out by hand. Luckily, she remembered that web scraping is the way of the future and happens to be one of my favorite ways to rebel against “big brother”. I hacked something out fast (15 minutes) and saved her a lot of money. I know others out there face similar issues, so I wanted to share how to write a program that uses the web browser like you would and takes (back) the data!

We will practice this together with a simple example: scraping a Google search. Sorry, not very creative 🙂 But it’s a good way to start.

 

Requirements

  • Python (I use 2.7)
    • Splinter (based on Selenium)
    • Pandas
  • Chrome
  • Chromedriver

If you don’t have Pandas and are lazy, I recommend heading over to Anaconda to get their distribution of Python that includes this essential & super useful library.

Otherwise, download it and all of its dependencies with pip from the terminal/command line.
pip install pandas

If you don’t have Splinter (and are not using Anaconda’s Python), simply download it with pip from the terminal/command line.
pip install splinter

If you don’t have Splinter (and are using Anaconda’s Python), download it with Anaconda’s package manager from the terminal/command line.
conda install splinter

If you want to set this up in a virtual environment (which has many advantages) but don’t know where to start, try reading our other blog post about virtual environments.

 

Step 1: The Libraries & Browser

Here we will import all the libraries we need and set up a browser object.

from splinter import Browser
import pandas as pd

# open a browser
browser = Browser('chrome')

If the page you are trying to scrape is responsive, use set_window_size to ensure all the elements you need are displayed.

# Width, Height
browser.driver.set_window_size(640, 480)

The code above will open a Google Chrome browser. Now that the browser is all set up, let’s visit Google.

browser.visit('https://www.google.com')

 

Step 2: Explore the Website

Great, so far we have made it to the front page. Now we need to focus on how to navigate the website. There are two main steps to achieving this:

  1. Find something (an HTML element)
  2. Perform an action on it

To find an HTML element you need to use the Chrome developer tools. Right click on the website and select “Inspect”. This will open a box on the right side of the Chrome browser. Then click on the inspect icon (highlighted in red).

Next use the inspector cursor to click on a section of the website that you want to control. When you have clicked, the HTML that creates that section will be highlighted on the right. In the photo below, I have clicked on the search bar which is an input.

Next, right click on the HTML element and select “Copy” -> “Copy XPath”.

Congrats! You’ve now got the keys to the kingdom. Let’s move on to how to use Splinter to control that HTML element from Python.

 

Step 3: Control the Website

That XPath is the most important piece of information! First, keep this XPath safe by pasting into a variable in Python.

# I recommend using single quotes
search_bar_xpath = '//*[@id="lst-ib"]'

Next we will pass this XPath to a great method from the Splinter Browser object: find_by_xpath(). This method will extract all the elements that match the XPath you pass it and return a list of Element objects. If there is only one element, it will return a list of length 1. There are other methods such as find_by_tag(), find_by_name(), find_by_text(), etc.

# I recommend using single quotes
search_bar_xpath = '//*[@id="lst-ib"]'

# index 0 to select from the list
search_bar = browser.find_by_xpath(search_bar_xpath)[0]

The code above gives you control of this individual HTML element. There are two useful methods I use for crawling: fill() and click().

search_bar.fill("CodingStartups.com")

# Now let's set up code to click the search button!
search_button_xpath = '//*[@id="tsf"]/div[2]/div[3]/center/input[1]'
search_button = browser.find_by_xpath(search_button_xpath)[0]
search_button.click()

The code above types CodingStartups.com into the search bar and clicks the search button. Once you execute the last line, you will be brought to the search results page!

Tip: Use fill() and click() to navigate login pages 😉
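A login flow usually boils down to two fill() calls and one click(). Here is a minimal sketch; the URL and the element XPaths are made-up assumptions you would replace with the ones you copy from the inspector:

# Hypothetical login page and element XPaths - replace with your own
browser.visit('https://example.com/login')
browser.find_by_xpath('//*[@id="email"]')[0].fill('you@example.com')
browser.find_by_xpath('//*[@id="password"]')[0].fill('your-secret-password')
browser.find_by_xpath('//*[@id="login-button"]')[0].click()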

 

Step 4: Scrape!

For the purpose of this exercise, we will scrape the titles and links of each search result on the first page.

Notice that each search result is stored within an h3-tag with class “r”. Also note that both the title and the link we want are stored within an a-tag.

The XPath of that highlighted a-tag is: //*[@id="rso"]/div/div/div[1]/div/div/h3/a

But this is just the first link. We want all of the links on the search page, not just the first one. So we are going to change the XPath a bit to make sure our find_by_xpath method returns all of the search results in a list:

search_results_xpath = '//h3[@class="r"]/a'  # simple, right? 
search_results = browser.find_by_xpath(search_results_xpath)

This XPath tells Python to look for all h3-tags with a class “r”. Then inside each of them, extract the a-tag & all its data.

Now, let’s iterate through the search result link elements that the find_by_xpath method returned. We will extract the title and link for each search result. It’s very simple:

scraped_data = []
for search_result in search_results:
    title = search_result.text.encode('utf8')  # trust me
    link = search_result["href"]
    scraped_data.append((title, link))  # put in tuples

Cleaning the data in search_result.text can sometimes be the most frustrating part. Text on the web is very messy. Here are some helpful string methods for cleaning it (see the short example after this list):

  • .replace()
  • .encode()
  • .strip()
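For example, a quick cleanup of a scraped title might chain a few of these together (the raw string below is a made-up example of messy text):

# raw_title is a hypothetical messy string scraped from the page
raw_title = "  CodingStartups -\ntechnological entrepreneurs community  "
clean_title = raw_title.replace('\n', ' ').strip().encode('utf8')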

All of the titles and links are now in the scraped_data list. Now let’s export the data to CSV. Instead of wrestling with the csv library, I like to use a pandas DataFrame. It’s two lines:

df = pd.DataFrame(data=scraped_data, columns=["Title", "Link"])
df.to_csv("links.csv")

The code above creates a CSV file with the headers Title and Link, followed by all of the data that was in the scraped_data list. Congrats! Now go forth and take (back) the data!

In case you want a big picture view, here is the full code available on our GitHub account.

Web Scraping is an essential web development skill. Want to get ahead in web development? Read this blog post about the best tools for web development workflow.

 

LaurenGlass

Thanks for reading! My name is Lauren Glass. I am an entrepreneur, data scientist, & developer living in Tel Aviv.

Check out more of my writing at CodingStartups

Follow my adventures on Instagram @ljglass

Machine Learning for hackers: model comparison and selection

As technological entrepreneurs, we are surrounded by machine learning. The science of training machines to learn and produce models for future predictions is widely used, and not for nothing. Less complicated code and more advanced learning algorithms and statistical methods are being introduced to solve our problems better.

As broadly discussed in my post about machine learning 101 and linear regression, the problems we try to solve using machine learning can be divided into two main types: supervised machine learning vs. unsupervised machine learning. Supervised learners learn from labeled data, for example, data about house characteristics that also contains the house price, for house price prediction. In other words, supervised machine learning learns from labeled data-points and predicts labels for future ones.

On the other hand, unsupervised learning learns from unlabeled data and cannot predict labels for future data-points. It is commonly used for dimensionality reduction, clustering, and more.

In this post, we will discuss problems, models and methods related to supervised learning. I assume that you’re already familiar with some machine learning methods like linear regression, Ridge regression and Lasso, and know how to train models using them.

This post is called machine learning for hackers in order to emphasize that developers can train models, use machine learning, and make the most out of it without being professional data scientists. Although there are tons of tools and libraries out there that let you train machine learning models in under 10 lines of code, as a data hacker you need to be familiar with more than just training models. You need to know how to evaluate and compare models and choose the best one for your specific dataset.

Usually, when working on a machine learning problem with a given dataset, we try different models and techniques to solve an optimization problem and fit the most accurate model, one that will neither overfit nor underfit.

When dealing with real world problems, we usually have dozens of features in our dataset. Some of them might be very descriptive, some may overlap and some might even add more noise than signal to our data.

Using prior knowledge about the industry we work in to choose the features is great, but sometimes we need a hand from analytical tools to better choose our features and to compare between models trained using different algorithms.

My goal here is to introduce you to the most common techniques and criteria for comparing between the models you trained in order to choose the most accurate one for your problem.

In particular, we are going to see how to choose between different models that were trained with the same algorithm. Assume we have a dataset with one feature per data-point that we would like to fit using linear regression. Our goal is to choose the best polynomial degree for fitting the model out of 8 different alternatives.

The Problem (and the dataset)

We have been asked to predict house prices based on their size only. The dataset provided to us contains the sizes and prices of 1,200 houses in NYC. We would like to use linear regression to fit a model for predicting future house prices, given prior knowledge about a few model alternatives:

\hat{Y}_1 = \beta_0 + \beta_1 X
\hat{Y}_2 = \beta_0 + \beta_1 X + \beta_2 X^2
\hat{Y}_3 = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3
\hat{Y}_4 = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4
\hat{Y}_5 = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \beta_5 X^5
\hat{Y}_6 = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \beta_5 X^5 + \beta_6 X^6
\hat{Y}_7 = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \beta_5 X^5 + \beta_6 X^6 + \beta_7 X^7
\hat{Y}_8 = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \beta_5 X^5 + \beta_6 X^6 + \beta_7 X^7 + \beta_8 X^8

Where X represents the house size.
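As a quick illustration of how such models can be trained, here is a minimal sketch using numpy's polyfit, which performs a least-squares (linear regression) fit of a polynomial of a given degree. The arrays sizes and prices are hypothetical stand-ins for the dataset described above:

import numpy as np

# sizes and prices are hypothetical 1-D arrays of house sizes and house prices
degrees = range(1, 9)
models = {d: np.polyfit(sizes, prices, d) for d in degrees}

# Example: predictions of model #3 on the same sizes
predictions_3 = np.polyval(models[3], sizes)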

Given the 8 model alternatives, we have been asked to compare the models using some criteria and choose the polynomial degree that best suits our dataset for predicting future house prices.

As described in my previous post, complex models tend to overfit. Therefore, we need to be careful when choosing our model so it’ll provide us with good predictions not only for the current dataset but also for future data-points.

What is a train/test split and why we need it

When dealing with real-world machine learning problems, our dataset is limited in size. With a relatively small dataset, we want to train our model and also evaluate its accuracy. This is where the train/test split comes in handy.

A train/test split is a method for splitting our dataset into two groups: a training group of data-points used to train the model, and a testing group used to test it. We usually split the data unequally, because training the model usually requires as many data-points as possible.

The common splits are 70/30 or 80/20 for train/test.
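For example, an 80/20 split can be done with scikit-learn's train_test_split (using scikit-learn here is my assumption; any random split works just as well):

from sklearn.model_selection import train_test_split

# sizes and prices are the hypothetical arrays from the sketch above
X_train, X_test, y_train, y_test = train_test_split(
    sizes, prices, test_size=0.2, random_state=42)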

How NOT to compare the models

The most basic metric for evaluating trained machine learning models is the MSE. MSE stands for mean squared error and is given by the average of the squared errors. In other words, the MSE measures the average squared difference between the predicted values and the real values, and we want to minimize it when training models:

MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2

where n is the number of data-points.

The MSE should be used with caution. The reason is that the MSE can be calculated either on the training data-points or on the testing data-points. If you haven’t guessed by now, the correct way to use the MSE to evaluate your model is to train the model on the training dataset and calculate the MSE on the testing dataset.
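Continuing the hypothetical split above, a minimal sketch of that procedure for, say, the degree-3 model looks like this:

# Train on the training set only
coeffs_3 = np.polyfit(X_train, y_train, 3)

# Evaluate on the held-out test set
test_predictions = np.polyval(coeffs_3, X_test)
mse_test = np.mean((test_predictions - y_test) ** 2)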

Without a train/test split of our data, we would be forced to calculate the MSE on the same dataset we trained the model on. This scenario leads straight to overfitting. Why?

Assume we haven’t split the dataset into train and test, have trained the 8 models described above, and have calculated MSE_train for each of them. Which model will give us the lowest MSE_train? Most likely model #8, because it’s the most complex one and it overfits the data rather than learning it.

Because we train and test the model on the exact same dataset, MSE_train gets lower as we use more complex models that fit the training data better (don’t forget that the optimization problem we solve is to minimize the errors between the predictions and the ground truth).

MSE

So we’ve learned that we should compute the MSE on the test set, after splitting our data. But there are more advanced criteria for evaluating our models (based on the MSE) that we usually use instead of the MSE alone.

Which criteria we can use

After understanding why we need to split our data into train and test, and what the MSE means, we will cover 3 main criteria for comparing our 8 different models. These criteria account for overfitting and help us choose the best model for our dataset.

#1: Mallows’s Cp

Cp is a statistical method suggested by Mallows in 1973 to estimate the expectation of the bias. Assuming we have a very small dataset, such that splitting it into train and test does not make sense, we can use Cp to estimate MSE_test using the MSE_train calculated on the training dataset.

The Cp criterion, i.e. the estimator for MSE_test, is given by:

C_p = MSE_{train} + \frac{2\sigma^2 P}{n}

where σ² is the error variance estimated from the full model (model #8), and P is the number of predictors.

In order to use Mallows’s Cp to compare between our models, we need to train each model on the full dataset, calculate Mallows’s Cp estimator for each of the trained models, and choose the model with the lowest Cp result.
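Here is a minimal sketch of that procedure, based directly on the formula above and reusing the hypothetical sizes, prices and degrees from before. Estimating σ² from the residuals of the full model and counting the intercept as a predictor are my assumptions:

# Estimate the error variance from the full model (#8)
full_coeffs = np.polyfit(sizes, prices, 8)
sigma2 = np.var(prices - np.polyval(full_coeffs, sizes))

n = len(sizes)
cp_scores = {}
for d in degrees:
    coeffs = np.polyfit(sizes, prices, d)
    mse_train = np.mean((np.polyval(coeffs, sizes) - prices) ** 2)
    p = d + 1  # number of predictors, including the intercept (an assumption)
    cp_scores[d] = mse_train + 2.0 * sigma2 * p / n

# The model with the lowest Cp is preferred
best_degree = min(cp_scores, key=cp_scores.get)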

Cp

As we can see, while MSE_train decreases as the polynomial degree increases (more complex model), and therefore cannot tell us which model to choose, both MSE_test and Mallows’s Cp point to model #3 as the best model for our dataset.

Note: Mallows’s Cp was developed for models trained with linear regression. Therefore, you must not use it with models trained using other machine learning algorithms.

#2: BIC

We’ve already mentioned that when fitting models, adding parameters and making the model more complex can result in overfitting. BIC is a statistical criterion that attempts to resolve this problem by introducing a penalty term for the number of parameters in the model.

BIC, which stands for Bayesian Information Criterion, assumes that there is a correct model among the suggested models and its goal is to choose it.

The mathematical form is very similar to Mallows’s Cp and is given by:

BIC = MSE_{train} + \frac{\log(n)\,\sigma^2 P}{n}

The model with the lowest BIC is preferred.
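Computing it is almost identical to the Cp sketch above; only the penalty term changes (again reusing the hypothetical sigma2, n and degrees):

bic_scores = {}
for d in degrees:
    coeffs = np.polyfit(sizes, prices, d)
    mse_train = np.mean((np.polyval(coeffs, sizes) - prices) ** 2)
    p = d + 1
    bic_scores[d] = mse_train + np.log(n) * sigma2 * p / n

best_degree_bic = min(bic_scores, key=bic_scores.get)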

BIC

Leaving the MSE_train values aside, all other criteria unanimously choose model #3 as the best fit for our data.

Note: when we are not sure whether one of the suggested models is correct, BIC can behave in an unexpected way. Therefore, in real-world problems, we should use it with caution.

#3: Cross Validation – probably the most popular criteria

Dealing with machine learning problems requires a good understanding of cross validation (or CV). Cross validation is used in many different ways in machine learning, all of them related to the comparison and selection of parameters and models.

Cross validation is basically an extension of the train/test split methodology. Its advantage, though, is that it randomly divides the dataset multiple times, and each time it trains and tests the model on a slightly different split of the data.

By doing that, we make sure that we don’t evaluate the model’s error based on outliers or data that doesn’t represent the signal properly. We then average MSE_test over the splits to evaluate the model based on multiple train/test splits:

CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} MSE_{i,test}

The preferred model will be the one with the lowest CV(n). There is a critical point to understand here – when comparing models using cross validation there is a nested iteration: for each model, we split the dataset randomly several times, calculate MSE_i,test for each split, and then average these into a single CV indicator. We therefore get one CV indicator per model, and based on it we choose the preferred model.

There are two main implementations for cross validation splits:

  1. Leave one out cross validation
  2. K-fold cross validation (the most popular)

Leave-one-out CV iterates over the dataset and, in each iteration, takes out one data-point that is not included in the training set but is instead used to test the model’s performance.

LOOCV

K-fold CV takes a parameter K, splits the dataset into K parts, iterates over the parts and, in each iteration, leaves the k-th part out of training and uses it as the testing set.

K-fold CV

Choosing the parameter K, the number of folds, can sometimes be tricky because it affects the bias-variance tradeoff on our data. A rule of thumb is to choose either 5 or 10 (depending on the size of the dataset).
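A minimal K-fold sketch, again on the hypothetical sizes and prices arrays and using scikit-learn's KFold (an assumption; you could also write the splitting by hand):

from sklearn.model_selection import KFold

def cv_mse(degree, X, y, k=5):
    # Average the test MSE over k random folds
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    fold_errors = []
    for train_idx, test_idx in kf.split(X):
        coeffs = np.polyfit(X[train_idx], y[train_idx], degree)
        preds = np.polyval(coeffs, X[test_idx])
        fold_errors.append(np.mean((preds - y[test_idx]) ** 2))
    return np.mean(fold_errors)

cv_scores = {d: cv_mse(d, sizes, prices) for d in degrees}
best_degree_cv = min(cv_scores, key=cv_scores.get)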

Cross validation is a wonderful tool used all over machine learning and statistics. In the following chart, we can see how the different indicators estimate the models’ errors.

CV

Cross validation, unlike Cp or BIC, works well for most of the machine learning algorithms and doesn’t assume that there is one correct model.


I encourage you to try and plot each of these estimators in your next machine learning project when you’re asked to compare between models to figure out what works best for your specific situation.

In real world problems, we are usually asked to compare between models with more than one feature and the comparison can be done exactly the same way we covered in this post.

I also encourage you to check out the Lasso method. Lasso has a built-in regularization mechanism that removes unnecessary features from our models, which is usually what we want to achieve when comparing different models.