Author: Lauren Glass

Mastering Python Web Scraping: Get Your Data Back

Do you ever find yourself in a situation where you need to get information out of a website that conveniently doesn’t have an export option?

This happened to a client of mine who desperately needed lists of email addresses from a platform that did not allow you to export your own data and hid it behind a series of UI hurdles. She was about to pay through the nose for a data-entry worker to copy each email out by hand. Luckily, she remembered that web scraping is the way of the future, and that it happens to be one of my favorite ways to rebel against “big brother”. I hacked something together fast (about 15 minutes) and saved her a lot of money. I know others out there face similar issues, so I wanted to share how to write a program that uses the web browser the way you would and takes (back) the data!

We will practice this together with a simple example: scraping a Google search. Sorry, not very creative 🙂 But it’s a good way to start.

 

Requirements

  • Python (I use 2.7)
    • Splinter (based on Selenium)
    • Pandas
  • Chrome
  • Chromedriver

If you don’t have Pandas and are lazy, I recommend heading over to Anaconda to get their distribution of Python that includes this essential & super useful library.

Otherwise, download it and all of its dependencies with pip from the terminal/command line.
pip install pandas

If you don’t have Splinter (and are not using Anaconda’s Python), simply download it with pip from the terminal/command line.
pip install splinter

If you don’t have Splinter (and are using Anaconda’s Python), download it with Anaconda’s package manager from the terminal/command line.
conda install splinter
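
Either way, a quick sanity check (just a minimal sketch, nothing more) is to confirm that both libraries import cleanly before moving on:

# sanity check: both imports should succeed without errors
import splinter
import pandas as pd

print(pd.__version__)  # prints the installed pandas version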

If you want to set this up in a virtual environment (which has many advantages) but don’t know where to start, try reading our other blog post about virtual environments.

 

Step 1: The Libraries & Browser

Here we will import all the libraries we need and set up a browser object.

from splinter import Browser
import pandas as pd

# open a browser
browser = Browser('chrome')

If the page you are trying to scrape is responsive, use set_window_size to ensure all the elements you need are displayed.

# Width, Height
browser.driver.set_window_size(640, 480)

The Browser('chrome') line above will open a Google Chrome window. Now that the browser is all set up, let’s visit Google.

browser.visit('https://www.google.com')
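
If you want to be sure the page actually loaded before moving on, Splinter exposes the current URL and page title directly on the browser object:

# optional sanity check that the visit worked
print(browser.url)    # the address of the current page
print(browser.title)  # the page title, e.g. "Google"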

 

Step 2: Explore the Website

Great, so far we have made it to the front page. Now we need to focus on how to navigate the website. There are two main steps to achieving this:

  1. Find something (an HTML element)
  2. Perform an action on it

To find an HTML element you need to use the Chrome developer tools. Right click on the website and select “Inspect”. This will open a box on the right side of the Chrome browser. Then click on the inspect-cursor icon in the top-left corner of that box.

Next use the inspector cursor to click on a section of the website that you want to control. When you click, the HTML that creates that section will be highlighted on the right. For this example, I clicked on the search bar, which is an input element.

Next, right click on that highlighted HTML element and select “Copy” -> “Copy XPath”.

Congrats! You’ve now got the keys to the kingdom. Let’s move on to how to use Splinter to control that HTML element from Python.

 

Step 3: Control the Website

That XPath is the most important piece of information! First, keep this XPath safe by pasting it into a variable in Python.

# I recommend using single quotes
search_bar_xpath = '//*[@id="lst-ib"]'

Next we will pass this XPath to a great method from the Splinter Browser object: find_by_xpath(). This method extracts all the elements that match the XPath you pass it and returns them as a list of Element objects. If there is only one match, it returns a list of length 1. There are other finder methods as well, such as find_by_tag(), find_by_name(), find_by_text(), etc.
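
For instance, assuming the search box still carries the name attribute "q" (Google’s markup changes over time, so treat this as a sketch), find_by_name() offers a shorter route to the same element:

# alternative lookup -- assumes the search box still has name="q"
search_bar = browser.find_by_name('q')[0]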

# I recommend using single quotes
search_bar_xpath = '//*[@id="lst-ib"]'

# index 0 to select from the list
search_bar = browser.find_by_xpath(search_bar_xpath)[0]

The code above gives you a handle on this individual HTML element. There are two methods I use constantly when crawling: fill() and click().

search_bar.fill("CodingStartups.com")

# Now let's set up code to click the search button!
search_button_xpath = '//*[@id="tsf"]/div[2]/div[3]/center/input[1]'
search_button = browser.find_by_xpath(search_button_xpath)[0]
search_button.click()

The code above types CodingStartups.com into the search bar and clicks the search button. Once you execute the last line, you will be brought to the search results page!

Tip: Use fill() and click() to navigate login pages 😉
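
For example, here is a rough sketch of what a login flow could look like. The XPaths below are placeholders, so replace them with ones copied from the login page itself:

# hypothetical login flow -- all three XPaths are placeholders
username_xpath = '//*[@id="username"]'
password_xpath = '//*[@id="password"]'
login_button_xpath = '//*[@id="login-button"]'

browser.find_by_xpath(username_xpath)[0].fill('my_username')
browser.find_by_xpath(password_xpath)[0].fill('my_password')
browser.find_by_xpath(login_button_xpath)[0].click()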

 

Step 4: Scrape!

For the purpose of this exercise, we will scrape the title and link of each search result on the first page.

Notice that each search result is stored within an h3 tag with the class “r”. Also note that both the title and the link we want are stored within an a tag inside it.

The XPath of that highlighted a tag is: //*[@id="rso"]/div/div/div[1]/div/div/h3/a

But that is just the first link. We want all of the links on the results page, not just the first one, so we will generalize the XPath a bit to make sure find_by_xpath returns every search result in one list. See the code below:

search_results_xpath = '//h3[@class="r"]/a'  # simple, right? 
search_results = browser.find_by_xpath(search_results_xpath)

This XPath tells Splinter to look for all h3 tags with the class “r” and then, inside each of them, to extract the a tag and all of its data.

Now, let’s iterate through the search result link elements that find_by_xpath returned and extract the title and link for each search result. It’s very simple:

scraped_data = []
for search_result in search_results:
    title = search_result.text.encode('utf8')  # trust me
    link = search_result["href"]
    scraped_data.append((title, link))  # put in tuples

Cleaning the data in search_result.text can sometimes be the most frustrating part. Text on the web is very messy. Here are some helpful string methods for cleaning it up:

  • .replace()
  • .encode()
  • .strip()
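
For example, here is a small sketch (on made-up text) of how these methods chain together:

# hypothetical messy text, like something pulled from a page
raw_title = '  CodingStartups.com\nLearn to code  '

# strip() trims the outer whitespace, replace() swaps the stray newline
# for a space, and encode() keeps non-ASCII characters from breaking the export
clean_title = raw_title.strip().replace('\n', ' ').encode('utf8')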

All of the titles and links are now in the scraped_data list. Now let’s export the data to csv. Instead of wrestling with the csv library, I like to use a pandas DataFrame. It takes just 2 lines:

df = pd.DataFrame(data=scraped_data, columns=["Title", "Link"])
df.to_csv("links.csv")

The code above creates a csv file with the headers Title, Link and then all of the data that was in the scraped_data list. Congrats! Now go forth and take (back) the data!
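
Before you do, two optional touches. If you don’t want the numeric row index in the file, or you scraped non-ASCII text, to_csv() takes a couple of handy keyword arguments, and once the scrape is finished it’s polite to close the browser:

# optional tweaks: drop the row index and write the file as utf-8
df.to_csv("links.csv", index=False, encoding='utf-8')

# close Chrome when you are done
browser.quit()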

In case you want a big picture view, here is the full code available on our GitHub account.

Web scraping is an essential web development skill. Want to get ahead in web development? Read this blog post about the best tools for web development workflow.

 

Lauren Glass

Thanks for reading! My name is Lauren Glass. I am an entrepreneur, data scientist, & developer living in Tel Aviv.

Check out more of my writing at CodingStartups

Follow my adventures on Instagram @ljglass

Recruit Developers: Must-Ask Questions

Have you ever heard the phrase, “it’s not the idea, it’s the people”? Every startup needs top talent to survive – especially top developers. But how do you find them? Interviews of course! I’ve interviewed multiple rounds of people to join an intensive coding community. We are looking for people that we can place into full-time developer positions at the end.

It is easy enough to find tough coding questions to test interviewees’ technical skills, but what about questions to determine characteristics that are hard to measure? Here is a list of questions to help you find those key people who will make your team a success. Many of these questions work for non-developer roles as well. Print them out and bring them to your next interview!

Independence

  • Tell me about a project you worked on by yourself. What resources did you use? What was the outcome?
  • What skills do you want to gain but don’t have yet? How do you plan to get them?
  • When in your life did you first learn the value of independence? How have you used that knowledge since then?

Leadership

  • Tell us about a manager you liked. What did you learn from them?
  • Tell me about a project that you led. What worked? What was hard?
  • What do you do when there is a disagreement between group members? How did you learn these skills?
  • If you could lead any project, what would it be? Why?

Teamwork

  • Growing up, what was the activity that taught you the most about the value of teamwork?
  • What role do you normally take in a team environment?
  • What role is hardest for you?
  • What would past team members say about you?
  • What would you want your next manager to know about you?
  • What skills do you have that others may not? How do you normally teach other team members this skill? How do you utilize it to benefit a team?

Problem Solving

  • Tell us about a time when nothing worked. What did you do?
  • What online resource is your favorite for solving problems (other than stack overflow)?
  • Tell me about a time you had to ask for help. Who gave it to you and what happened?

Stress

  • What do you do to prepare for deadlines?
  • How do you de-stress?
  • What are your tips for handling pressure?
  • Tell me about a time you missed a deadline. What happened? What did you do to fix it?
  • Who do you know who handles pressure well? How do they do it? What do you admire about it?

Commitment

  • Tell us about the longest project you ever worked on.
  • How do you decide it is time to move on to another opportunity?
  • What project at our company would you like to be doing in 6 months? 1 year?
  • Where can you see room for improvement in our startup that you can own? How long do you think it would take?

Personality

  • Who is someone you admire? Why?
  • Tell us about a friend or colleague who is the most different from you. How did you work together? What did you like about them? What was hard?
  • What was one of the most transformative experiences in your life? What were you like before? How are you different after?
  • What personality skill do you want to develop?
  • What personality characteristics are hard for you to work with? Which are easier for you?

Management

  • What are the common characteristics of good managers?
  • How do you give feedback to people with more responsibilities than you?
  • How do you give feedback to people with fewer responsibilities than you?
  • What tips do you follow to stay organized? What other systems have you seen work?
  • We have this project coming up. How would you organize it? Who would you assign to which tasks? What extra resources would you need?

 

Ultimately, go with your instincts. Did you enjoy talking to the person? Don’t overlook red flags; dive in. Invite other people on your team to join the interview, or have them talk with the candidate later, and focus on anything that gives you pause. Be approachable and respectful. Even if this person is not a good fit for your company, they may have a close friend who is. It is essential to leave things on a good note!

 

Lauren Glass

 

Lauren Glass is an entrepreneur, data scientist, & developer living in Tel Aviv.

Check out more of her writing at CodingStartups

Follow her adventures on Instagram @ljglass