Mastering Python Web Scraping: Get Your Data Back

Do you ever find yourself in a situation where you need to get information out of a website that conveniently doesn’t have an export option?

This happened to a client of mine who desperately needed lists of email addresses from a platform that did not allow you to export your own data and hid the data behind a series of UI hurdles. This client was about to pay out the nose for a data-entry worker to copy each email out by hand. Luckily, she remembered that web scraping is the way of the future and happens to be one of my favorite ways to rebel against “big brother”. I hacked something out fast (15 minutes) and saved her a lot of money. I know others out there face similar issues. So I wanted to share how to write a program that uses the web browser like you would and takes (back) the data!

We will practice this together with a simple example: scraping a Google search. Sorry, not very creative 🙂 But it’s a good way to start.

 

Requirements

  • Python (I use 2.7)
    • Splinter (based on Selenium)
    • Pandas
  • Chrome
  • Chromedriver

If you don’t have Pandas and are lazy, I recommend heading over to Anaconda to get their distribution of Python that includes this essential & super useful library.

Otherwise, download it with pip from the terminal/command line & all of its dependencies.
pip install pandas

If you don’t have Splinter (and are not using Anaconda’s Python), simply download it with pip from the terminal/command line.
pip install splinter

If you don’t have Splinter (and are using Anaconda’s Python), download it with Anaconda’s package manager from the terminal/command line.
conda install splinter

If you want to set this up in a virtual environment (which has many advantages) but don’t know where to start, try reading our other blog post about virtual environments.

 

Step 1: The Libraries & Browser

Here we will import all the libraries we need and set up a browser object.

from splinter import Browser
import pandas as pd

# open a browser
browser = Browser('chrome')

If the page you are trying to scrape is responsive, use set_window_size to ensure all the elements you need are displayed.

# Width, Height
browser.driver.set_window_size(640, 480)

The code above will open a Google Chrome browser. Now that the browser is all set up, let’s visit Google.

browser.visit('https://www.google.com')

 

Step 2: Explore the Website

Great, so far we have made it to the front page. Now we need to focus on how to navigate the website. There are two main steps to achieving this:

  1. Find something (an HTML element)
  2. Perform an action on it

To find an HTML element you need to use the Chrome developer tools. Right click on the website and select “Inspect”. This will open a box on the right side of the Chrome browser. Then click on the inspect icon (highlighted in red).

Next use the inspector cursor to click on a section of the website that you want to control. When you have clicked, the HTML that creates that section will be highlighted on the right. In the photo below, I have clicked on the search bar which is an input.

Next right click on the HTML element, and select under “Copy” -> “Copy XPath”

Congrats! You’ve now got the keys to the kingdom. Let’s move on to how to use Splinter to control that HTML element from Python.

 

Step 3: Control the Website

That XPath is the most important piece of information! First, keep this XPath safe by pasting into a variable in Python.

# I recommend using single quotes
search_bar_xpath = '//*[@id="lst-ib"]'

Next we will pass this XPath to a great method from the Splinter Browser object: find_by_xpath(). This method will extract all the elements that match the XPath you pass it and return a list of Element objects. If there is only one element, it will return a list of length 1. There are other methods such as find_by_tag(), find_by_name(), find_by_text(), etc.

# I recommend using single quotes
search_bar_xpath = '//*[@id="lst-ib"]'

# index 0 to select from the list
search_bar = browser.find_by_xpath(search_bar_xpath)[0]

The code above now gives you control of this individual HTML element. There are two useful methods I use for crawling: fill() and click()

search_bar.fill("CodingStartups.com")

# Now let's set up code to click the search button!
search_button_xpath = '//*[@id="tsf"]/div[2]/div[3]/center/input[1]'
search_button = browser.find_by_xpath(search_button_xpath)[0]
search_button.click()

The code above types CodingStartups.com into the search bar and clicks the search button. Once you execute the last line, you will be brought to the search results page!
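On slower pages the element may not exist yet when you index the list, and indexing an empty result raises an error. Here is a minimal sketch of a helper that waits for both elements before acting. The name fill_and_click and the 5 second default are my own choices for illustration, and it assumes the browser object is a Splinter Browser, which provides is_element_present_by_xpath with a wait_time argument.

```python
def fill_and_click(browser, input_xpath, button_xpath, text, wait_time=5):
    """Type `text` into one element, then click another.

    `browser` is assumed to be a Splinter Browser. The helper name and the
    default wait are illustrative, not part of Splinter itself.
    """
    for xpath in (input_xpath, button_xpath):
        # Polls the page until the element appears or wait_time seconds pass.
        if not browser.is_element_present_by_xpath(xpath, wait_time=wait_time):
            raise LookupError("No element matches XPath: %s" % xpath)

    # Both elements exist now, so indexing [0] is safe.
    browser.find_by_xpath(input_xpath)[0].fill(text)
    browser.find_by_xpath(button_xpath)[0].click()
```

With the variables from this step, the call would look like fill_and_click(browser, search_bar_xpath, search_button_xpath, "CodingStartups.com").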

Tip: Use fill() and click() to navigate login pages 😉

 

Step 4: Scrape!

For the purpose of this exercise, we will scrape off the titles and links for each search result on the first page.

Notice that each search result is stored within an h3-tag with a class “r”. Also take note that both the title and the link we want are stored within an a-tag.

The XPath of that highlighted a tag is: //*[@id="rso"]/div/div/div[1]/div/div/h3/a

But this is just the first link. We want all of the links on the search page, so we are going to change the XPath a bit to make sure our find_by_xpath method returns every search result in a list. See the code below:

search_results_xpath = '//h3[@class="r"]/a'  # simple, right? 
search_results = browser.find_by_xpath(search_results_xpath)

This XPath tells Python to look for all h3-tags with a class “r”. Then inside each of them, extract the a-tag & all its data.
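If you want to see what that XPath matches without opening a browser, here is a small offline sketch using Python's built-in xml.etree.ElementTree. The HTML snippet and the example.com links are made up for illustration, and ElementTree only supports a subset of XPath, so the expression starts with .// instead of //.

```python
import xml.etree.ElementTree as ET

# A made-up, simplified stand-in for a results page.
page = """
<div id="rso">
  <div><h3 class="r"><a href="https://example.com/one">First result</a></h3></div>
  <div><h3 class="r"><a href="https://example.com/two">Second result</a></h3></div>
  <div><h3 class="other"><a href="https://example.com/ad">Not a result</a></h3></div>
</div>
"""

root = ET.fromstring(page)

# Every a-tag that sits directly inside an h3-tag with class "r".
links = root.findall(".//h3[@class='r']/a")

scraped = [(a.text, a.get("href")) for a in links]
print(scraped)
# [('First result', 'https://example.com/one'), ('Second result', 'https://example.com/two')]
```

The third link is skipped because its h3-tag has a different class, which is exactly the filtering the predicate in square brackets does.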

Now, let’s iterate through the search result link elements that the find_by_xpath method returned. We will extract the title and link for each search result. It’s very simple:

scraped_data = []
for search_result in search_results:
    title = search_result.text.encode('utf8')  # trust me
    link = search_result["href"]
    scraped_data.append((title, link))  # put in tuples

Cleaning the data in search_result.text can sometimes be the most frustrating part. Text on the web is very messy. Here are some helpful methods for cleaning data:

.replace()

.encode()

.strip()
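To make that concrete, here is a rough sketch of a cleaning helper built from those string methods. The function name clean_text and the sample string are mine, and it is written for Python 3, where text is already unicode, so .encode() is left out.

```python
def clean_text(raw):
    """Tidy up a scraped string: fix odd spaces and trim the ends."""
    # Non-breaking spaces are common in scraped HTML; make them plain spaces.
    text = raw.replace("\xa0", " ")
    # Collapse newlines, tabs and repeated spaces into single spaces.
    text = " ".join(text.split())
    return text.strip()


print(clean_text("  Coding\xa0Startups:\n  scrape it "))
# Coding Startups: scrape it
```

You could call it on search_result.text inside the loop before appending to scraped_data.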

All of the titles and links are now in the scraped_data list. Now to export our data to csv. Instead of the csv library chaos, I like to use a pandas dataframe. It’s 2 lines:

df = pd.DataFrame(data=scraped_data, columns=["Title", "Link"])
df.to_csv("links.csv")

The code above creates a csv file with the headers Title, Link and then all of the data that was in the scraped_data list. Congrats! Now go forth and take (back) the data!
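One detail worth knowing: by default to_csv also writes the DataFrame's row index as an extra first column. If you only want Title and Link, passing index=False should do it. A small sketch with made-up rows (the example.com links are placeholders), written to a string so you can see the result without creating a file:

```python
import pandas as pd

# Placeholder rows standing in for the scraped_data list.
scraped_data = [
    ("Coding Startups", "https://example.com/a"),
    ("Another result", "https://example.com/b"),
]

df = pd.DataFrame(data=scraped_data, columns=["Title", "Link"])

# With no file path, to_csv returns the CSV as a string.
# index=False leaves out the 0, 1, 2... row numbers.
csv_text = df.to_csv(index=False)
print(csv_text)
```

To write the same thing to disk, use df.to_csv("links.csv", index=False).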

In case you want a big picture view, here is the full code available on our GitHub account.

Web Scraping is an essential web development skill. Want to get ahead in web development? Read this blog post about the best tools for web development workflow.

 

LaurenGlass

Thanks for reading! My name is Lauren Glass. I am an entrepreneur, data scientist, & developer living in Tel Aviv.

Check out more of my writing at CodingStartups

Follow my adventures on Instagram @ljglass

8 top must-use tools to boost your web development workflow

As developers, before we deploy our applications or even before we choose our cloud provider, we should consider which tools we use for our day-to-day internal workflow. The tools in our toolbox can either boost our productivity dramatically or make our web development project extremely complex and difficult to maintain or to scale up as we recruit more team members.

A major part of growing from junior developers into senior developers involves adopting tools that simplify our task management process, make communication with other team members seamless, and integrate with each other so they work together in harmony, creating a stack that works best for you and for your team.

As technical startup co-founders, we have the responsibility of creating workflows that work well at scale and are easy to use, adapt and maintain for most of the developers who will join our team. In order to implement the most productive workflows for our team, we first need to master them ourselves.

In this post, I will introduce you to the set of tools that most of the junior web developers use on a daily basis to manage, analyze and maintain their products. You might already be well-familiar with some of them, and therefore, my goal is to not only introduce you to these tools but also provide you with best practices of how to use them and how to integrate them together to create a harmony that works for you.

Before I start listing the tools and diving deep into each of them, I want to mention the most important consideration of all which is the operating system you use. I’m not going to get into further details about operating system considerations here because I’ve already discussed it in depth in my previous post lessons learned moving from Windows to Ubuntu.

Slack

What it is used for

Slack is a communication platform for teams. Its initial goal of completely replacing email hasn’t been achieved, in my opinion, but Slack has many other advantages. Even if you’re still working alone, keep reading – Slack can be an amazing tool for individuals too.

Slack introduces a new and seamless way to communicate internally with team members, stay on top of milestones, goals, and issues, schedule meetings, and even order lunch.

Rather than having one chat in which the whole team communicates, Slack introduced channels. Channels are rooms in which you can discuss different aspects of your company, venture or project: development, sales, PPC campaigns, UI / UX and much more.

Slack provides you with everything you need to manage a rich conversation with your team members: emojis, image sharing, YouTube videos embedding, and of course, integrations.

Integrations provide you with the ability to connect 3rd party tools into your Slack group. You can either install public tools from Slack’s marketplace or develop your own using Slack API and use it inside your Slack group. Slack integrations allow you to schedule meetings with team members by sending a message, set a recurring reminder, notify when a new user signs up or subscribes, order lunch, entertain the team by reacting to specific messages and so much more.

Slack’s search system is robust. Every message is indexed and therefore it is extremely easy to recover anything said in any channel.

Who should use it

Naturally, Slack is built for teams. But, as a developer who works alone on a side project, I encourage you to open a Slack group for yourself and play around with everything Slack has to offer. You can increase productivity by sending messages to yourself to set reminders and schedule meetings instead of accessing many apps in the browser.

Best practices

  • Investigate the top integrations which Slack’s marketplace has to offer by integrating them into your group.
  • Develop your own integrations by using open source libraries that utilize Slack API. You can push notifications of newly subscribed users to keep your team on top of your company’s milestones. A perfect culture can be built using Slack.
  • Learn Slack’s keyboard shortcuts for increased productivity.
  • Check out BitBucket integration for Slack to notify a specific channel for each push to production.
  • Check out All-in-one messenger tool further in this post to better use Slack on your desktop computer.
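As an illustration of the "develop your own integrations" idea, here is a minimal sketch that prepares a new-subscriber notification for a Slack incoming webhook, which accepts a JSON body with a text field. The webhook URL is a placeholder, the function name is mine, and the code only builds the request without sending anything. It is written for Python 3.

```python
import json
import urllib.request


def build_slack_message(webhook_url, text):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )


# Placeholder URL: a real one comes from your own Slack app settings.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

request = build_slack_message(WEBHOOK_URL, "New subscriber: jane@example.com")
print(request.get_method())
# POST
```

Calling urllib.request.urlopen(request) with a real webhook URL would post the message to the channel tied to that webhook.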

Pricing model

Slack’s pricing model offers a free plan that can serve small teams perfectly, with the ability to search and access only the most recent 10K messages (once you subscribe you can access all of your messages). For Standard and Plus plans you pay per team member and get more integrations, prioritized support and more.

Tip for advanced users

Slack is used not only for internal teams but also for public communities. There are thousands of Slack communities you can join (most of them for free) to discuss product, design, development and much more with people from all over the world. One directory that lists Slack communities is Slack List.

Link to Slack

Trello

What it is used for

Trello is a simple yet wonderful task management (or project management) tool. Trello can be used to manage development workflow and tasks, as well as marketing projects, blogs, online businesses and more.

Trello’s user interface is very simple and minimalistic but has everything you need in order to manage a project with up to 10 team members, including task labeling, attachments, task assignments and task scheduling.

Who should use it

If you are a solo developer running your own side project, Trello can be a perfect match for managing your tasks and workflow. Once you add new team members (up to 10), Trello contains everything you need to keep managing the project efficiently. Notice that Trello might not be enough for projects that grow to more than 10 team members or have many moving parts.

Best practices

  • Use boards for different projects on the same team. You can open a board for marketing, back-end development, front-end development etc.
  • Use different background colors for each board for easier recognition.
  • Keep the left menu open for faster navigation.
  • Assign tasks to team members or watch tasks yourself by dragging and dropping profile pictures from the right menu to a specific task.
  • When starting a project, define your labels by opening a task and clicking on labels. There you can assign labels with titles so you can label your tasks afterward.
  • Use different columns in a board for either listing tasks of different components in your system, or for listing To do, doing and done tasks.

Pricing model

All of the essential features Trello provides can be found in the free plan. For integrations, better security and support check out the Business and Enterprise plans, although in my opinion when scaling up your project you might want to look into different task management solutions.

Tip for advanced users

To see examples of Trello boards and get inspired by them, browse here.

Link to Trello

Redash

What it is used for

Redash is a great open-source tool for visualizing your data in a dedicated dashboard. It provides you with everything you need to give your team the ability to query data, visualize it and share it.

It integrates with all of the most popular data sources including MySQL, PostgreSQL, MongoDB, ElasticSearch and many more.

With Redash you can generate visualizations to track milestones and keep yourself and your team engaged with what is going on with your project.

You can also create alerts that will pro-actively warn you about important changes.

Who should use it

Once you deploy your application to production and start collecting data by pushing it to your databases, you should consider using Redash. It can help you monitor potential issues, track your progress to achieve your milestones, get insights from your data and more.

Best practices

  • Integrate Redash daily metrics with Slack to push them automatically on a daily basis. This way you don’t need to enter your dashboard daily but only stay in your Slack group and engage your team members as well.

Pricing model

Redash is open-source and therefore completely free of charge. If you’d like to get a hosted and managed instance of Redash, you can pick one of the paid plans.

Tip for advanced users

Once you feel something is missing, implement it and contribute to the GitHub repository.

Link to Redash

Zapier

What it is used for

How many times have you told yourself: if we could push the data from Facebook ads to a Google spreadsheet it would be great! And then a few minutes later you find yourself struggling with APIs to get the integration done?

Zapier is a great tool that is worth investigating exactly for this reason. It teaches us, as developers, that we don’t have to rush to implement every integration we want to achieve in our company. Not only that, but the less code and the fewer in-house developments we have in our system, the better.

Zapier moves info between web apps automatically by integrating more than 750 apps. It allows you to create automated processes and workflows with a few clicks of a button that will last forever.

With Zapier you can, for example, push every issue from BitBucket to Slack with a two-minute setup or create Trello cards from Google Form responses.

Who should use it

As developers, we are used to dealing with APIs on a daily basis. I encourage you to check out what Zapier has to offer before you start coding your own integration. It might save you A LOT of time.

If you’re running your own company, consider using Zapier as soon as possible in order to avoid redundant development projects, bugs, and maintenance work.

Best practices

  • Sign up with Zapier today.
  • Check out Zapier examples to get inspired about how broad automation can reach.

Pricing model

Zapier offers a free forever plan with limited 2-step zaps and integrations. The free forever plan is definitely enough for playing around with the tool. Once you’re getting real value from Zapier you can consider one of the paid plans without limitations on the zaps you can automate.

Tip for advanced users

Try to work with Google Sheets as much as possible. It will simplify things for you.

Link to Zapier

Draw.io

What it is used for

Draw.io is a great tool for prototyping, mock-ups and architecture design. It can be used in a wide variety of ways thanks to its template collections, though its main use is designing processes, systems, and views before implementing them with code (or with Photoshop).

Draw.io is an add-on for Google Drive, therefore it exposes all the sharing and collaboration capabilities that Google Drive has. You can seamlessly collaborate with additional team members on designing servers architecture, for example.

Draw.io offers many components for easy insertion into the sketch, ranging from flow charts to Android, Bootstrap or iOS screens.

Who should use it

Draw.io is one of the best sketch tools I know, and it’s completely free. I encourage you to try it during the design stage of your next project.

Pricing model

Draw.io is offered free of charge.

Link to Draw.io

All-in-one messenger

What it is used for

Most of us have more than one channel of communication with our co-workers, friends, and family. Usually, each communication channel, like WhatsApp, Slack or Facebook Messenger, has its own web application which makes it relatively difficult to stay on top of everything.

All-in-one Messenger is an awesome Chrome application for collecting all your communication channels in one place. It enables you to open a new tab for each communication channel and supports all of the most popular ones. They act and feel in the same way and therefore it is easy to operate them.

Who should use it

From individual developers to companies, All-in-one messenger is applicable for everyone who deals with more than one communication channel on a daily basis.

Best practices

  • Although it’s not so clear, you can add more than one tab for the same communication channel. For example, if you’re a member of more than one Slack group, you can add all of them as different tabs and name them accordingly.

Pricing model

All-in-one messenger is free of charge.

Tip for advanced users

If you want to be more productive at work (which I guess you do, otherwise you wouldn’t be reading this post), do yourself a favor and turn off the notifications in the settings tab.

Link to All-in-One Messenger

BitBucket

What it is used for

BitBucket is a hosting service for distributed version control repositories that makes it easy for you to collaborate with your team. BitBucket is owned by Atlassian, which also owns Jira, HipChat, and Trello, all great products for developers as well.

BitBucket, unlike Github, offers private repositories for up to five users for free. BitBucket’s user interface is welcoming and easy to use, and the integrations that BitBucket offers are extremely helpful.

Who should use it

For teams of developers, the usage of a version control system is obvious (hopefully). As a solo developer, I encourage you to use BitBucket to manage your code versions, deploy your app to production, integrate with 3rd party tools for code inspection and more.

Best practices

  • Use BitBucket & Slack integration to push notifications from BitBucket directly to your development channel inside your Slack group.

Pricing model

As mentioned above, BitBucket offers unlimited private code repositories for up to five collaborators. Once you want to scale up your team you’ll need to upgrade your subscription by paying per user per month.

Link to BitBucket

Postman

What it is used for

As web developers, we often deal with creating APIs for exposing our backend code to different clients like front end apps, mobile apps, and 3rd party cooperations. When building APIs or when using APIs owned by different entities, it is sometimes difficult to test, document and monitor them.

Postman is a Chrome application that allows you to easily send HTTP requests to either a local or a remote server with any parameters, headers and authentication settings you need.

Postman, unlike other tools out there, has a wonderful GUI (graphical user interface) for defining your HTTP request and analyzing the response.

Who should use it

Postman suits everyone from individual developers who develop or test their own API to companies that require team collaboration and sharing.

Best practices

  • Keep Postman open when developing web applications, you’ll most likely find it useful.

Pricing model

Postman’s free forever plan offers everything you need as a solo developer working on your own side project. For team collaboration and advanced features see the paid plans.

Link to Postman

Adopting productive habits for your web development workflow is a must. For your own productivity, and for the team you’ll be in charge of soon, try and play around with different tools to figure out your perfect match.

3 best practices for better setting up your Django project

Django is a robust open source Python-based framework for building web applications. Django’s popularity has grown during the last couple of years, and it is already mature and widely used, with a large community behind it. Among other Python-based frameworks for creating web applications (like Flask and Pyramid), Django is by far the most popular. It supports both Python version 2.7 and Python 3.6, but at the time of writing, Python 2.7 is still the more accessible version in terms of community, 3rd party packages, and online documentation. Django is secure when used properly and highly flexible, so it is the way to go when developing server-side applications using Python.

In this article, I will share with you best practices of a Django setup I’ve learned and collected over the recent years. Whether you have a few Django projects under your belt, or you’re just about to start your first Django project from scratch, the collection described here might help you create better applications down the road. The article has been written from a very practical mindset so you can add some tools to your development toolbox immediately, or even create yourself an advanced custom Django boilerplate for your next projects.

* In this article I assume you’re using a Linux Ubuntu machine.

Virtual Environment

While developing Python-based applications, you will constantly be using 3rd party packages. Typically, these packages are updated often, so keeping them organized is essential. When developing more and more different projects on the same local machine, it’s challenging to keep track of the current version of each package, and impossible to use different versions of the same package for different projects. Moreover, updating a package on one project might break functionality on another, and vice versa. That’s where a Python virtual environment comes in handy. To install virtual environment support use:

$ apt-get update
$ apt-get install python-pip python-dev build-essential

$ export LC_ALL="en_US.UTF-8" # might be necessary in case you get an error from the next line

$ pip install --upgrade pip
$ pip install --upgrade virtualenv
$ mkdir ~/.virtualenvs
$ pip install virtualenvwrapper
$ export WORKON_HOME=~/.virtualenvs
$ nano ~/.bashrc

add this line to the end of the file:

. /usr/local/bin/virtualenvwrapper.sh

then execute:

$ . ~/.bashrc

After installing, create a new virtual environment for your project by typing:

$ mkvirtualenv project_name

While you’re in the context of your virtual environment you’ll notice a prefix that is being added to the terminal, like:

(project_name) ofir@playground:~$

In order to deactivate (exit) the virtual environment and get back to the main Python context of your local machine, use:

$ deactivate

In order to activate (start) the virtual environment context, use:

$ workon project_name

To list the virtual environments that exist on your local machine, use:

$ lsvirtualenv

Holding your project dependencies (packages) in a virtual environment on your machine allows you to keep them isolated and use them only for a single project (or for several projects that share the environment). When creating a new virtual environment you’re starting a fresh environment with no packages installed in it. Then you can use, for example:

(project_name) $ pip install Django

for installing Django in your virtual environment, or:

(project_name) $ pip install Django==1.11

for installing version 1.11 of Django accessible only from within the environment.

Neither your main Python interpreter nor the other virtual environments on your machine will be able to access the new Django package you’ve just installed.

In order to use the runserver command using your virtual environment, while in the context of the virtual environment, use:

(project_name) $ cd /path/to/django/project
(project_name) $ ./manage.py runserver

Likewise, when entering the Python interpreter from within the virtual environment by typing:

(project_name) $ python

it will have access to packages you’ve already installed inside the environment.
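If you are ever unsure whether the interpreter you are in belongs to a virtual environment, a quick check along these lines can help. This is a sketch rather than an official API: it relies on the fact that older virtualenv releases set sys.real_prefix, while venv and newer virtualenv releases make sys.base_prefix differ from sys.prefix. The function name is my own.

```python
import sys


def in_virtualenv():
    """Best-effort check for running inside a virtual environment."""
    # Older virtualenv versions add sys.real_prefix to the interpreter.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv: base_prefix points at the system Python.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return base_prefix != sys.prefix


print(in_virtualenv())
```

Run it once after workon project_name and once after deactivate to compare the results.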

Requirements

Requirements are the list of Python packages (dependencies) your project uses while running, including the version of each package. Here’s an example of a requirements.txt file:

dicttoxml==1.7.4
Django==1.11.2
h5py==2.7.0
matplotlib==2.0.2
numpy==1.13.0
Pillow==4.1.1
psycopg2==2.7.1
pyparsing==2.2.0
python-dateutil==2.6.0
pytz==2017.2
six==1.10.0
xmltodict==0.11.0

Keeping your requirements.txt file up to date is essential for collaborating properly with other developers, as well as keeping your production environment properly configured. This file, when included in your code repository, enables you to update all the packages installed in your virtual environment by executing a single line in the terminal, and by that to get new developers up and running in no time. In order to generate a new requirements.txt or to update an existing one, use from within your virtual environment:

(project_name) $ pip freeze > requirements.txt

For your convenience, make sure to execute this command in a folder that is being tracked by your Git repository so other instances of the code will have access to the requirements.txt file as well.

Once a new developer is joining the team, or you want to configure a new environment using the same packages listed in the requirements.txt file, execute in the virtual environment context:

(project_name) $ cd /path/to/requirements/file
(project_name) $ pip install -r requirements.txt

All requirements listed in the file will immediately be installed in your virtual environment. Older versions will be updated and newer versions will be downgraded to fit the exact list of requirements.txt. Be careful though, because there might be differences sometimes between different environments that you still want to respect.

I highly recommend integrating these commands to your work flow: updating the requirements.txt file before pushing code to the repository and installing requirements.txt file after pulling code from the repository.
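Since the whole point is an exact list, it can be handy to spot entries that are not pinned with ==, for example lines added by hand instead of through pip freeze. Here is a small sketch of such a check. The function name and the sample lines (including the unpinned requests entry) are illustrative only, and the check is a plain string test, not a full requirements parser.

```python
def unpinned_requirements(lines):
    """Return requirement entries that lack an exact '==' version pin."""
    unpinned = []
    for line in lines:
        # Drop trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue  # blank line or comment-only line
        if "==" not in entry:
            unpinned.append(entry)
    return unpinned


# Illustrative sample, not the file from this article.
sample = ["Django==1.11.2", "numpy==1.13.0", "requests", "# tooling", "pytz>=2017.2"]
print(unpinned_requirements(sample))
# ['requests', 'pytz>=2017.2']
```

To check a real file, pass it open("requirements.txt") instead of the sample list.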

Better settings.py Configuration

Django comes out-of-the-box with a very basic yet useful settings.py file, which defines the main and most useful configurations for your project. The settings.py file is very straightforward, but as a developer working in a team, or when setting up a production environment, you often need more than one basic settings.py file.

Multiple settings files allow you to easily define tailor-made configurations for each environment separately like:

ALLOWED_HOSTS # for production environment
DEBUG
DATABASES # for different developers on the same team

Let me introduce you to an extended approach for configuring your settings.py file which allows you to easily maintain different versions and use the one you want in any given time and environment in no time.

First, navigate to your settings.py file path:

(project_name) $ cd /path/to/settings/file

Then create a new module called settings (module is a folder containing an __init__.py file):

(project_name) $ mkdir settings

Now, rename your settings.py file to base.py and place it inside the new module you created:

(project_name) $ mv settings.py settings/base.py

For this example, I assume that you want to configure one settings file for your development environment and one for your production environment. You can use the exact same approach for defining different settings files for different developers in the same team.

For your development environment create:

(project_name) $ nano settings/development.py

Then type:

from .base import *

DEBUG = True

and save the file by hitting Ctrl + O, Enter and then Ctrl + X.

For your production environment create:

(project_name) $ nano settings/production.py

and type:

from .base import *

DEBUG = False
ALLOWED_HOSTS = ['app.project_name.com', ]

Now, whenever you want to add or update settings of a specific environment you can easily do it in its own settings file. The last question is: how does Django know which settings file to load in each environment? That’s what the __init__.py file is for. When Django looks for the settings.py file it used to load when running the server, for example, it now finds a settings module rather than a settings.py file. But as long as it’s a module containing an __init__.py file, as far as Django is concerned, it’s the exact same thing. Django will load the __init__.py file and execute whatever is written in it. Therefore, we need to define which settings file we want to load inside the __init__.py file, by executing:

(project_name) $ nano settings/__init__.py

and then, for a production environment, for example, typing:

from .production import *

This way, Django will load all the base.py and production.py settings every time it starts. Magic?
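Not magic, just Python imports. To see the layering without touching a Django project, here is a throwaway sketch that builds the same three-file layout in a temporary folder and imports it. The package name demo_settings and the TIME_ZONE value are placeholders I made up for the demo, and no Django install is needed to run it.

```python
import importlib
import os
import sys
import tempfile
import textwrap

# Build a temporary package that mirrors the settings layout above.
root = tempfile.mkdtemp()
package_dir = os.path.join(root, "demo_settings")  # hypothetical name
os.mkdir(package_dir)

files = {
    "base.py": """
        DEBUG = True
        ALLOWED_HOSTS = []
        TIME_ZONE = "UTC"
    """,
    "production.py": """
        from .base import *

        DEBUG = False
        ALLOWED_HOSTS = ["app.project_name.com"]
    """,
    "__init__.py": """
        from .production import *
    """,
}
for name, body in files.items():
    with open(os.path.join(package_dir, name), "w") as handle:
        handle.write(textwrap.dedent(body))

# Importing the package runs __init__.py, which pulls in production.py,
# which in turn pulls in base.py and then overrides two values.
sys.path.insert(0, root)
settings = importlib.import_module("demo_settings")

print(settings.DEBUG)          # False (overridden in production.py)
print(settings.ALLOWED_HOSTS)  # ['app.project_name.com']
print(settings.TIME_ZONE)      # UTC (inherited from base.py)
```

Values defined after the star import win, which is why each environment file only needs to list what differs from base.py.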

Now, the only configuration left is to keep the __init__.py in your .gitignore file so it will not be included in pushes and pulls. Once you set up a new environment, don’t forget to create a new __init__.py file inside the settings module and import the settings file required exactly like we did before.

In this article we’ve covered three best practices for better setting up your Django project:

  • Working inside a virtual environment
  • Keeping the requirements.txt file up to date and using it continuously in your work flow
  • Setting up a better project settings array.

This is part 1 in the series about best practices for Django development. Follow me to get an immediate update once the next parts are available.

Have you followed these best practices in your last project? Do you have any insights to share? Comments are highly appreciated.