A Step-by-Step Guide to Web Scraping Using Python's Beautiful Soup

One of the key sources of data for both organizations and researchers is the internet. While one could copy and paste relevant information from web pages into Excel sheets, CSV files, or TXT files, that isn't feasible when larger data sets are required. So how do data scientists collect larger data sets from the internet?

Collecting and cleaning data is one of the most frequent and time-consuming tasks that data scientists perform.

Web Scraping

While collecting web data by hand can be challenging, many automated solutions have emerged thanks to open-source contributions from software developers.

The technical term for this is web scraping or web extraction. With automated scraping tools, data scientists can retrieve hundreds, thousands, or even millions of data points. This saves time and also improves accuracy, because it avoids the clerical errors of manual copying.

Disclaimer: Although web scraping is becoming an essential part of data acquisition, we must guard against piracy and unauthorized commercial use of the extracted data.

In this article, we shall explore data scraping with the help of Beautiful Soup.

What is Beautiful Soup?

Since 2004, Beautiful Soup has been helping programmers collect data from web pages with a few lines of code. It is one of the most widely used Python libraries for web scraping.

As its website puts it, Beautiful Soup parses anything we give it. Most commonly, it is used to extract data from HTML or XML documents. It is simple and easy to use, and it hides away many of the cumbersome complications that would arise if one were to work with the raw markup directly, which makes it a premier tool among data aficionados for scraping the web. Its methods do, however, require some conceptual understanding.
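To make this concrete, here is a minimal sketch of Beautiful Soup parsing a small HTML string. The markup, the example.com links, and the variable names are made up for illustration, and Python's built-in html.parser is used so that nothing beyond the beautifulsoup4 package is needed.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a real web page.
html_doc = (
    "<html><head><title>Demo page</title></head><body>"
    "<p class='intro'>Hello, <b>soup</b>!</p>"
    "<a href='https://example.com/a'>A</a>"
    "<a href='https://example.com/b'>B</a>"
    "</body></html>"
)

# Build the parse tree with the parser that ships with Python.
soup = BeautifulSoup(html_doc, "html.parser")

# Navigate by tag name, search by tag and class, and collect attributes.
page_title = soup.title.string
intro_text = soup.find("p", class_="intro").get_text()
links = [a["href"] for a in soup.find_all("a")]

print(page_title)
print(intro_text)
print(links)
```

The same three ideas (navigating by tag, searching by class, and reading attributes) are what we rely on when scraping a real table.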

Essential Steps Before Scraping

Let’s scrape an HTML table from Wikipedia. You will need a basic understanding of the HTML DOM and Python.

Step 1

The first step of web scraping is to decide which web page and which table we want to scrape. As a research scientist, I will use Wikipedia's list of countries by cancer rate as the example (https://en.wikipedia.org/wiki/List_of_countries_by_cancer_rate).
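Once the page is chosen, its address can be kept in one place and turned into a request. The sketch below uses only Python's standard library. The helper names build_request and fetch_html and the user-agent string are assumptions made for this illustration, not part of any library, and the download itself is left commented out because it needs network access.

```python
from urllib.request import Request, urlopen

# The Wikipedia page chosen in Step 1.
CANCER_RATE_URL = "https://en.wikipedia.org/wiki/List_of_countries_by_cancer_rate"


def build_request(url, user_agent="bs4-tutorial-sketch/0.1"):
    """Prepare a GET request that identifies the script to the server.

    The user-agent value is a placeholder chosen for this sketch.
    """
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page and return its HTML as text (needs network access)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


request = build_request(CANCER_RATE_URL)
print(request.full_url)
# html = fetch_html(CANCER_RATE_URL)  # uncomment to download the page
```

The returned text is what a parser such as Beautiful Soup would receive as input.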

Step 2

Now we know which web page to scrape, but do we have all the tools we need? As mentioned earlier, we are using Python. I will use Google Colab as the Python platform. Google Colab comes with all the required Python libraries, so you do not have to worry about any installations. Other popular options are Jupyter Notebook and Spyder, both available through Anaconda. Anaconda also ships with many Python libraries, including Beautiful Soup, although a few libraries may still need to be installed separately. If you are using any other Python environment, you will have to install all the packages required for web scraping or any other analysis yourself.

Beautiful Soup does not parse HTML by itself; it relies on a separate parser to do so. It supports three main parsers:

Python's built-in parser (html.parser),
lxml, and
html5lib.

We have to install lxml and html5lib, as they are not part of Python's standard library. However, if you are using Google Colab or Anaconda, you do not have to install them. lxml is the only one of the three that can also parse XML, and the Beautiful Soup documentation recommends it for speed. Hence, we will use lxml for parsing.
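If you are unsure which of these parsers is available in your environment, a small check like the one below can pick one for you. This is only a sketch: the function name pick_parser and the preference order are my own assumptions, and the check simply asks Python whether each module can be found.

```python
from importlib.util import find_spec

# Parser names as Beautiful Soup expects them, in this sketch's preferred order.
PREFERRED_PARSERS = ["lxml", "html5lib", "html.parser"]


def pick_parser(preferred=PREFERRED_PARSERS):
    """Return the first parser name whose module is importable.

    Falls back to Python's built-in html.parser, which is always present.
    """
    for name in preferred:
        if find_spec(name) is not None:
            return name
    return "html.parser"


parser_name = pick_parser()
print(parser_name)
```

The returned name can be passed as the second argument when creating a BeautifulSoup object, for example BeautifulSoup(html, parser_name).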

Installation

Although installing packages is straightforward, the details depend on the operating system you are using, such as Windows, macOS, or Unix. Let’s see how to install the packages required for web scraping with Beautiful Soup on Windows. Skip this step if you are using Google Colab or Anaconda. The commands required to install the packages are given below:

Installing the Beautiful Soup library
pip install beautifulsoup4
Installing the lxml library
pip install lxml
Installing pandas
pip install pandas
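After running the commands, it can be worth confirming that the packages are actually visible to Python. The following sketch uses the standard library's importlib.metadata (available from Python 3.8); the function name installed_versions is hypothetical and chosen only for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# The three packages installed with pip in this guide.
REQUIRED = ["beautifulsoup4", "lxml", "pandas"]


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

Any package reported as not installed can be added with the matching pip command.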

Step 3

All required libraries are now installed. Before moving ahead, let’s check the table we want to scrape. First, right-click on the table; the browser will show an ‘Inspect’ option.

The next step is to click the Inspect option. It will open the HTML document of that specific web page.

Once the HTML document is open, we have to check all the information regarding the table we are going to scrape.

Here, the class of the table is ‘sortable wikitable’, highlighted in blue in the HTML document. When we place the cursor on that table element in the HTML document, the page should highlight the table (and only the table) we want to scrape. If it does not highlight the table, we are looking at the wrong element. Once we know the class of our table, we can start scraping.
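The same check can be done in code. The sketch below lists the class of every table in a small, made-up HTML sample and then picks out the one carrying both the wikitable and sortable classes. The sample markup, the ‘infobox’ table, and the country name are invented for illustration; on the real page the string would come from the downloaded HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page source; the real page is much larger.
SAMPLE_HTML = """
<html><body>
  <table class="infobox"><tr><td>Sidebar</td></tr></table>
  <table class="sortable wikitable">
    <tr><th>Rank</th><th>Country</th></tr>
    <tr><td>1</td><td>Exampleland</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# List every table's classes, much as the Inspect panel shows them.
table_classes = [table.get("class", []) for table in soup.find_all("table")]
print(table_classes)

# A CSS selector matches the classes in any order, so this works whether the
# attribute reads "sortable wikitable" or "wikitable sortable".
target = soup.select_one("table.wikitable.sortable")
header_cells = [th.get_text(strip=True) for th in target.find_all("th")]
print(header_cells)
```

If the header cells printed here do not match the table seen in the browser, the selector is pointing at the wrong table.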

Important Steps for Web Scraping

Here, I am going to show you the steps required to scrape this table. Tables on different web pages may pose different challenges. Hence, a good knowledge of HTML, Beautiful Soup, and Python is essential. More information about Beautiful Soup is available in its documentation, linked below.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Let’s move on and scrape the first table of the Wikipedia page.