Python Web Scraper for Beginners

Introduction: 

In this article, we will create a simple web scraper: a program that pulls specific pieces of information out of a web page, so you don't have to read through the whole page yourself. It is a useful and versatile program that you can adapt for many different projects.

Who is this Project For?

This is an intermediate-level project for those who are still fairly new to Python. Before starting this project, you should already be comfortable with print statements, for loops, and functions, as these concepts will be used throughout the project. Basic knowledge of HTML is also useful, but not necessary.

What Will We Learn?

This program uses Python’s BeautifulSoup library to extract data from the website specified in the code. The program searches the webpage’s HTML for specified tags or classes and selects that information to display in the program. The concepts learned through creating this program are essential for learning Python, making this the perfect project to practise and apply these skills.
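
To make the idea of searching by tag or class more concrete, here is a minimal sketch that parses a small hand-written HTML string instead of a live web page. The HTML snippet, the class names ‘title’ and ‘note’, and the variable names are made up for illustration; only BeautifulSoup, find_all, and .text come from the library.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration.
sample_html = """
<html>
  <body>
    <h1>Reading List</h1>
    <p class="title">First Book</p>
    <p class="title">Second Book</p>
    <p class="note">Not a title</p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Search by tag name only: every <p> element, whatever its class.
all_paragraphs = [p.text for p in soup.find_all("p")]

# Search by tag name and class: only <p class="title"> elements.
titles = [p.text for p in soup.find_all("p", class_="title")]

print(all_paragraphs)
print(titles)
```

The first search returns the text of all three paragraphs, while the second returns only the two marked with the ‘title’ class. The project uses the same approach on a real page.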

Features to Consider:

  • The program will be provided with a web page to search
  • The program will search the web page and print all of the text that is displayed using the specified tags

Pseudo Code:

Here is the pseudocode for this project:

Import the BeautifulSoup library and requests

Create a page object and set the webpage to gather information from

Create a variable called ‘soup’ and make it gather the HTML information from the page object

Print (“Top books and number of copies sold”)

For link in soup.find_all():

Print a numbered list with the gathered information

Main Steps:

This project can be broken down into 3 main steps: 

  1. Import the libraries
  2. Set up the BeautifulSoup object
  3. Print the output

Step 1: Import the Libraries

The first thing we’ll need to do when creating this program is import the BeautifulSoup library and the requests module. Because these are not a built-in part of the Python programming language, we will need to import them into our project before we can begin using them. We will do this using this code:

from bs4 import BeautifulSoup
import requests

Note: If the libraries are not already installed, you can install them from the terminal with the command ‘pip install beautifulsoup4 requests’.
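
Once the packages are installed, a quick check like the one below can confirm that Python is able to find them. This is only a sketch; the dictionary name ‘versions’ is made up for illustration, and the version numbers printed will depend on your installation.

```python
import bs4
import requests

# Both packages expose their installed version as a string.
versions = {"beautifulsoup4": bs4.__version__, "requests": requests.__version__}

for name, version in versions.items():
    print(f"{name} {version}")
```

If either import fails with a ModuleNotFoundError, that package still needs to be installed.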

Step 2: Set up the BeautifulSoup object

Now that we have imported the libraries, we need to download the web page and create the object that will hold its contents. First, we use requests.get to fetch the page and store the response in a variable called ‘page’. Then we pass the page's content to BeautifulSoup together with the name of the HTML parser, 'html.parser', which reads the page's HTML. The result is stored in ‘soup’, the object we will search in the next step. The code for this section is as follows:

page = requests.get('https://entertainment.howstuffworks.com/arts/literature/21-best-sellers.htm')
soup = BeautifulSoup(page.content, 'html.parser')
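
The two lines above assume the download always succeeds. As a hedged sketch of a more defensive version, the same steps could be wrapped in a small function that stops early if the website returns an error. The function name ‘fetch_soup’ and the 10-second timeout are illustrative choices, not part of the original project.

```python
from bs4 import BeautifulSoup
import requests


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object.

    The function name and the 10-second timeout are illustrative choices.
    """
    # Give up if the server does not respond within `timeout` seconds.
    page = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    page.raise_for_status()
    return BeautifulSoup(page.content, "html.parser")
```

Calling fetch_soup with the article's URL would then give you the same ‘soup’ object as before, or a clear error message if the page could not be loaded.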

Step 3: Print the Output

For the final step in this project, we will have to output all of the information that we have gathered from the website. We will do this by creating a print statement, and using a for-loop to iterate through all of the instances of a specific tag. The program will then print this information, producing a list of the desired information.

print("Top Books and Number of Copies Sold:\n")
i = 1
for link in soup.find_all('a', {"class": "text-lighter-gray hover:text-green"}):
    print(str(i) + ". ", end="")
    i += 1
    print(link.text, end="")
    print(" copies sold")

As you can see, this section of code begins by printing a title for the list and creating a variable, i, which will be used to number the list. Then the loop goes through every <a> element on the page whose class matches the one in this tag:

<a class="text-lighter-gray hover:text-green" href="#pt1"></a>

You can find which tag to use by right-clicking the item you want on the web page, choosing ‘Inspect’ (or ‘Inspect Element’) in your browser, and looking at the HTML that gets highlighted.
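
One detail worth knowing when copying a class from the inspector: according to the Beautiful Soup documentation, searching with the full class string only matches when the classes appear in exactly that order, while searching for a single class name matches any element that has it. The sketch below shows the difference on a made-up HTML string; the link text and the second link are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two links with the same classes written in a different order.
sample_html = """
<a class="text-lighter-gray hover:text-green" href="#pt1">First link</a>
<a class="hover:text-green text-lighter-gray" href="#pt2">Second link</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# The full class string matches only when the attribute is written exactly this way.
exact = soup.find_all("a", {"class": "text-lighter-gray hover:text-green"})

# A single class name matches any <a> that has that class among its classes.
single = soup.find_all("a", class_="text-lighter-gray")

print([link.text for link in exact])
print([link.text for link in single])
```

If the website ever changes the order of its classes, searching for one distinctive class name may keep working where the full string would not.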

Finally, for each element it finds, the program will print the item's number, increment i by 1, print the element's text, and add the words “copies sold” to the end of the line.

Note: Some of the print statements we have used include end="" as a last argument. Normally, print finishes by starting a new line. With end="", it does not, so the output of the next print statement appears on the same line as the previous one.
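
As an alternative sketch, the same line can be built in a single step with an f-string, and Python's built-in enumerate can supply the counter. The function name ‘format_entry’ and the sample entries below are made up for illustration; in the real program the text would come from link.text.

```python
def format_entry(number, text):
    """Build one list line, e.g. '1. Some Book copies sold'."""
    return f"{number}. {text} copies sold"


# Made-up sample entries standing in for the link.text values from the page.
sample_entries = ["Book A: 10 million", "Book B: 5 million"]

# enumerate(..., start=1) supplies the counter, so no separate i variable is needed.
lines = [format_entry(i, text) for i, text in enumerate(sample_entries, start=1)]

for line in lines:
    print(line)
```

Both versions produce the same output; the original uses several print statements with end="", while this one builds each line first and prints it once.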

Project Complete!

Now the project is complete! We hope you’ve had fun creating this simple and versatile web scraper program, and hopefully, you’ve learned more about programming with Python! Make sure to test your code now and see how it works. If you’re stuck or have any issues with your code, try reviewing it again either in your text editor or by looking at the code included in the article as a reference.

Geek Team

Geekedu is an expert in Coding and Math learning. Our goal is to inspire and empower youth to use their knowledge of technology to become the influencers, inventors and innovators of the future.
