In this article, we will create a simple web scraper: a program that gathers the specific information you want from a web page, so you don’t have to read through the entire page yourself. It is a useful and versatile program that can be adapted for many different projects.
Who is this Project For?
This is an intermediate-level project for those who are new to Python. Before starting, you should already have experience with print statements, for loops, and functions, as these concepts are used throughout the project. Basic knowledge of HTML is also useful, but not necessary.
What Will We Learn?
This program uses the BeautifulSoup library for Python to extract data from the website specified in the code. The program searches the web page’s HTML for specified tags or classes and selects the matching content to display. Along the way we will practise importing libraries, creating objects, looping over a collection, and formatting printed output, which makes this a good project for applying those skills.
Features to Consider:
- The program will be provided with a web page to search
- The program will search the web page and print all of the text contained in the specified tags
Here is the pseudocode for this project:
Import the BeautifulSoup library and requests
Create a page object and set the webpage to gather information from
Create a variable called ‘soup’ and make it gather the HTML information from the page object
Print (“Top books and number of copies sold”)
For link in soup.find_all():
Print a numbered list with the gathered information
This project can be broken down into 3 main steps:
- Import the libraries
- Set up the BeautifulSoup object
- Print the output
Step 1: Import the Libraries
The first thing we’ll need to do when creating this program is import the BeautifulSoup library and the requests module. Because neither is a built-in part of the Python programming language, each must be installed and then imported into our project before we can begin using it. Both imports go at the top of the program.
Note: You may need to use the terminal to install both packages first, which can be done using the command ‘pip install beautifulsoup4 requests’
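Here is a minimal sketch of the imports for this step, assuming both packages are already installed. BeautifulSoup is published under the package name beautifulsoup4 and is imported from the module bs4. The final print line is only a quick check that the import worked and is not part of the scraper itself.

```python
# Third-party packages: install them first if either import fails.
import requests                  # downloads the web page
from bs4 import BeautifulSoup    # parses the downloaded HTML

# Quick sanity check that the library was found.
print(BeautifulSoup.__name__)
```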
Step 2: Set up the BeautifulSoup object
Now that we have imported the libraries, we need to create an object that will gather and store the information for the program. We will call this object ‘page’ and start by defining which website it draws information from. We will then set up the HTML parser, which parses the HTML of the web page, and store the parsed result in a variable called ‘soup’.
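The setup described above might look something like the sketch below. The URL is a hypothetical placeholder, and the helper name fetch_soup, the timeout value, and the error check are our own assumptions rather than requirements. The last two lines parse a small made-up HTML string so that you can try the parser without a network connection.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical placeholder: replace it with the page you want to scrape.
URL = "https://example.com/"


def fetch_soup(url):
    """Download a page and return a BeautifulSoup object for its HTML."""
    page = requests.get(url, timeout=10)   # 'page' holds the server's response
    page.raise_for_status()                # stop early on an HTTP error status
    return BeautifulSoup(page.content, "html.parser")


# The same parser also accepts a plain HTML string (made-up sample here).
soup = BeautifulSoup("<html><head><title>Demo page</title></head></html>", "html.parser")
print(soup.title.string)
```

To scrape a real page, you would call soup = fetch_soup(URL) instead of parsing the sample string.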
Step 3: Print the Output
For the final step in this project, we will output all of the information that we have gathered from the website. We will do this with print statements and a for loop that iterates through every instance of a specific tag. The program prints each one in turn, producing a list of the desired information.
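One way this step could be written is sketched below. The sample HTML, the book names, and the sales figures are made up, and the tag name "li" is an assumption: the right tag depends on the page you are scraping. The function name print_book_list is also our own choice for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the real page.
SAMPLE_HTML = """
<ul>
  <li>Example Book One: 120 million</li>
  <li>Example Book Two: 85 million</li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def print_book_list(soup, tag="li"):
    """Print a numbered list of the text inside every `tag` element."""
    print("Top books and number of copies sold")
    i = 1  # counter used to number the list
    for item in soup.find_all(tag):
        print(str(i) + ". ", end="")               # the item's number, no line break
        i += 1
        print(item.get_text(strip=True), end="")   # the scraped text, same line
        print(" copies sold")                      # finishes the line


print_book_list(soup)
```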
The code for this step begins by printing a title for the list and creating a variable, i, which will be used to number the list. Then the code iterates through the page, looking for every element with the specified tag.
You can find which tag to use by opening your browser’s Inspect Element tool and selecting the part of the page you would like to use.
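Once you have found a tag or class name in the inspector, you can pass it to find_all. The sketch below uses made-up markup, so the tag names and class names are illustrative only and will differ on a real page.

```python
from bs4 import BeautifulSoup

# Made-up markup: the tags and classes here are illustrative only.
html = """
<div>
  <h2 class="book-title">Example Book One</h2>
  <p class="sales">120 million</p>
  <h2 class="book-title">Example Book Two</h2>
  <p class="sales">85 million</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Search by tag name only.
titles = [tag.get_text() for tag in soup.find_all("h2")]

# Search by tag name and class (class_ avoids Python's 'class' keyword).
sales = [tag.get_text() for tag in soup.find_all("p", class_="sales")]

print(titles)
print(sales)
```

Searching by class is helpful when the same tag is used for many unrelated parts of a page.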
Finally, for each element, the program prints the item’s number, increments i by 1, prints the gathered information, and adds the words “copies sold” to the end of the list entry.
Note: Some of the print statements we have used include end="" as a final argument. By default, print finishes its output by starting a new line. Setting end to an empty string removes that line break, so the output of the following print statement is added to the same line as the previous one.
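To see the effect of end="" on its own, here is a small sketch with made-up text. It captures the printed output in a string, purely so that the line break becomes visible when the result is displayed with repr.

```python
import contextlib
import io

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print("1. ", end="")                # no line break, so the next print continues
    print("Example Book One", end="")   # still on the same line
    print(" copies sold")               # default ending finishes the line

output = buffer.getvalue()
print(repr(output))
```

All three print statements produce a single line of text, with one line break at the very end.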
Now the project is complete! We hope you’ve had fun creating this simple and versatile web scraper program, and hopefully, you’ve learned more about programming with Python! Make sure to test your code now and see how it works. If you’re stuck or have any issues with your code, try reviewing it again either in your text editor or by looking at the code included in the article as a reference.