Write a basic headless web scraping "bot" in Python with Beautiful Soup 4

In this article, we will use Beautiful Soup 4 to create a basic headless web scraping "bot" in Python on Alibaba Cloud Elastic Compute Service (ECS) using CentOS 7.

Set up a CentOS instance on Alibaba Cloud

You should already know how to launch an Alibaba Cloud ECS instance running CentOS. If you don't know how to set up an ECS instance, check out [this tutorial](https://www.alibabacloud.com/blog/how-to-set-up-your-first-centos-7-server-on-alibaba-cloud_593743) and configure the server accordingly.

I have deployed a minimal CentOS instance for this lesson; there is no reason for it to be bloated for this project. The project does not use a GUI (Graphical User Interface), so some basic terminal command line knowledge is recommended.

Install Python 3, pip3, and Nano from the terminal

It's always a good idea to bring an instance fully up to date before installing anything. First, update all packages to their latest versions.

sudo yum update

I plan to use Python for the basic web scraping "bot". I'm impressed by the language's relative simplicity and its wide variety of modules. In particular, we will use the Requests and Beautiful Soup 4 modules.

CentOS 7 ships with Python 2.7, so we need to install Python 3 and pip ourselves. First, install IUS (short for Inline with Upstream Stable), a community project that provides up-to-date Red Hat Package Manager (RPM) packages. Then install python36u and its matching pip.

sudo yum install https://centos7.iuscommunity.org/ius-release.rpm
sudo yum install python36u
sudo yum install python36u-pip

Pip is a package management system for installing and managing Python packages, such as those found in the Python Package Index (PyPI). Pip is an alternative to easy_install.

Note that the plain pip package targets the system's Python 2.7, while python36u-pip targets Python 3.6; I've run into trouble in the past after installing pip instead of python36u-pip.
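Once the installation finishes, it's worth verifying that both tools are available (assuming the IUS packages install the binaries as python3.6 and pip3.6, which matched my experience):

python3.6 --version
pip3.6 --version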

Nano is a basic text editor and comes in handy for such applications. Now let's install Nano.

sudo yum install nano

Install Python packages using Pip

Next, install the two Python packages we will use today, Requests and Beautiful Soup 4. These are installed through pip (here the pip3.6 binary provided by python36u-pip):

pip3.6 install requests
pip3.6 install beautifulsoup4

Requests is a Python module that lets you fetch web pages; we will mainly rely on its requests.get method.

Requests allows you to send HTTP/1.1 requests programmatically from Python scripts. You don't have to manually add query strings to your URLs or form-encode POST data, and keep-alive and HTTP connection pooling are 100% automatic. Today we will focus on using requests.get to retrieve the source of a web page.
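As a quick illustration (the URL and parameters below are placeholders chosen for demonstration, not part of the bot we are building), requests.get handles the query-string encoding for you:

import requests

# Requests encodes the params dict into the query string,
# so there is no need to build "?q=python" by hand.
r = requests.get("https://httpbin.org/get", params={"q": "python"})
print(r.status_code)  # e.g. 200 on success
print(r.url)          # final URL, including the encoded query string
print(r.text[:200])   # first 200 characters of the response body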

Beautiful Soup is a Python library for extracting data from HTML and XML files. Paired with your favorite parser, it lets you easily navigate, search, and modify the parse tree.

We will use Beautiful Soup 4 together with Python's standard html.parser to parse and organize the data from the page source retrieved by Requests. In this tutorial, we use Beautiful Soup's prettify method to organize the data in a more human-readable way.
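For a taste of the navigating and searching mentioned above, here is a minimal sketch using a made-up HTML snippet (not a page our bot fetches):

from bs4 import BeautifulSoup

html = '<html><body><p class="intro">Hello</p><a href="/docs">Docs</a></body></html>'
soup = BeautifulSoup(html, "html.parser")

print(soup.p.get_text())       # "Hello" - navigate to the first <p> tag
print(soup.find("a")["href"])  # "/docs" - search for a tag and read an attribute
print(soup.find_all("p", class_="intro"))  # list of all <p class="intro"> tags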

Create a folder called Python_apps. Then change your current working directory to Python_apps.

mkdir Python_apps
cd Python_apps

Write a headless scraping bot in Python

Now for the part we've been waiting for: writing the Python headless scraper bot. The bot uses Requests to go to the URL and get the page source, then Beautiful Soup 4 to parse the HTML and make it semi-readable. After that, it saves the parsed data to a local file on the instance. Let's get down to work.

To save the data, we will use Python's built-in open() and write() methods to write the page data to the local hard drive. Let's go.

Open Nano (or your favorite text editor) in the terminal and create a new file named "bot.py". I find Nano perfectly adequate for basic editing like this.

First, add the imports.

############################################################ IMPORTS
import requests
from bs4 import BeautifulSoup

The code below defines some global variables:

1. Ask the user for a URL with Python's input function
2. Fetch that URL with the requests.get method
3. Save the text of the response in a variable via the response's .text attribute

####### REQUESTS TO GET PAGE : BS4 TO PARSE DATA
#GLOBAL VARS
####### URL FOR SITE TO SCRAPE
url = input("WHAT URL WOULD YOU LIKE TO SCRAPE? ")
####### REQUEST GET METHOD for URL
r = requests.get("http://" + url)
####### DATA FROM REQUESTS.GET
data = r.text
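One caveat with the snippet above: it always prepends "http://", which produces a broken URL if the user types the scheme themselves. A minimal alternative sketch (my refinement, not part of the original script) adds a scheme only when one is missing:

# Hypothetical refinement: prepend a scheme only if the user omitted one.
url = input("WHAT URL WOULD YOU LIKE TO SCRAPE? ").strip()
if not url.startswith(("http://", "https://")):
    url = "http://" + url
r = requests.get(url)
data = r.text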

Now, let's convert the global variable "data" to a BS4 object and format it with the BS4 prettify method.

####### MAKE DATA VAR BS4 OBJECT
source = BeautifulSoup(data, "html.parser")
####### USE BS4 PRETTIFY METHOD ON SOURCE VAR NEW VAR PRETTY_SOURCE
pretty_source = source.prettify()

Let's print these variables to the terminal as well as saving them to the local file. This shows us what data will be written before it is actually written.

print(source)

First we print the source as one large chunk of text. This is very difficult for a human to decipher, so we will rely on Beautiful Soup to help with the formatting: calling the prettify() method organizes the data so it is much easier for humans to read. So let's print the source again after prettify():

print(pretty_source)

After running the code, the terminal should show the prettified HTML source of the page you entered.
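To illustrate what prettify does (with a tiny made-up document, not output from a real site), it indents the tree one space per nesting level and puts each tag and string on its own line:

from bs4 import BeautifulSoup

print(BeautifulSoup("<html><body><p>Hello</p></body></html>", "html.parser").prettify())
# <html>
#  <body>
#   <p>
#    Hello
#   </p>
#  </body>
# </html>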

Now, let's save the data to a local file on the Alibaba Cloud ECS instance. To do this, you first need to open a file in write mode, by passing the string "w" as the second argument to the open() method.

####### OPEN A NEW LOCAL FILE IN WRITE MODE ("W") AS VAR LOCAL_FILE
####### UTF-8 ENCODING AVOIDS ERRORS ON NON-ASCII PAGE CONTENT
local_file = open(url.replace("https://", "").replace("http://", "") + "_scraped.txt", "w", encoding="utf-8")
####### WRITE THE VAR PRETTY_SOURCE TO FILE
local_file.write(pretty_source)
####### CLOSE FILE
local_file.close()

In the code block above, we build the filename by appending "_scraped.txt" to the URL you entered earlier, and open that file. The first argument of the open method is the filename on the local disk; "https://" and "http://" are removed from it first, because characters like ":" and "/" would make the filename invalid. The second argument, "w", opens the file in write mode, and encoding="utf-8" prevents encoding errors when the page contains non-ASCII text.

Then we pass the pretty_source variable as an argument to the write method of local_file, and finally close the file.
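As an aside, the more idiomatic way to write a file in Python is a with block, which closes the file automatically even if an error occurs. A minimal equivalent sketch:

####### SAME SAVE STEP, WRITTEN WITH A CONTEXT MANAGER
filename = url.replace("https://", "").replace("http://", "") + "_scraped.txt"
with open(filename, "w", encoding="utf-8") as local_file:
    local_file.write(pretty_source)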

Let's run the code and see what happens.

python3.6 bot.py

You will be asked to enter the URL you would like to scrape. Let's try https://www.wikipedia.org. The prettified source code of the site is then saved as a .txt file in your local working directory.

The final code for this project looks like this. Note that, unlike the first snippet, this version expects the full URL including the scheme (for example, https://www.wikipedia.org):

print("*" * 30 )
print("""
# 
# SCRIPT TO SCRAPE AND PARSE DATA FROM
# A USER INPUTTED URL. THEN SAVE THE PARSED
# DATA TO THE LOCAL HARD DRIVE.
""")
print("*" * 30 )

############################################################ IMPORTS
import requests
from bs4 import BeautifulSoup

####### REQUESTS TO GET PAGE : BS4 TO PARSE DATA
#GLOBAL VARS
####### URL FOR SITE TO SCRAPE
url = input("ENTER URL TO SCRAPE")

####### REQUEST GET METHOD for URL
r = requests.get(url)

####### DATA FROM REQUESTS.GET
data = r.text

####### MAKE DATA VAR BS4 OBJECT 
source = BeautifulSoup(data, "html.parser")


####### USE BS4 PRETTIFY METHOD ON SOURCE VAR NEW VAR PRETTY_SOURCE
pretty_source = source.prettify()

print(source)

print(pretty_source)

####### OPEN A NEW LOCAL FILE IN WRITE MODE ("W") AS VAR LOCAL_FILE
####### UTF-8 ENCODING AVOIDS ERRORS ON NON-ASCII PAGE CONTENT
local_file = open(url.replace("https://", "").replace("http://", "") + "_scraped.txt", "w", encoding="utf-8")
####### WRITE THE VAR PRETTY_SOURCE TO FILE
local_file.write(pretty_source)
####### CLOSE FILE
local_file.close()

Summary

In this tutorial, you learned how to use Beautiful Soup 4 to build a basic headless web scraping "bot" in Python on an Alibaba Cloud Elastic Compute Service (ECS) instance running CentOS 7. We used Requests to get the source code of a web page, Beautiful Soup 4 to parse the data, and finally saved the scraped page source to a local text file on the instance. The Beautiful Soup 4 prettify method formatted the text for human readability.
