In this article, we will use Beautiful Soup 4 to create a basic headless web scraping "bot" in Python on Alibaba Cloud Elastic Compute Service (ECS) using CentOS 7.
You should be familiar with launching an Alibaba Cloud instance running CentOS. If you don't know how to set up an ECS instance, check out [this tutorial](https://www.alibabacloud.com/blog/how-to-set-up-your-first-centos-7-server-on-alibaba-cloud_593743) and configure the server accordingly.
I have deployed a CentOS instance for this lesson; a minimal, unbloated setup is best for this project. The project does not use a GUI (Graphical User Interface), so some basic terminal command line knowledge is recommended.
It's always a good idea to start by updating the instance. First, update all packages to the latest version:
sudo yum update
We will use Python to write the basic web scraping bot. The language is relatively simple, and its wide variety of modules is impressive; in particular, we will use the Requests and Beautiful Soup 4 modules.
CentOS 7 ships with Python 2.7 by default, so we need to install Python 3 and Pip ourselves. First, install IUS (short for Inline with Upstream Stable), a community project that provides Red Hat Package Manager (RPM) packages. Next, install python36u and pip:
sudo yum install https://centos7.iuscommunity.org/ius-release.rpm
sudo yum install python36u
sudo yum install python36u-pip
Pip is a package management system for installing and managing software packages such as those found in the Python Package Index. Pip is an alternative to easy_install.
Note that the plain pip package targets Python 2.7, while python36u-pip targets Python 3; I've had trouble in the past after installing pip instead of python36u-pip.
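To confirm that everything installed correctly, check the versions; the binary names below assume the IUS packages installed above:
python3.6 -V
pip3.6 -V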
Nano is a basic text editor that comes in handy for jobs like this. Let's install Nano now:
sudo yum install nano
Next, install the Python packages we will use today, Requests and Beautiful Soup 4. Both are installed through pip:
pip3.6 install requests
pip3.6 install beautifulsoup4
Requests is a Python module that lets you fetch web pages with its .get method.
Requests allows you to send HTTP/1.1 requests programmatically from Python scripts. There is no need to manually add query strings to your URLs or form-encode POST data, and keep-alive and HTTP connection pooling are 100% automatic. Today we will focus on the Requests .get method to fetch the source of a web page.
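As a quick, minimal sketch of the .get method (this assumes outbound network access from your instance, and example.com is just a placeholder site):
import requests

r = requests.get("https://example.com")
print(r.status_code)  # 200 means the request succeeded
print(r.text[:200])   # first 200 characters of the raw HTML source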
Beautiful Soup is a Python library for pulling data out of HTML and XML files. Working with your favorite parser, it makes it easy to navigate, search, and modify the parse tree.
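For instance, here is a minimal, self-contained sketch of searching a parse tree with the find_all method, using a hard-coded HTML snippet for illustration:
from bs4 import BeautifulSoup

html = '<a href="https://example.com">One</a><a href="https://example.org">Two</a>'
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a"):  # find every anchor tag in the tree
    print(link.get("href"))      # print each tag's href attribute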
We will use Beautiful Soup 4 with Python's standard html.parser to parse and organize the data from the web page source retrieved by Requests. In this tutorial, we use the Beautiful Soup prettify() method to organize the data in a more human-readable way.
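Here is a minimal sketch of what prettify() does, again with a hard-coded snippet rather than a live page:
from bs4 import BeautifulSoup

html = "<html><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())  # prints the document tree indented, one tag per line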
Create a folder called Python_apps. Then change your current working directory to Python_apps.
mkdir Python_apps
cd Python_apps
Now for the part we have been waiting for: writing the Python headless scraper bot. We will use Requests to go to the URL and get the page source, then use Beautiful Soup 4 to parse the HTML source and make it semi-readable. Finally, we will use Python's open() and write() methods to save the parsed page data to a local file on the instance. Let's get to work.
Open Nano or your favorite text editor in a terminal and create a new file named "bot.py". I find Nano perfectly adequate for basic text editing like this.
First, add the imports.
############################################################ IMPORTS
import requests
from bs4 import BeautifulSoup
The code below defines some global variables and performs three steps:
1. Ask the user to input a URL.
2. Use the requests.get method to fetch the entered URL.
3. Save the text data of the response to a variable via the .text attribute.
####### REQUESTS TO GET PAGE : BS4 TO PARSE DATA
#GLOBAL VARS
####### URL FOR SITE TO SCRAPE
url = input("WHAT URL WOULD YOU LIKE TO SCRAPE? ")
####### REQUEST GET METHOD for URL
r = requests.get("http://" + url)
####### DATA FROM REQUESTS.GET
data = r.text
Now, let's convert the global variable "data" to a BS4 object and format it with the BS4 prettify method.
####### MAKE DATA VAR BS4 OBJECT
source = BeautifulSoup(data, "html.parser")
####### USE BS4 PRETTIFY METHOD ON SOURCE VAR NEW VAR PRETTY_SOURCE
pretty_source = source.prettify()
Let's print these variables to the terminal so we can see exactly what data will be written to the local file before we actually write it.
print(source)
The first print outputs the source as one large chunk of text, which is very difficult for a human to decipher, so we rely on Beautiful Soup to help with the formatting. Calling the prettify() method organizes the data and makes it much easier to read. Then we print the source again after the BS4 prettify() method has been applied:
print(pretty_source)
After running the code up to this point, the terminal should show the prettified HTML source of the page you entered.
Now, let's save the data to a file on the local hard drive of the Alibaba Cloud ECS instance. To do this, we first open a file in write mode by passing the string "w" as the second argument to the open() method.
####### OPEN SOURCE IN WRITE MODE WITH "W" TO VAR LOCAL_FILE
####### MAKE A NEW FILE
local_file = open(url.replace("https://", "").replace("http://", "") + "_scraped.txt", "w")
####### WRITE THE VAR PRETTY_SOURCE TO FILE
local_file.write(pretty_source)
### GET RID OF ENCODING ISSUES ##########################################
#local_file = open(url.replace("https://", "").replace("http://", "") + "_scraped.txt", "w", encoding="utf-8")
####### CLOSE FILE
local_file.close()
In the code block above, we create a file whose name is the URL you entered earlier concatenated with "_scraped.txt", and open it for writing. The first argument of the open() method is the filename on the local disk; "https://" and "http://" are removed from it, because leaving them in would make the filename invalid. The second argument, "w", opens the file with write permission.
Then we pass the "pretty_source" variable as an argument to the .write method of "local_file". If the text triggers Unicode encoding errors when written to the local file, use the commented-out line, which opens the file with an explicit UTF-8 encoding instead. Finally, we close the local text file.
Let's run the code and see what happens.
python3.6 bot.py
You will be asked to enter the URL to scrape. Let's try https://www.wikipedia.org. The prettified source code of that website is now saved as a .txt file in your local working directory.
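You can verify the result from the terminal; the filename below assumes the Wikipedia example above:
ls
cat www.wikipedia.org_scraped.txt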
The final code for this project looks like this:
print("*" * 30 )
print("""
#
# SCRIPT TO SCRAPE AND PARSE DATA FROM
# A USER INPUTTED URL. THEN SAVE THE PARSED
# DATA TO THE LOCAL HARD DRIVE.
""")
print("*" * 30 )
############################################################ IMPORTS
import requests
from bs4 import BeautifulSoup
####### REQUESTS TO GET PAGE : BS4 TO PARSE DATA
#GLOBAL VARS
####### URL FOR SITE TO SCRAPE
url = input("ENTER URL TO SCRAPE")
####### REQUEST GET METHOD for URL
r = requests.get(url)
####### DATA FROM REQUESTS.GET
data = r.text
####### MAKE DATA VAR BS4 OBJECT
source = BeautifulSoup(data, "html.parser")
####### USE BS4 PRETTIFY METHOD ON SOURCE VAR NEW VAR PRETTY_SOURCE
pretty_source = source.prettify()
print(source)
print(pretty_source)
####### OPEN SOURCE IN WRITE MODE WITH "W" TO VAR LOCAL_FILE
####### MAKE A NEW FILE
local_file = open(url.replace("https://", "").replace("http://", "") + "_scraped.txt", "w")
####### WRITE THE VAR PRETTY_SOURCE TO FILE
local_file.write(pretty_source)
####### USE THIS OPEN CALL INSTEAD IF YOU HIT ENCODING ISSUES
#local_file = open(url.replace("https://", "").replace("http://", "") + "_scraped.txt", "w", encoding="utf-8")
####### CLOSE FILE
local_file.close()
You learned how to use Beautiful Soup 4 to build a basic headless web scraping bot in Python on an Alibaba Cloud Elastic Compute Service (ECS) instance running CentOS 7. We used Requests to fetch the source code of a web page, used Beautiful Soup 4 to parse the data, and finally saved the scraped page source to a local text file on the instance, with the Beautiful Soup 4 prettify() method formatting the text for human readability.