[PYTHON] [EC2] Introduction to scraping using selenium (text extraction and screen capture)


This article summarizes the steps for extracting an element from a specified URL using Python and selenium on EC2.

Things to do

- Install ChromeDriver
- Install Chrome
- Install selenium
- Install Japanese fonts
- Extract the text of the specified URL (text.py)
- Get a screen capture of the specified URL (capture.py)

Prerequisites

- You are connected to an EC2 instance using ssh.
- Python3 is already installed.

Related articles: "How to connect to an EC2 instance using ssh" and "How to build a python3 environment on EC2".

## 1. ChromeDriver installation

① Go to the official ChromeDriver page and open the download page for the version you want.

② Copy the link address for linux64.

③ Download and unzip.

```
# Move to the tmp directory
$ cd /tmp/

# Download chromedriver (use the URL you copied)
$ wget https://chromedriver.storage.googleapis.com/83.0.4103.39/chromedriver_linux64.zip

# Unzip
$ unzip chromedriver_linux64.zip

# Move the unzipped file to /usr/bin
$ sudo mv chromedriver /usr/bin/chromedriver
```
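To confirm that the driver ended up somewhere on `PATH`, a small check like the following may help. It is only a sketch: the helper name `chromedriver_version` is made up for this example, and it assumes the binary answers `--version` with a single line of text.

```python
import shutil
import subprocess


def chromedriver_version(binary="chromedriver"):
    """Return the version line printed by the driver, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    # The driver is expected to print one line such as "ChromeDriver 83.0.4103.39 (...)".
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(chromedriver_version() or "chromedriver was not found on PATH")
```

If this prints the "not found" message, check that the `mv` step above put the file in `/usr/bin`.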

## 2. Chrome installation

```
# Install Chrome with a single command
$ curl https://intoli.com/install-google-chrome.sh | bash

Complete!   <- installation succeeded
Successfully installed Google Chrome!

# Rename the file
$ sudo mv /usr/bin/google-chrome-stable /usr/bin/google-chrome

# Check the version
$ google-chrome --version && which google-chrome

Google Chrome 83.0.4103.61   <- output of --version
/usr/bin/google-chrome       <- output of which
```
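ChromeDriver generally has to match the major version of Chrome, so it is worth comparing the two version strings. The following is a minimal sketch; the function names are hypothetical, and the sample strings are the Chrome output shown above and the driver version from the download URL in step 1.

```python
import re


def major_version(version_output):
    """Extract the major version from text such as 'Google Chrome 83.0.4103.61'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """True when Chrome and ChromeDriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


if __name__ == "__main__":
    chrome = "Google Chrome 83.0.4103.61"
    driver = "ChromeDriver 83.0.4103.39"
    print(versions_match(chrome, driver))
```

Here both strings start with 83, so the pair is treated as compatible even though the build numbers differ.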


## 3. selenium installation

```
$ pip3 install selenium
```
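The selenium API differs between major versions, so it can be useful to check which version pip actually installed. A minimal sketch, where `installed_version` is a name invented for this example:

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__, 'unknown' if it has none, or None if not importable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    print("selenium:", installed_version("selenium") or "not installed")
```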

## 4. Japanese font installation

```
$ sudo yum install ipa-gothic-fonts ipa-mincho-fonts ipa-pgothic-fonts ipa-pmincho-fonts
```

If you skip this step, Japanese characters will be garbled in the screen capture.

(Figure: example of garbled characters)
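One way to check that the fonts are visible to the system is to ask fontconfig for fonts that cover Japanese. This sketch assumes the `fc-list` command from fontconfig is available on the instance, which may not be the case everywhere; `japanese_fonts` is a hypothetical helper name.

```python
import shutil
import subprocess


def japanese_fonts(fc_list="fc-list"):
    """Return the font entries fontconfig reports for Japanese; [] if fc-list is missing."""
    path = shutil.which(fc_list)
    if path is None:
        return []
    # ":lang=ja" is a fontconfig pattern that selects fonts supporting Japanese.
    result = subprocess.run(
        [path, ":lang=ja"], capture_output=True, text=True, check=False
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    fonts = japanese_fonts()
    print(f"{len(fonts)} Japanese font entries found")
```

An empty result suggests either that the fonts are not installed or that `fc-list` itself is missing.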


## 5. Extract the text of the specified URL (text.py)

① Create a text.py file in your home directory

```
$ cd ~
$ touch text.py
$ vi text.py
```

② The vim editor starts, so copy and paste the following.

- Press the "i" key to enter insert mode.
- Paste with "shift + ins" (or right-click and select paste).

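```python
# -*- coding: utf-8 -*-

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
# Run Chrome without a display (the headless property is gone in recent Selenium 4 releases)
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)

# Specify the URL
driver.get("https://www.google.co.jp/")

# Specify the element to scrape
# (find_element_by_id was removed in Selenium 4; find_element(By.ID, ...) works in 3 and 4)
element_text = driver.find_element(By.ID, "hptl").text

print(element_text)

driver.quit()
```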

③ After pasting, save and exit vim: press esc, type :wq, then press Enter.

④ Run the file you created

```
$ python3 text.py

# Success if the following is displayed
About Google Store
```

This completes scraping the text at the top right of the Google top page.
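When the target element is rendered a little after the page loads, or when the lookup fails, the browser process can be left running on the instance. The following variation is a sketch under those assumptions: it waits for the element with `WebDriverWait` and always calls `quit()`. The names `chrome_arguments` and `fetch_text` are made up for this example, and the 10 second timeout is an arbitrary choice.

```python
def chrome_arguments(headless=True, window_size=None):
    """Build the list of Chrome command-line arguments used by this sketch."""
    args = []
    if headless:
        args.append("--headless")
    if window_size is not None:
        width, height = window_size
        args.append(f"--window-size={width},{height}")
    return args


def fetch_text(url, element_id, timeout=10):
    """Open url, wait until the element with element_id exists, and return its text."""
    # Imported lazily so chrome_arguments() can be used without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    for arg in chrome_arguments():
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Raises TimeoutException if the element does not appear within timeout seconds.
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, element_id))
        )
        return element.text
    finally:
        # Always close the browser, even when the lookup fails.
        driver.quit()
```

For the page used in this article, the call would be `fetch_text("https://www.google.co.jp/", "hptl")`.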



## 6. Get a screen capture of the specified URL (capture.py)

① Create a capture.py file in your home directory

```
$ cd ~
$ touch capture.py
$ vi capture.py
```

② The vim editor starts, so copy and paste the following.

- Press the "i" key to enter insert mode.
- Paste with "shift + ins" (or right-click and select paste).

```python
# -*- coding: utf-8 -*-

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Run Chrome without a display (the headless property is gone in recent Selenium 4 releases)
options.add_argument("--headless")

# Specify the screen size to capture
options.add_argument("--window-size=1280,1024")

driver = webdriver.Chrome(options=options)

# Specify the URL
driver.get("https://www.google.co.jp/")

# Specify the capture file name and extension
driver.save_screenshot("googletop.png")

driver.quit()
```

③ After pasting, save and exit vim: press esc, type :wq, then press Enter.

④ Run the file you created

```
$ python3 capture.py

# Success if the following file is created in the same directory
$ ls
googletop.png
```
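Since the image cannot be opened directly over ssh, one way to sanity-check it is to read the PNG header. This is a minimal sketch that uses only the standard library; `png_size` is a hypothetical helper, and it reads just the signature and the IHDR chunk rather than validating the whole file.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data):
    """Return (width, height) from the IHDR chunk of PNG bytes, or None if not a PNG."""
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR", then width and height.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", data[16:24])
    return width, height


if __name__ == "__main__":
    try:
        with open("googletop.png", "rb") as f:
            print(png_size(f.read(24)))
    except FileNotFoundError:
        print("googletop.png not found; run capture.py first")
```

The reported size should be close to the requested window size, although the exact values depend on how Chrome sizes the viewport.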

As you can see, scraping is relatively simple. From here, change the URL and the elements to extract to suit your own needs.
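To change the URL and element without editing the file each time, the two scripts could be combined behind command-line arguments. This is one possible sketch, not the article's original code: the option names `--element-id` and `--screenshot` and the functions `parse_args` and `main` are all invented for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the URL plus optional element id and screenshot file name."""
    parser = argparse.ArgumentParser(description="Scrape text or capture a page (sketch).")
    parser.add_argument("url", help="page to open")
    parser.add_argument("--element-id", help="id of the element whose text to print")
    parser.add_argument("--screenshot", help="file name for a PNG capture")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)

    # Imported here so that argument parsing works even without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")
    options.add_argument("--window-size=1280,1024")

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(args.url)
        if args.element_id:
            print(driver.find_element(By.ID, args.element_id).text)
        if args.screenshot:
            driver.save_screenshot(args.screenshot)
    finally:
        driver.quit()
```

To use it as a script, add a call to `main()` at the bottom of the file and run it with, for example, `python3 scrape.py https://www.google.co.jp/ --element-id hptl --screenshot googletop.png` (the file name `scrape.py` is also just an example).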
