Web scraping with Python: first step

This article is for beginners who want to try web scraping with Python 3 and BeautifulSoup4.

I referred to past articles, but some of their code displayed warnings or did not work because of version differences, so I summarized the steps again here.

Overview

The basic process of web scraping is as follows.

① Get the web page.
② Parse the acquired page into elements and extract the parts you need.
③ Save the result in a database.

Use requests for ① and BeautifulSoup4 for ②. Since ③ differs depending on the environment, it is not covered in this article.

Preparation

After installing Python 3, use the pip command to install the three packages requests, lxml, and BeautifulSoup4.

$ pip install requests 
$ pip install lxml
$ pip install beautifulsoup4
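
To confirm that the three packages were installed, a small check like the following may help. This is only a sketch: the helper name installed_versions is made up for this example, and it relies on importlib.metadata from the standard library, which assumes Python 3.8 or later.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names exactly as they were passed to pip above.
PACKAGES = ["requests", "lxml", "beautifulsoup4"]


def installed_versions(names):
    """Return {name: version string, or None if the package is missing}."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver if ver else 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be installed again with the pip commands above.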

Program execution

Create the following script file.

sample.py


import requests
from bs4 import BeautifulSoup

target_url = 'http://example.co.jp'  # example.co.jp is a fictitious domain; change it to any URL
r = requests.get(target_url)         # Fetch the page with requests
soup = BeautifulSoup(r.text, 'lxml') # Parse the HTML with the lxml parser

for a in soup.find_all('a'):
    print(a.get('href'))             # Print the href of each link

Start a command prompt and execute the following command.

$ python sample.py 

If the links on the page are printed to the console, it worked!
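
As a possible variation on sample.py, steps ① and ② can be split into two functions so that the parsing part can be tried without network access. This is a sketch under assumptions, not the only way to structure it: the function names fetch_html and extract_links, the 10-second timeout, and the inline sample HTML are all made up for this example. It uses Python's built-in 'html.parser' so the parsing part runs even without lxml; 'lxml' can be passed instead, as in sample.py.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Step 1: download a page and return its HTML text."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return r.text


def extract_links(html):
    """Step 2: return the href of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a") if a.get("href")]


# Offline check with an inline HTML string (no request is sent here).
sample_html = (
    '<p><a href="/about">About</a> '
    '<a name="top">Top</a> '
    '<a href="http://example.co.jp/">Home</a></p>'
)
print(extract_links(sample_html))  # ['/about', 'http://example.co.jp/']
```

For a real page, the two functions would be combined as extract_links(fetch_html(target_url)).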

Beautiful Soup methods

Here are some useful methods for BeautifulSoup.

soup.a.string          # Get the string inside the first a tag
soup.a.attrs           # Get all attributes of the first a tag as a dict
soup.a.parent          # Get the parent element of the first a tag

soup.find('a')          # Returns the first matching element
soup.find_all(id='log') # Returns all elements whose id is 'log'

soup.select('head > title')   # Select elements with a CSS selector

BeautifulSoup has many other methods you can use. For details, please refer to the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
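
To see what the methods listed above return, they can be tried on a small inline document. The HTML below is invented for this example, and 'html.parser' is used only so that the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical content for illustration).
html = """
<html><head><title>Sample page</title></head>
<body>
<div id="log"><a href="/first" class="nav">First link</a></div>
<p><a href="/second">Second link</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.a                                # the first a tag, like soup.find('a')
print(first.string)                           # First link
print(first.attrs)                            # {'href': '/first', 'class': ['nav']}
print(first.parent.name)                      # div
print(len(soup.find_all('a')))                # 2
print(soup.find_all(id='log')[0].name)        # div
print(soup.select('head > title')[0].string)  # Sample page
```

Note that class comes back as a list, because an HTML element can have several classes.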

Narrow down the elements

Regular expressions from the re module are convenient for narrowing down the target elements.

import re
soup.find_all('a', href=re.compile("^http"))     # Links whose href starts with http
soup.find_all('a', href=re.compile("^(?!http)")) # Links whose href does not start with http (negative lookahead)
soup.find_all('a', string=re.compile("N"), title=re.compile("W")) # a tags whose text contains N and whose title contains W (string= is the current name of the older text= argument)
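
The three filters above can be checked against a small inline document. The links and titles below are invented for this example. The sketch uses string=, the name Beautiful Soup has used for the text= argument since version 4.4.0, and 'html.parser' so that lxml is not required.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical links, written only to exercise the three filters.
html = """
<a href="http://example.co.jp/news" title="World">News</a>
<a href="/about" title="Who we are">About</a>
<a href="https://example.co.jp/" title="Top">Home</a>
"""
soup = BeautifulSoup(html, "html.parser")

absolute = soup.find_all('a', href=re.compile("^http"))
relative = soup.find_all('a', href=re.compile("^(?!http)"))
both = soup.find_all('a', string=re.compile("N"), title=re.compile("W"))

print([a['href'] for a in absolute])  # ['http://example.co.jp/news', 'https://example.co.jp/']
print([a['href'] for a in relative])  # ['/about']
print([a.string for a in both])       # ['News']
```

When several keyword filters are given, an element must satisfy all of them, which is why only the News link matches the last call.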

Manipulating strings

Here is a supplementary explanation of string operations that are useful to remember when scraping.

・ Remove leading and trailing whitespace
"  abc  ".strip()
→'abc'
・ Split a string
"a, b, c,".split(',')
→['a', ' b', ' c', '']
・ Search for a substring
"abcde".find('c') # Returns the index of the first match, or -1 if not found.
→2
・ Replace characters
"abcdc".replace('c', 'x')
→'abxdx'
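
Scraped text often needs several of these operations together. The following sketch combines them; the input string raw is made up for this example.

```python
# Hypothetical text as it might come out of a scraped element.
raw = "  Python ,  web scraping,BeautifulSoup4 ,  "

# Split on commas, strip each piece, and drop pieces that end up empty.
items = [part.strip() for part in raw.split(',') if part.strip()]
print(items)                             # ['Python', 'web scraping', 'BeautifulSoup4']

# find() returns -1 when the substring is not present.
print("abcde".find('z'))                 # -1

# replace() changes every occurrence, not just the first one.
print("web-scraping-notes".replace('-', ' '))  # web scraping notes
```

Stripping each piece after the split avoids the leftover spaces and the empty last element that a plain split(',') produces.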

Referenced articles

http://qiita.com/itkr/items/513318a9b5b92bd56185
