[PYTHON] You will be an engineer in 100 days ――Day 71 ――Programming ――About scraping 2


This installment continues the topic of scraping.

About communication

We will scrape using Python. Since scraping involves network communication, you need to know how that communication works.

Websites are hosted on servers around the world. On the web, communication with a server is basically performed using a protocol called HTTP (HyperText Transfer Protocol).

A message sent from the browser to the server is called a request, and the message the server sends back to the browser is called a response.

Basic exchanges on the web consist of request / response (R / R) pairs, which are essentially exchanges of text messages.

** Site search example **

You perform a search from your browser (request). The server answers the request with the search results (response). The browser displays the search results based on the response.

HTTP defines several request methods, so there are multiple ways to send a request.

** GET communication **

GET sends a request with the parameters appended to the URL.

Example: http://otupy.com?p=abc&u=u123

Everything after the ? is the parameter part, written as key=value pairs joined with &.
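
As a small illustration of the key=value structure, the following sketch splits the example URL into its parameters and then builds the same query string back from a dictionary. It uses Python's standard urllib.parse module, not the requests library introduced later in this article, and the variable names are only illustrative.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# The example URL from the text.
url = "http://otupy.com?p=abc&u=u123"

# Everything after "?" is the query string.
query = urlsplit(url).query

# Split the query string on "&" and "=" into key/value pairs.
params = dict(parse_qsl(query))

# Going the other way: build a query string from a dictionary.
rebuilt = urlencode(params)

print(query)
print(params)
print(rebuilt)
```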

** POST communication **

POST sends a request with the parameters included in the request body.

http://otupy.com

Request body parameters: p:ab, u:u123

** Use POST and GET properly **

A browser selects the appropriate communication method by itself, but a program must specify the communication method explicitly.

Request

A message sent from a browser to the server of a website is called a request.

When you open a web page in your browser, the browser sends a request message to the server, such as:

GET example:

Request header:
GET http://www.otupy.com/ex/http.htm HTTP/1.1
Host: www.otupy.com
Proxy-Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: https://www.google.co.jp/
Accept-Encoding: gzip, deflate
Accept-Language: ja,en-US;q=0.9,en;q=0.8

POST example:

Request header:
POST /hoge/ HTTP/1.1
Host: localhost:8080
Connection: keep-alive
Content-Length: 22
Cache-Control: max-age=0
Origin: http://localhost:8080
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://localhost:8080/hoge/
Accept-Encoding: gzip, deflate, br
Accept-Language: ja,en-US;q=0.8,en;q=0.6

Request body:
name=hoge&comment=hoge

A request has a header part and a body part, and what information goes into each depends on the communication method.

Therefore, when accessing a site programmatically, you need to fill in the appropriate information yourself before making the request.
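
To make the header / body split concrete, here is a minimal sketch that assembles the text of a request by hand. The function build_request is a hypothetical helper written only for this illustration, and it emits far fewer header lines than a real browser does. It is not how you would normally send requests. It only shows that the body of a POST is described by header lines and follows a blank line.

```python
from urllib.parse import urlencode


def build_request(method, path, host, form=None):
    """Return the text of a minimal HTTP/1.1 request (illustration only)."""
    # Form data is encoded as key=value pairs joined with "&".
    body = urlencode(form) if form else ""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    if body:
        # When there is a body, header lines describe its type and length.
        lines.append("Content-Type: application/x-www-form-urlencoded")
        lines.append(f"Content-Length: {len(body.encode('utf-8'))}")
    # A blank line separates the header from the body.
    return "\r\n".join(lines) + "\r\n\r\n" + body


get_message = build_request("GET", "/ex/http.htm", "www.otupy.com")
post_message = build_request(
    "POST", "/hoge/", "localhost:8080", {"name": "hoge", "comment": "hoge"}
)
print(post_message)
```

The POST message produced here has the same body and Content-Length as the POST example above.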

Programmatic access

Let's try scraping right away.

In Python, you can make HTTP requests with a library called requests.


import requests

Specify the URL of the website you want to access and send a GET request with requests.get(URL).

url = 'http://www.otupy.net/'
res = requests.get(url)
print(res)

The call returns a response. If the communication succeeds, you can get the information served at the URL you accessed.

Of course, because this is network communication, it can also fail.

Communication result (response)

The response to a request carries a status code that indicates the result. Codes in the 200s mean the communication succeeded. Codes in the 400s and 500s mean it failed, so check whether the URL was entered incorrectly or whether the other party's server can be reached.

Classification	Code	Message	Description
Informational	100	Continue	Processing is continuing. Please continue sending the request.
Informational	101	Switching Protocols	The server is switching to the protocol specified in the Upgrade header.
Success	200	OK	The request succeeded.
Success	201	Created	New content has been created at the location specified in the Location header.
Success	202	Accepted	The request has been accepted, but processing is not yet complete.
Success	203	Non-Authoritative Information	The response headers differ from what the origin server returned, but the request succeeded.
Success	204	No Content	There is no content to return, but the request succeeded.
Success	205	Reset Content	The request has been accepted. Please reset the current content (screen).
Success	206	Partial Content	Only part of the content is returned.
Redirection	300	Multiple Choices	There are multiple options for how to get the content.
Redirection	301	Moved Permanently	The content has moved permanently to the location specified in the Location header.
Redirection	302	Found	The content was found at another location specified in the Location header. Please look there.
Redirection	303	See Other	Please look at the other location specified in the Location header.
Redirection	304	Not Modified	The content has not been updated. Returned when a header such as If-Modified-Since is used.
Redirection	305	Use Proxy	Use the proxy specified in the Location header.
Redirection	306	(Unused)	Unused.
Redirection	307	Temporary Redirect	The content has temporarily moved to another location.
Client error	400	Bad Request	The request is invalid.
Client error	401	Unauthorized	Not authenticated.
Client error	402	Payment Required	Payment is required.
Client error	403	Forbidden	Access is not allowed.
Client error	404	Not Found	Not found.
Client error	405	Method Not Allowed	The specified method is not supported.
Client error	406	Not Acceptable	No content matching the request's Accept headers is available.
Client error	407	Proxy Authentication Required	Proxy authentication is required.
Client error	408	Request Timeout	The request has timed out.
Client error	409	Conflict	The request has a conflict.
Client error	410	Gone	The requested content is gone.
Client error	411	Length Required	Please add a Content-Length header and send the request again.
Client error	412	Precondition Failed	The conditions specified in an If-... header were not met.
Client error	413	Request Entity Too Large	The requested entity is too large.
Client error	414	Request-URI Too Long	The requested URI is too long.
Client error	415	Unsupported Media Type	Unsupported media type.
Client error	416	Requested Range Not Satisfiable	The requested range is invalid.
Client error	417	Expectation Failed	The expectation specified in the Expect header could not be met.
Server error	500	Internal Server Error	An unexpected error has occurred on the server.
Server error	501	Not Implemented	Not implemented.
Server error	502	Bad Gateway	The gateway is invalid.
Server error	503	Service Unavailable	Service is not available.
Server error	504	Gateway Timeout	The gateway has timed out.
Server error	505	HTTP Version Not Supported	This HTTP version is not supported.
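
Because the classification depends only on the first digit of the code, it can be computed rather than looked up. The sketch below is one possible way to do that. The function classify and its class labels are hypothetical names chosen for this illustration, while HTTPStatus comes from Python's standard http module and supplies the standard message for each code.

```python
from http import HTTPStatus


def classify(status_code):
    """Return the class of an HTTP status code based on its first digit."""
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client error",
        5: "Server error",
    }
    return classes.get(status_code // 100, "Unknown")


for code in (200, 301, 404, 503):
    # HTTPStatus(code).phrase is the standard message for the code.
    print(code, HTTPStatus(code).phrase, "->", classify(code))
```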

Checking the communication result in the program

Now let's check the communication result programmatically.

You can check the status code with the response variable's .status_code attribute.


url = 'http://www.otupy.net/'
res = requests.get(url)
print(res.status_code)

200

If the code is in the 400s or 500s, the communication has failed and the information on the website could not be obtained.

If the code is 200, the communication succeeded and you can read the information obtained from the website.
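
In a scraping script you usually want to handle both kinds of failure: a response with an error status code, and a request that never got a response at all. The following is a minimal sketch of one way to do this with requests. The function name fetch_text and the 10-second timeout are assumptions made for this example, not something this article prescribes.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the page text, or None if the request failed."""
    try:
        res = requests.get(url, timeout=timeout)
        # Raise an HTTPError for status codes in the 400s and 500s.
        res.raise_for_status()
    except requests.exceptions.RequestException as err:
        # Covers connection errors, timeouts, invalid URLs and HTTP errors.
        print(f"Request failed: {err}")
        return None
    return res.text
```

Calling fetch_text('http://www.otupy.net/') would return the page's HTML when the site responds normally, and None otherwise.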

Since the communication result is stored in a variable, you can inspect its contents in various ways.

Request URL: response variable .url

Status code: response variable .status_code

Response body as text: response variable .text

Response body in binary format: response variable .content

Cookies: response variable .cookies

Encoding information: response variable .encoding

From here on, we will take the retrieved text and extract the necessary information from it.

# Get the response body in binary format, decode it to text, and display the first 1000 characters
print(res.content.decode('utf-8')[:1000])

....
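
When you take the first N characters of a binary response body, the order of slicing and decoding matters, because slicing bytes counts bytes rather than characters. The sketch below shows this with a short sample string that stands in for a downloaded page. The sample text is an assumption for illustration, not data from the site.

```python
# Sample text standing in for a downloaded page (not real site data).
text = "こんにちは" * 4            # 20 characters
data = text.encode("utf-8")      # 60 bytes: each of these characters takes 3 bytes

# Slicing bytes counts bytes, so the cut can land inside a character.
try:
    data[:10].decode("utf-8")
    cut_ok = True
except UnicodeDecodeError:
    cut_ok = False

# Decoding first and slicing afterwards counts characters.
first_ten = data.decode("utf-8")[:10]

print(len(text), len(data), cut_ok)
print(first_ten)
```

For pages that contain only ASCII characters the two orders give the same result, which is why the difference is easy to miss.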

Custom header

When making a request, you can pack additional information into the request header and body.

To specify the header in a GET request, do as follows.

requests.get(url, headers=header data as a dictionary)

For example, to change the user agent sent to the site, specify it as header information.

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
res = requests.get(url, headers=headers)

To send parameters in a GET request, specify them as follows.

requests.get(url, params=parameter data as a dictionary)

params = {'key1': 'value1', 'key2': 'value2'}
res = requests.get(url, params=params)

To send information in the request body with a POST request, do as follows.

requests.post(url, data=body data as a dictionary)

payload = {'send': 'data'}
res = requests.post(url, data=payload)
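
If you want to see where headers, params and data end up without sending anything, one option is to build a prepared request and inspect it. This is only a sketch. It relies on requests.Request(...).prepare(), and the user agent string "my-scraper/0.1" is a made-up value for illustration.

```python
import requests

url = "http://www.otupy.net/"  # the site used throughout this article

# GET with query parameters and a custom header (nothing is sent yet).
get_req = requests.Request(
    "GET",
    url,
    params={"key1": "value1", "key2": "value2"},
    headers={"User-Agent": "my-scraper/0.1"},  # hypothetical user agent
).prepare()

# POST with form data in the body (again, nothing is sent).
post_req = requests.Request("POST", url, data={"send": "data"}).prepare()

print(get_req.url)                       # params are appended to the URL
print(get_req.headers["User-Agent"])     # the custom header is set as given
print(post_req.body)                     # data is form-encoded into the body
print(post_req.headers["Content-Type"])
```

The params dictionary becomes the part of the URL after the ?, while the data dictionary becomes the request body, matching the GET and POST descriptions earlier in this article.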

Summary

Make sure you understand the communication mechanism that scraping relies on, so that you can retrieve information from websites. Tomorrow, we will continue from here and start extracting the necessary information from what we retrieved.

29 days until you become an engineer

Author information

Otsu py's website: http://www.otupy.net/

Youtube： https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter： https://twitter.com/otupython