Previous posts in this series
You will become an engineer in 100 days - Day 70 - Programming - About scraping
You will become an engineer in 100 days - Day 66 - Programming - About natural language processing
You will become an engineer in 100 days - Day 63 - Programming - Probability 1
You will become an engineer in 100 days - Day 59 - Programming - Algorithms
You will become an engineer in 100 days - Day 53 - Git - About Git
You will become an engineer in 100 days - Day 42 - Cloud - About cloud services
You will become an engineer in 100 days - Day 36 - Database - About the database
You will become an engineer in 100 days - Day 24 - Python - Basics of Python language 1
You will become an engineer in 100 days - Day 18 - Javascript - JavaScript basics 1
You will become an engineer in 100 days - Day 14 - CSS - CSS Basics 1
You will become an engineer in 100 days - Day 6 - HTML - HTML basics 1
This post continues the scraping topic.
We will scrape with Python. Scraping involves network communication, so you first need to know how that communication works.
Websites are hosted on servers around the world.
On the web, communication with a server is basically performed using a protocol (a set of communication rules) called HTTP (HyperText Transfer Protocol).
A message sent from the browser to the server is called a request.
A message sent from the server back to the browser is called a response.
Basic exchanges on the web consist of request/response pairs, which are essentially text messages passed back and forth.
**Site search example**
You run a search from your browser, which sends a request.
The server answers the request with a response containing the results.
The browser displays the search results based on the response.
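
The three steps above can be tried locally. This is a minimal sketch that uses only Python's standard library (`http.server` and `http.client`) rather than a real browser and a real site. The handler name `SearchHandler`, the path `/search`, and the parameter `q` are made up for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, quote, urlsplit


class SearchHandler(BaseHTTPRequestHandler):
    """Hypothetical search endpoint: replies with the keyword it received."""

    def do_GET(self):
        query = parse_qs(urlsplit(self.path).query)
        keyword = query.get("q", [""])[0]
        body = f"results for {keyword}".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def search_once(keyword):
    """Start a throwaway local server, send one request, return (status, text)."""
    server = HTTPServer(("127.0.0.1", 0), SearchHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", "/search?q=" + quote(keyword))  # step 1: request
        response = conn.getresponse()                       # step 2: response
        text = response.read().decode("utf-8")
        conn.close()
        return response.status, text
    finally:
        server.shutdown()
        server.server_close()


status, text = search_once("python")
print(status, text)  # step 3: the "browser" displays the result
```

The server and the client run in the same process here only to keep the example self-contained. On the real web they are different machines.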
The HTTP specification defines several ways (methods) to send a request.
**GET communication**
GET sends the request with the parameters appended to the URL.
Example:
http://otupy.com?p=abc&u=u123
Everything after the ? is the parameters. Each parameter is a key=value pair, and the pairs are joined with &.
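
As a small illustration of that structure, the sketch below takes the example URL apart and puts it back together with the standard library's `urllib.parse`. The variable names are only for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "http://otupy.com?p=abc&u=u123"

# Everything after "?" is the query string.
query_string = urlsplit(url).query

# parse_qs maps each key to a list, because a key may appear more than once.
params = parse_qs(query_string)

# Going the other way: build the query string from a dictionary.
rebuilt = "http://otupy.com?" + urlencode({"p": "abc", "u": "u123"})

print(query_string)
print(params)
print(rebuilt == url)
```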
**POST communication**
POST sends the parameters in the request body instead of in the URL.
http://otupy.com
Request Body
param:p:ab,u:u123
**Use POST and GET properly**
A browser chooses the appropriate method by itself, but a program must specify the method explicitly.
When you open a web page in your browser, the browser sends a request message like the following to the server.
GET example:
Request header
GET http://www.otupy.com/ex/http.htm HTTP/1.1
Host: www.otupy.com
Proxy-Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: https://www.google.co.jp/
Accept-Encoding: gzip, deflate
Accept-Language: ja,en-US;q=0.9,en;q=0.8
POST example:
Request header:
POST /hoge/ HTTP/1.1
Host: localhost:8080
Connection: keep-alive
Content-Length: 22
Cache-Control: max-age=0
Origin: http://localhost:8080
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://localhost:8080/hoge/
Accept-Encoding: gzip, deflate, br
Accept-Language: ja,en-US;q=0.8,en;q=0.6
Request body:
name=hoge&comment=hoge
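
The Content-Length header and the request body in the POST example are linked: the header is the size of the body in bytes. A minimal sketch with the standard library, assuming the form fields are `name` and `comment` as in the body above:

```python
from urllib.parse import urlencode

form = {"name": "hoge", "comment": "hoge"}

# Form fields are encoded like GET parameters: key=value pairs joined by "&".
body = urlencode(form).encode("ascii")

# Content-Length is the size of the body in bytes.
content_length = len(body)

print(body.decode("ascii"))
print(content_length)
```

The result matches the `Content-Length: 22` line in the example header.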
A request consists of a header part and a body part, and which information goes where depends on the method.
When you access a site from a program, you therefore need to fill in the appropriate information yourself before sending the request.
Let's try scraping right away.
In Python, you can communicate over HTTP with a library called requests.
import requests
You need a website to access, so specify its URL and communicate with GET.
requests.get(URL)
url = 'http://www.otupy.net/'
res = requests.get(url)
print(res)
<Response [200]>
The communication returns a response.
If the communication succeeds, you get the information from the site you accessed.
Of course, since this is network communication, it can also fail.
The outcome is reported as a status code, and the codes are grouped into several classes.
Codes in the 200s mean success. Codes in the 400s and 500s mean the communication failed, so check whether the URL is mistyped and whether the other party's server is reachable.
Classification | Number | Message | Description |
---|---|---|---|
information | 100 | Continue | Processing is continuing. Please send a further request. |
information | 101 | Switching Protocols | The server is switching to the protocol requested in the Upgrade header. |
success | 200 | OK | Succeeded. |
success | 201 | Created | The new content has been created in the location specified in the Location header. |
success | 202 | Accepted | The request has been accepted. However, the process is not completed. |
success | 203 | Non-Authoritative Information | The response headers are different from what the original server returned, but the process is successful. |
success | 204 | No Content | There is no content, but the process was successful. |
success | 205 | Reset Content | The request has been accepted; please reset the current content (screen). |
success | 206 | Partial Content | Only part of the content will be returned. |
transfer | 300 | Multiple Choices | There are multiple options for how to get the content. |
transfer | 301 | Moved Permanently | The content has moved permanently to the location specified in the Location header. |
transfer | 302 | Found | Found in another location specified in the Location header. Please look there. |
transfer | 303 | See Other | Look at the other location specified in the Location header. |
transfer | 304 | Not Modified | Not updated. Returned when the If-Modified-Since header is used. |
transfer | 305 | Use Proxy | Use the proxy specified in the Location header. |
transfer | 306 | (Unused) | Unused. |
transfer | 307 | Temporary Redirect | The content has temporarily moved to another location. |
Client error | 400 | Bad Request | The request is invalid. |
Client error | 401 | Unauthorized | Not authenticated. |
Client error | 402 | Payment Required | Payment is required. |
Client error | 403 | Forbidden | Access is not allowed. |
Client error | 404 | Not Found | Not found. |
Client error | 405 | Method Not Allowed | The specified method is not supported. |
Client error | 406 | Not Acceptable | No content matches the Accept headers of the request. |
Client error | 407 | Proxy Authentication Required | Proxy authentication is required. |
Client error | 408 | Request Timeout | The request has timed out. |
Client error | 409 | Conflict | The request has a conflict. |
Client error | 410 | Gone | The requested content is gone. |
Client error | 411 | Length Required | Please add a Content-Length header and send the request again. |
Client error | 412 | Precondition Failed | The conditions specified in the If-... headers were not met. |
Client error | 413 | Request Entity Too Large | The requested entity is too large. |
Client error | 414 | Request-URI Too Long | The requested URI is too long. |
Client error | 415 | Unsupported Media Type | Unsupported media type. |
Client error | 416 | Requested Range Not Satisfiable | The requested range is invalid. |
Client error | 417 | Expectation Failed | The extension request specified in the Expect header has failed. |
Server error | 500 | Internal Server Error | An unexpected error has occurred on the server. |
Server error | 501 | Not Implemented | Not implemented. |
Server error | 502 | Bad Gateway | The gateway is invalid. |
Server error | 503 | Service Unavailable | Service is not available. |
Server error | 504 | Gateway Timeout | The gateway has timed out. |
Server error | 505 | HTTP Version Not Supported | This HTTP version is not supported. |
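
The classification column follows directly from the first digit of the code. The sketch below shows one possible way to express that in Python; the helper names `classify` and `describe` are invented for this example, the labels follow the table above, and the messages come from the standard library's `http.HTTPStatus`.

```python
from http import HTTPStatus

# Labels taken from the classification column of the table above.
STATUS_CLASSES = {
    1: "information",
    2: "success",
    3: "transfer",
    4: "Client error",
    5: "Server error",
}


def classify(code):
    """Return the class of an HTTP status code based on its first digit."""
    return STATUS_CLASSES.get(code // 100, "unknown")


def describe(code):
    """Return a string such as '404 Not Found (Client error)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # code is not a registered status
        phrase = "Unknown"
    return f"{code} {phrase} ({classify(code)})"


for code in (200, 301, 404, 503):
    print(describe(code))
```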
Now let's check the communication result from the program.
You can read the status code from the status_code attribute of the response variable.
Response variable.status_code
url = 'http://www.otupy.net/'
res = requests.get(url)
print(res.status_code)
200
If the code is 200, the communication succeeded and you can look at the information obtained from the website.
If it is in the 400s or 500s, the communication failed and the information on the website could not be obtained.
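
In a script it is convenient to wrap this check in a small function. The following is only a sketch: the helper name `body_or_none` is invented, and the two `SimpleNamespace` objects are stand-ins with made-up values so that the example runs without network access. It relies only on the `status_code` and `text` attributes that a requests response provides.

```python
from types import SimpleNamespace


def body_or_none(res):
    """Return the page text for a 2xx response, otherwise None."""
    if 200 <= res.status_code < 300:
        return res.text
    print(f"Request failed with status {res.status_code}")
    return None


# Stand-ins for real responses (hypothetical values, no network access).
ok = SimpleNamespace(status_code=200, text="<html>hello</html>")
missing = SimpleNamespace(status_code=404, text="Not Found")

print(body_or_none(ok))
print(body_or_none(missing))
```

With requests, the same function can be called as `body_or_none(requests.get(url))`. The library also offers `res.raise_for_status()`, which raises an exception for 4xx and 5xx codes instead of returning a value.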
Since the communication result is stored in a variable, you can inspect various parts of it.
What to get | How to get it |
---|---|
Request URL | Response variable.url |
Status code | Response variable.status_code |
Response body in text format | Response variable.text |
Response body in binary format | Response variable.content |
Cookies | Response variable.cookies |
Encoding information | Response variable.encoding |
From here on, we will take the acquired text and pick out the information we need.
# Get the response body as bytes, decode it to text, and display the first 1000 characters
print(res.content.decode('utf-8')[0:1000])
....
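
The order of decoding and slicing matters for pages that contain multi-byte characters such as Japanese. The sketch below uses a made-up byte string in place of `res.content` so that it runs offline. It shows that slicing the bytes first can cut a character in half, while decoding first slices whole characters.

```python
# Stand-in for res.content: UTF-8 bytes of a short Japanese snippet (made up).
content = "<title>スクレイピング入門</title>".encode("utf-8")

# Decode first, then slice: the slice counts characters.
text = content.decode("utf-8")
first_chars = text[0:9]
print(first_chars)

# Slice first, then decode: the slice counts bytes and may split a character.
try:
    print(content[0:9].decode("utf-8"))
    split_detected = False
except UnicodeDecodeError as error:
    print("cut in the middle of a character:", error.reason)
    split_detected = True

# errors="ignore" drops the broken tail instead of raising.
safe = content[0:9].decode("utf-8", errors="ignore")
print(safe)
```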
When making a request, you can pack information into the request header and body.
To specify headers in a GET request, do as follows.
requests.get(url, headers=dictionary of header data)
For example, specify this to change the user agent sent as header information.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
res = requests.get(url, headers=headers)
To send parameters with a GET request, specify them as follows.
requests.get(url, params=dictionary of parameter data)
params = {'key1': 'value1', 'key2': 'value2'}
res = requests.get(url, params=params)
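
To see what the headers and params options actually put into the request, you can build the request without sending it. This is a sketch that assumes the requests library is installed. The user agent string `my-scraper/0.1` is a made-up value.

```python
import requests

url = "http://www.otupy.net/"
headers = {"User-Agent": "my-scraper/0.1"}  # hypothetical user agent
params = {"key1": "value1", "key2": "value2"}

# Build the request object and prepare it; nothing is sent over the network.
prepared = requests.Request("GET", url, headers=headers, params=params).prepare()

print(prepared.method)
print(prepared.url)                    # params are appended after "?"
print(prepared.headers["User-Agent"])  # headers are carried as given
```

The prepared URL shows the parameters joined as key=value pairs with &, which is the same form a browser would put in the address bar.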
To pack information into the request body with a POST request, do as follows.
requests.post(url, data=dictionary of body data)
payload = {'send': 'data'}
res = requests.post(url, data=payload)
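
With POST, the data dictionary goes into the body rather than the URL. The sketch below, again assuming requests is installed, prepares the request without sending it so the body and the related headers can be inspected.

```python
import requests

url = "http://www.otupy.net/"
payload = {"send": "data"}

# Build and prepare the POST request; nothing is sent over the network.
prepared = requests.Request("POST", url, data=payload).prepare()

print(prepared.method)
print(prepared.url)                        # the URL carries no parameters
print(prepared.headers["Content-Type"])    # form encoding is chosen for a dict
print(prepared.headers["Content-Length"])  # size of the body in bytes
print(prepared.body)
```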
Make sure you understand the communication mechanism required for scraping so that you can acquire information. Tomorrow we will continue by extracting the necessary information from what we acquired.
29 days until you become an engineer
Otsu py's HP: http://www.otupy.net/
Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw
Twitter: https://twitter.com/otupython