I have been fetching HTML with open-uri, parsing it with Nokogiri, and scraping it.
Because open-uri reads HTML as binary, I looked into what happens when HTML that was read in binary is parsed by Nokogiri with the encoding set to nil.
Let's set the encoding to nil when parsing HTML with Nokogiri. To do so, pass nil as the third argument of HTML.parse.
The following HTML file is loaded this time. The file is written in Shift_JIS.
hello.html
<html>
<head>
<title>Hello</title>
<meta charset="Shift_JIS">
</head>
<body>
</body>
</html>
Load the HTML in binary mode. You can read a file as binary by passing the 'rb' mode to the open method.
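As a rough illustration of what the 'rb' mode changes, the sketch below writes a few Shift_JIS bytes to a temporary file and reads them back both ways. The temporary file and the variable names are assumptions made for this example only. The bytes on disk are identical; only the encoding label on the resulting string differs.

```ruby
require 'tempfile'

# Hypothetical sample data: two Japanese characters stored as Shift_JIS bytes.
file = Tempfile.new('hello')
file.binmode
file.write("\u3053\u3093".encode('Shift_JIS'))
file.close

# 'rb' tags the string as binary (ASCII-8BIT); plain 'r' tags it with
# Ruby's default external encoding, whatever that is on this machine.
binary = File.open(file.path, 'rb') { |io| io.read }
text = File.open(file.path, 'r') { |io| io.read }

p binary.encoding
p text.encoding
p binary.bytes == text.bytes # same bytes either way

file.unlink
```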
For verification, let's display the encoding of the string right after the binary read and the encoding of the document after parsing with Nokogiri.
sample.rb
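require 'nokogiri'

# Read the file in binary mode; the block form closes the file afterwards.
html = File.open('hello.html', 'rb') { |f| f.read }
p html.encoding
# Pass nil as the encoding (third argument) and let Nokogiri decide.
p Nokogiri::HTML.parse(html, nil, nil).encoding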
Execution result
sample.rb result
$ ruby sample.rb
#<Encoding:ASCII-8BIT>
"Shift_JIS"
The result confirms that the string read from the file is tagged ASCII-8BIT, while the document parsed by Nokogiri reports Shift_JIS, matching the original file.
Incidentally, you get the same result if you omit the arguments and call HTML.parse(html).
Judging from the result above, Nokogiri looks up the encoding somewhere on its own. Where does it look?
It refers to the meta element of the original HTML file, specifically the charset in <meta charset="Shift_JIS">.
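To make the idea of "looking at the meta element" concrete, here is a minimal pure-Ruby sketch that pulls a charset name out of a binary HTML string with a regular expression. The helper name sniff_meta_charset and the 1024-byte window are assumptions for illustration. This is only a rough stand-in for this kind of lookup, not Nokogiri's actual implementation.

```ruby
# Hypothetical helper: a regex-based stand-in for a <meta> charset lookup.
# It returns the charset name, or nil when no charset is declared.
def sniff_meta_charset(html)
  # Work on a binary copy so invalid byte sequences cannot raise.
  head = html.b[0, 1024]
  match = head.match(/<meta[^>]*charset\s*=\s*["']?\s*([A-Za-z0-9_\-]+)/i)
  match && match[1]
end

sjis_html = '<html><head><meta charset="Shift_JIS"></head></html>'.b
utf8_html = '<html><head><meta charset="UTF-8"></head></html>'.b
bare_html = '<html><head><meta></head></html>'.b

p sniff_meta_charset(sjis_html) # "Shift_JIS"
p sniff_meta_charset(utf8_html) # "UTF-8"
p sniff_meta_charset(bare_html) # nil
```

The declared charset is only a label in the markup, so the sketch reports whatever the meta element says, whether or not the bytes were really written in that encoding.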
Try changing the charset to UTF-8 and print the encodings in the same way as before.
hello.html
<html>
<head>
<title>Hello</title>
<meta charset="UTF-8">
</head>
<body>
</body>
</html>
Execution result
sample.rb result
$ ruby sample.rb
#<Encoding:ASCII-8BIT>
"UTF-8"
You can see that the encoding after parsing has changed to UTF-8.
Incidentally, here is what happens when the charset is removed:
hello.html
<html>
<head>
<title>Hello</title>
<meta>
</head>
<body>
</body>
</html>
sample.rb result
$ ruby sample.rb
#<Encoding:ASCII-8BIT>
nil
The encoding after parsing is now nil. Naturally, if you print the title or other text in this state, the characters can come out garbled.
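The garbling comes from bytes being interpreted without the right encoding label. The pure-Ruby sketch below shows this under the assumption that the bytes are known to be Shift_JIS. The sample text and variable names are made up for this example. When the encoding is known in advance, its name could also be passed as the third argument of HTML.parse instead of nil.

```ruby
title = "\u3053\u3093\u306B\u3061\u306F" # a Japanese greeting, as UTF-8
raw = title.encode('Shift_JIS').b         # the same text as binary Shift_JIS bytes

# Labelling the bytes with the wrong encoding leaves an invalid string.
wrong = raw.dup.force_encoding('UTF-8')
p wrong.valid_encoding? # false

# Labelling them with the real encoding and converting recovers the text.
fixed = raw.dup.force_encoding('Shift_JIS').encode('UTF-8')
p fixed == title # true
```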
- If you read HTML in binary and set Nokogiri's encoding to nil, Nokogiri looks up the encoding on its own.
- Nokogiri refers to the charset of the meta element in the HTML that was read in binary.
- If no charset is written, the encoding reported by Nokogiri is nil.