When you access a URL containing Japanese (a Japanese URL) with Python 3, it gets encoded without your asking and an error occurs, so I am making a note of the workaround.

Contents

Background

Where I stumbled

response = urllib.request.urlopen(url)

This is ordinary code. It just accesses the URL and gets a response object. However, tragedy struck because this URL contained Japanese.

url = 'http://image.search.yahoo.co.jp/search?p=Evangelion'

The URL looked like this, except that the search word was written in Japanese.

This drags you straight into the dark side of Python. (Added: error details.)

Traceback (most recent call last):
・ ・ ・
    response = urllib.request.urlopen(link)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 465, in open
    response = self._open(req, data)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 483, in _open
    '_open', req)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 443, in _call_chain
    result = func(*args)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 1240, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/http/client.py", line 1083, in request
    self._send_request(method, url, body, headers)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/http/client.py", line 1118, in _send_request
    self.putrequest(method, url, **skips)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/http/client.py", line 960, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 14-21: ordinal not in range(128)

Judging from the error, ~~urllib is simply trying to convert the URL to ASCII, right?~~ Postscript: it was http (http/client.py) that was trying to convert the URL to ASCII!
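The failure can be reproduced without touching the network. The following is a minimal sketch, not the library's actual code: it assumes the request line has the usual "method, selector, HTTP version" shape, and it assumes the search word was the katakana spelling of "Evangelion" (the Japanese text itself is not shown in the URL above). Under those assumptions, the positions reported by the exception line up with the 14-21 in the traceback.

```python
# Assumption: the search word was the katakana for "Evangelion".
# The request line below only imitates what http.client sends.
request_line = 'GET /search?p=エヴァンゲリオン HTTP/1.1'

try:
    request_line.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

# error.start is the index of the first non-ASCII character,
# error.end is one past the last one.
print(error.start, error.end - 1)  # 14 21
```

So the problem is not in urlopen itself. Any non-ASCII character that reaches the request line unencoded will fail at this step.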

Where is the workaround! I searched, and the answer was: ~~the Japanese part should be parsed.~~ Postscript: the Japanese part should be URL-encoded (percent-encoded).

urllib.parse.quote_plus('Evangelion', encoding='utf-8')

Something like this? But there is a problem with this...

url = 'http://image.search.yahoo.co.jp/search?p=' + urllib.parse.quote_plus('Evangelion', encoding='utf-8')
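When the query is built by hand like this, urllib.parse.urlencode can do the quoting and the joining in one step. This is a sketch of an alternative, not what the post originally did. The keyword is an assumed Japanese search word and the variable names are made up for illustration.

```python
from urllib.parse import urlencode, unquote_plus

# Assumption: an example Japanese search word (katakana for "Evangelion").
keyword = 'エヴァンゲリオン'
base = 'http://image.search.yahoo.co.jp/search'

# urlencode percent-encodes each value (with quote_plus by default)
# and joins the pairs with '=' and '&'.
url = base + '?' + urlencode({'p': keyword}, encoding='utf-8')

# The result contains only ASCII characters, so this no longer raises.
url.encode('ascii')

# Decoding the value gives the original keyword back.
decoded = unquote_plus(url.split('=', 1)[1])
```

This only helps when you have the keyword separately. It does not help when you are handed a complete URL that already contains Japanese.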

Done the straightforward way, it ends up like this... But when I looked it up, you can also specify characters to exclude from encoding! They are passed as the second argument (safe).

urllib.parse.quote_plus(url, "/:?=&")

Something like this? Some characters may still be missing from that list... It worked this way, but I was a little worried, so here is another method.
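The worry about missing characters is justified. Here is a small sketch, using hypothetical example.com URLs, of what the safe list above does and where it falls short: '#' and '%' are not in the list, so a fragment marker gets encoded and an already-encoded URL gets encoded a second time.

```python
from urllib.parse import quote_plus

SAFE = "/:?=&"  # the characters passed as the second argument above

# Hypothetical URLs, for illustration only.
encoded = quote_plus('http://example.com/search?p=あ&page=2', safe=SAFE)
print(encoded)  # http://example.com/search?p=%E3%81%82&page=2

# '#' is not in SAFE, so the fragment separator is encoded as %23.
with_fragment = quote_plus('http://example.com/search?p=あ#top', safe=SAFE)
print(with_fragment)  # http://example.com/search?p=%E3%81%82%23top

# '%' is not in SAFE either, so encoding twice turns % into %25.
twice = quote_plus(encoded, safe=SAFE)
print(twice)  # http://example.com/search?p=%25E3%2581%2582&page=2
```

Adding more characters to the safe list fixes individual cases, but the list depends on which URLs you expect to see.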

Conversely (?), I could just replace every Japanese character! So I tried that.

What I did

It's clumsy! However, with this method, each character that matches the regular expression is replaced with "the result of passing it to a function".

I wanted to do something smarter, but my stiff head could not come up with it... I don't know much about Python, so this may look bad at first glance... It seems a lambda cannot have side effects either. Please let me know if there is a better way. Would it be an iterator?

regex = r'[Ah-Gaa-熙]'
matchedList = re.findall(regex,url)
for m in matchedList:
   url = url.replace(m, urllib.parse.quote_plus(m, encoding="utf-8"))

When it comes to matching all Japanese, many articles write [A-n], but looking at the character code table, that range is nowhere near everything!

So!! Even though it means exposing dirty code written by someone who is not at all familiar with Python, I wrote this post because I wanted to share this last surprise.

Postscript: The correct way to write the regular expression

@KeisukeKudo-san suggested an improvement, so I will introduce it here as well! Strictly speaking, my notation misses some characters, so if you want to use this approach, please use the following.

regex = r'[Ah-Gaa-熙]'
# Changed the above as follows
regex = r'[^\x00-\x7F]'

The suggestion was: how about trying this? [\x00-\x7F] is a regular expression that matches ASCII characters. By negating it as above, you match every character that is not ASCII, which includes the Japanese ones. http://rubular.com/r/2dnoBUlKe9
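With this pattern, the "pass the match to a function and replace it with the result" idea can be written in one call: re.sub accepts a function as the replacement. The following is a sketch under that approach. The helper name quote_non_ascii and the example.com URL are made up for illustration.

```python
import re
from urllib.parse import quote

# One or more consecutive non-ASCII characters.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')

def quote_non_ascii(url):
    """Percent-encode only the non-ASCII runs of url (hypothetical helper)."""
    # re.sub calls the lambda once per match and inserts its return value.
    return NON_ASCII.sub(lambda m: quote(m.group(0), encoding='utf-8'), url)

result = quote_non_ascii('http://example.com/search?p=あ&page=2')
print(result)  # http://example.com/search?p=%E3%81%82&page=2
```

Because only non-ASCII characters are matched, the ASCII parts of the URL, including ':' '/' '?' '=' and '&', are left exactly as they were.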

Postscript: The most correct method for this case

@komeda-shinji suggested an improvement, so I will introduce it here as well! Thinking concretely about what we want to do, the situation is that the URL query contains characters that cannot be converted to ASCII, so it is better to URL-encode the query first, as follows.

The URL is decomposed into its components, only the query is URL-encoded, and the URL is then reassembled.

import urllib.request
from urllib.parse import quote_plus, urlparse

url = 'http://image.search.yahoo.co.jp/search?p=Evangelion'
# Split the URL into its components.
p = urlparse(url)
# URL-encode only the query, leaving the '=' and '&' separators intact.
query = quote_plus(p.query, safe='=&')
# Reassemble the URL from the components and the encoded query.
url = '{}://{}{}{}{}{}{}{}{}'.format(
    p.scheme, p.netloc, p.path,
    ';' if p.params else '', p.params,
    '?' if p.query else '', query,
    '#' if p.fragment else '', p.fragment)
response = urllib.request.urlopen(url)
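The same decompose, encode, reassemble steps can be written more compactly with urlsplit and urlunsplit, which put the pieces back together for you. This is a sketch of a variant, not the code that was suggested. The function name encode_query and the example.com URL are assumptions for illustration, and no request is sent.

```python
from urllib.parse import quote_plus, urlsplit, urlunsplit

def encode_query(url):
    """Return url with only its query percent-encoded (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep '=' and '&' so the key=value&key=value structure survives.
    query = quote_plus(parts.query, safe='=&')
    # _replace returns a copy of the named tuple with the new query.
    return urlunsplit(parts._replace(query=query))

result = encode_query('http://example.com/search?p=あ&page=2#top')
print(result)  # http://example.com/search?p=%E3%81%82&page=2#top
```

As with the code above, only the query is touched. Japanese in the path or in the host name would need separate handling.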
