On Windows 7 64-bit, Python 2.7.5. I turned the crawler I have long been using to collect Foursquare venues into a small tool. I am the only one who uses it, so the I/O is fairly rough. The input is the latitude/longitude of the upper-right and lower-left corners of a bounding box; the output is the venue ID, venue name, latitude/longitude, and genre, separated by commas, on the assumption that it will be written out as CSV. Since it is meant to be run from the command line, the results go to standard output so they can be written to a file by redirection. While waiting for input, a prompt describing what to enter (for example, "latitude of the northeastern corner") is written to standard error. The source code is saved in UTF-8.
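To make the setup concrete, here is a minimal sketch of the I/O layout just described. The names and prompts are my own placeholders, not the actual crawler, and the Foursquare API call is omitted.

# -*- coding: utf-8 -*-
# Sketch of the described I/O layout (hypothetical names, not the real crawler).
import sys

def read_float(prompt):
    sys.stderr.write(prompt)            # prompts go to stderr so redirection does not capture them
    return float(sys.stdin.readline())

ne_lat = read_float("NE latitude: ")
ne_lng = read_float("NE longitude: ")
sw_lat = read_float("SW latitude: ")
sw_lng = read_float("SW longitude: ")

# ... query the Foursquare venues API for the bounding box (omitted) ...
# and print one CSV row per venue to stdout, roughly:
# print ','.join([venue_id, name, str(lat), str(lng), genre])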
Execute as follows on the command line.
command1
$ python foursquare_crawler.py > washington_venues.csv
蛹...
Mojibake... The relevant part of the source code is as follows.
source1
import sys

sys.stderr.write("北東端の緯度：")  # prompt: "Latitude of the northeastern corner" -- a plain (byte) string literal
first_ne_lat = float(sys.stdin.readline())
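Presumably what happens here is that, because the source file is UTF-8, the byte-string literal contains UTF-8 bytes; sys.stderr passes them to the console unchanged, and the cp932 console misreads them. A quick check along these lines (my own sketch, not part of the original script):

# -*- coding: utf-8 -*-
s = "北東端の緯度："                             # byte string: the UTF-8 bytes exactly as written in the file
print repr(s)                                    # '\xe5\x8c\x97\xe6\x9d\xb1...' -- raw UTF-8, not cp932
print repr(s.decode('utf-8').encode('cp932'))    # the bytes a cp932 console would actually need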
I simply assumed the code page of the command prompt was the problem, so I temporarily switched it from cp932 (Shift_JIS) to UTF-8. On Windows, UTF-8 apparently goes by the name cp65001 ...
command2
$ chcp 65001
Active code page: 65001
Run again.
command3
$ python foursquare_crawler.py > washington_venues.csv
北æ±ç«¯ã®ç·¯åº¦ï¼š
Still garbled (´・ω・`) This time I explicitly set the error output to UTF-8 in the program itself.
source2
import sys, codecs

sys.stderr = codecs.getwriter('utf-8')(sys.stderr)   # encode error output as UTF-8
sys.stdin = codecs.getreader('utf-8')(sys.stdin)     # decode standard input from UTF-8
sys.stderr.write("北東端の緯度：")
first_ne_lat = float(sys.stdin.readline())
I also made standard input be read as UTF-8. But nothing changed ...
Wondering what was going on, I tried simply making the string a Unicode literal.
source3
sys.stderr.write(u"北東端の緯度：")  # the same prompt, but as a unicode literal
first_ne_lat = float(sys.stdin.readline())
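My reading of why this should help (the post does not dig into it, so treat this as a hedged explanation): when a unicode object is written to a stream attached to the console, Python 2 encodes it with that stream's encoding (sys.stderr.encoding), so with the command prompt back at cp932 the prompt is encoded as cp932 and displays correctly; a byte-string literal, by contrast, is passed through as raw UTF-8 bytes. The console encoding can be checked like this:

import sys
print >> sys.stderr, sys.stderr.encoding   # typically 'cp932' when stderr is a Japanese command prompt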
Then I set the code page of the command prompt back to cp932 and ran it again.
command4
$ chcp 932
Active code page: 932
$ python foursquare_crawler.py > washington_venues.csv
北東端の緯度：
It works (゜∀゜)!!! I have not investigated it in detail, but ordinary (byte) strings are managed byte by byte, so multi-byte full-width characters such as Japanese run into all sorts of problems that single-byte characters such as the alphabet do not when a string gets split up by bytes. With Unicode, one character is processed as one character, so Unicode seems to be the way to go when handling Japanese in Python.
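The byte-versus-character difference is easy to see with len(); the snippet below is just an illustration, not from the crawler itself:

# -*- coding: utf-8 -*-
s = "北東端"     # str: three kanji stored as UTF-8, three bytes each
u = u"北東端"    # unicode: three characters

print len(s)     # 9 -- counted per byte
print len(u)     # 3 -- counted per character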
Now let's make all the strings Unicode ...