Notes on using cChardet and python3-chardet in Python 3.3.1

There is a Python library called chardet. Given a bytes string, it infers which character encoding was used to produce it.

I wanted to use chardet with Python 3, but the official version is not yet compatible with Python 3.

When I searched, I found python3-chardet, a fork of chardet, so I decided to use it.

Installation

Download it from GitHub and install it.

$ git clone git@github.com:bsidhom/python3-chardet.git

In the directory that was created, run

$ python setup.py install

That completes the installation.

Experiment

ipython3


import chardet

chardet.detect('abc'.encode('utf-8'))
> {'confidence': 1.0, 'encoding': 'ascii'}

chardet.detect('あいうえお'.encode('utf-8'))
> {'confidence': 0.9690625, 'encoding': 'utf-8'}

chardet.detect('あいうえお'.encode('Shift-JIS'))
> {'confidence': 0.5, 'encoding': 'windows-1252'}

It worked properly. I'm a little concerned that 'あいうえお'.encode('Shift-JIS') was judged to be windows-1252, but the confidence is only 0.5, so chardet itself is not sure. The input was too short, so this can't be helped.
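
The ambiguity can be seen without chardet at all. As a minimal sketch using only the standard library codecs, the same bytes decode without error under both Shift_JIS and windows-1252 (named cp1252 in Python), so a detector has very little evidence to go on with such a short input. The variable names here are just for illustration.

```python
# The Shift_JIS bytes for 'あいうえお' are also valid windows-1252 bytes,
# so both decodings succeed; only one of them is meaningful text.
data = 'あいうえお'.encode('shift_jis')

as_sjis = data.decode('shift_jis')   # the intended text
as_cp1252 = data.decode('cp1252')    # also succeeds, but yields symbols

print(len(data))
print(as_sjis)
print(repr(as_cp1252))
```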

I ran a further experiment to see whether it could be used when scraping web pages.

The target website is Kakaku.com (http://kakaku.com/). It is a good fit because it uses Shift_JIS.

ipython3


import chardet
import requests

r = requests.get('http://kakaku.com')
chardet.detect(r.content)
> {'confidence': 0.99, 'encoding': 'SHIFT_JIS'}

It detected the encoding correctly. Unlike the 'あいうえお'.encode('Shift-JIS') example, it seems to have correctly judged SHIFT_JIS rather than windows-1252 because the input was a long bytes string covering the entire web page. The confidence has also gone up.
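
In a scraping script, the detection result would typically be used to decode the page body. The following is a minimal sketch and not part of the original experiment: `decode_with_guess`, `min_confidence`, and `fallback` are hypothetical names, and the function only assumes that the result is a dict with 'encoding' and 'confidence' keys like the ones shown above.

```python
def decode_with_guess(data, guess, min_confidence=0.9, fallback='utf-8'):
    """Decode bytes using a chardet-style result dict.

    guess is assumed to look like {'confidence': 0.99, 'encoding': 'SHIFT_JIS'}.
    When the confidence is below min_confidence (a hypothetical threshold)
    or no encoding was guessed, the fallback encoding is used instead.
    """
    encoding = guess.get('encoding')
    confidence = guess.get('confidence') or 0.0
    if not encoding or confidence < min_confidence:
        encoding = fallback
    # errors='replace' keeps the script running even if the guess was wrong.
    return data.decode(encoding, errors='replace')


if __name__ == '__main__':
    body = 'あいうえお'.encode('shift_jis')
    print(decode_with_guess(body, {'confidence': 0.99, 'encoding': 'SHIFT_JIS'}))
```

With the session above, this could be called as decode_with_guess(r.content, chardet.detect(r.content)). The 0.9 threshold is an arbitrary choice for illustration, not a value recommended by chardet.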

Postscript

I later noticed that there is a Python library called cChardet, implemented as a C extension. It can be used with Python 3. PyYoshi is amazing.

It's on PyPI (https://pypi.python.org/pypi/cchardet/), so you can install it with pip.

$ pip install cchardet

While I was at it, I used the top page of Kakaku.com to compare the speed of the two libraries. The code is as follows.

compare.py


import chardet
import cchardet
import requests
import time

if __name__ == '__main__':
    # Fetch the page once so both detectors see the same bytes.
    r = requests.get('http://kakaku.com')

    # time.clock() was deprecated in Python 3.3 and removed in 3.8;
    # time.perf_counter() is the replacement for timing short intervals.
    begin_time = time.perf_counter()
    guessed_encoding = chardet.detect(r.content)
    end_time = time.perf_counter()
    print('chardet: %f, %s' % (end_time - begin_time, guessed_encoding))

    begin_time_of_cc = time.perf_counter()
    guessed_encoding_by_cc = cchardet.detect(r.content)
    end_time_of_cc = time.perf_counter()
    print('cChardet: %f, %s' % (end_time_of_cc - begin_time_of_cc, guessed_encoding_by_cc))

And the result is as follows.

chardet: 1.440141, {'confidence': 0.99, 'encoding': 'SHIFT_JIS'}
cChardet: 0.000589, {'confidence': 0.9900000095367432, 'encoding': 'SHIFT_JIS'}

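The difference is overwhelming, isn't it? On this page cChardet was roughly 2,400 times faster (1.440141 s versus 0.000589 s), with the same guess and the same confidence.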
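
A single measurement can be noisy, so as a hedged sketch the timing could be repeated and the best run taken. `best_time` and `repeat` are hypothetical names introduced here; the detector is passed in as an argument, so the same helper should work for both chardet.detect and cchardet.detect.

```python
import time


def best_time(func, data, repeat=5):
    """Return the shortest wall-clock time (seconds) of repeat calls to func(data)."""
    timings = []
    for _ in range(repeat):
        begin = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - begin)
    return min(timings)


if __name__ == '__main__':
    # len() stands in for a detector so the sketch runs without extra packages.
    print('%f' % best_time(len, b'example bytes'))
```

In compare.py this would be used as best_time(chardet.detect, r.content) and best_time(cchardet.detect, r.content).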

Conclusion

Use cChardet!!!
