Notes on doing Japanese OCR with Python

I will summarize the steps I took to do Japanese OCR in Python using the free Tesseract OCR engine.

Environment

Installation

Install tesseract.

Installation policy

How to install

  1. Install with apt-get
  2. Build and install from source

There are two options. The version that can be installed with apt-get (option 1) is 3.03. To handle Japanese with tesseract, Japanese trained data (jpn.traineddata) is required. You have to download this yourself, but the only one available online is for ver. 3.04. If you try to use this data with 3.03, it doesn't work and you get this error:

read_params_file: parameter not found: allow_blob_division

You can also edit the traineddata and use it with 3.03, as this person did, but that requires the `combine_tessdata` command, which is not included in the apt-get installation. Therefore, if you want to do Japanese OCR at present, you will probably have to install from source.

Basically, install tesseract 3.04 by referring to the official compiling/installation page.

https://github.com/tesseract-ocr/tesseract/wiki/Compiling

Dependency installation

$ sudo apt-get install autoconf automake libtool
$ sudo apt-get install libpng12-dev
$ sudo apt-get install libjpeg62-dev
$ sudo apt-get install libtiff4-dev
$ sudo apt-get install zlib1g-dev
$ sudo apt-get install libicu-dev      # (if you plan to make the training tools)
$ sudo apt-get install libpango1.0-dev # (if you plan to make the training tools)
$ sudo apt-get install libcairo2-dev   # (if you plan to make the training tools)

Leptonica installation

Tesseract needs an image library called Leptonica. Download and extract the source from the download page (http://www.leptonica.org/download.html). To install tesseract 3.04 you need at least Leptonica 1.71, so install the latest version, 1.73.

# Extract
$ gzip -dc leptonica-1.73.tar.gz | tar xvf -
$ cd leptonica-1.73

# Configure, build, and install
$ ./configure
$ make
$ sudo make install

tesseract installation

Basically, follow the steps described there.

Get the source of 3.04 from here.

# Unzip and move into the directory
$ unzip 3.04.zip
$ cd tesseract-3.04

# Add /usr/local/lib to the library path
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

# Build and install
$ ./autogen.sh
$ ./configure
$ sudo make          # sudo was needed only here; without it, Leptonica could not be found.
$ sudo make install
$ sudo ldconfig

Acquiring and configuring the Japanese language data

Download the Japanese trained data (jpn.traineddata) from the language data page here and place it in:

/usr/local/share/tessdata/

Then set the environment variable for the data (in tesseract 3.x, TESSDATA_PREFIX should point to the directory that contains tessdata/):

export TESSDATA_PREFIX="/usr/local/share/"
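To make sure tesseract can actually see the Japanese data, you can check the language list. A minimal sketch in Python (assuming the source-built tesseract is on PATH; tesseract 3.x prints part of this output to stderr):

import subprocess

# List the languages tesseract finds under TESSDATA_PREFIX; 'jpn' should appear.
out = subprocess.check_output(["tesseract", "--list-langs"],
                              stderr=subprocess.STDOUT)
print(out)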

Operation check

If the installation was successful, you should be able to run OCR from the command line. Let's try Japanese OCR on this image: ocr_test.png

tesseract ocr_test.png out -l jpn

This writes the result to a file called out.txt.

out.txt


Smile is the best! Reni Takagi

The small "ya" becomes the large "ya", but it is generally recognizable. Is it difficult because there is no concept of lowercase letters in other English?

Introduction of pyocr

To use tesseract from Python, we use a wrapper library called pyocr.

Installation is

$ pip install pyocr

That's it.

However, it does not handle the version string of a tesseract built from source: when I run the following error.py as a test, it fails.

error.py


import pyocr
tools = pyocr.get_available_tools()
Traceback (most recent call last):
  File "error.py", line 12, in <module>
    tools = pyocr.get_available_tools()
  File "/usr/local/lib/python2.7/site-packages/pyocr/pyocr.py", line 74, in get_available_tools
    if tool.is_available():
  File "/usr/local/lib/python2.7/site-packages/pyocr/libtesseract/__init__.py", line 152, in is_available
    version = get_version()
  File "/usr/local/lib/python2.7/site-packages/pyocr/libtesseract/__init__.py", line 179, in get_version
    upd = int(version[2])
ValueError: invalid literal for int() with base 10: '02dev'

Reading the error, it is complaining about trying to convert the string "02dev" to an int. The version installed from source is tesseract 3.04.02dev, and pyocr apparently does not expect dev versions. So let's patch the pyocr source.

If you are using virtualenv, adjust the path of the file to be edited accordingly.

/usr/local/lib/python2.7/site-packages/pyocr/libtesseract/__init__.py


    if len(version) >= 3:
        upd = int(version[2].replace('dev', ''))
        # upd = int(version[2])

This should work.
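To confirm that pyocr now detects the source-built tesseract, a quick check like the following should do (this assumes pyocr's usual tool interface with get_name / get_version / get_available_languages):

import pyocr

tools = pyocr.get_available_tools()
for tool in tools:
    print(tool.get_name())                 # e.g. "Tesseract (sh)"
    print(tool.get_version())              # should no longer raise on "02dev"
    print(tool.get_available_languages())  # 'jpn' should be in this list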

Try OCR

pyocr offers several types of OCR output (builders), so I will try them. I will test with this image.

mlct-14-638.jpg

Text

The simplest OCR: it reads the characters from the image and returns them as a string.

import sys
import pyocr
import pyocr.builders
import argparse
from PIL import Image

parser = argparse.ArgumentParser(description='tesseract ocr test')
parser.add_argument('image', help='image path')
args = parser.parse_args()

tools = pyocr.get_available_tools()

if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]

res = tool.image_to_string(Image.open(args.image),
                           lang="jpn",
                           builder=pyocr.builders.TextBuilder(tesseract_layout=6))

print res

result

Lord of the machine training
Tess Wota
Next to the door screaming
rope
Ship Day^~Genus~Customary betting history ba "Mae

The result is terrible, probably because of difficult words.

WordBox

This returns a bounding box for each word. Let's visualize the result with OpenCV. (Install OpenCV from here.)

import sys
import pyocr
import pyocr.builders
import argparse
import cv2
from PIL import Image

parser = argparse.ArgumentParser(description='tesseract ocr test')
parser.add_argument('image', help='image path')
args = parser.parse_args()


tools = pyocr.get_available_tools()

if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]


res = tool.image_to_string(Image.open(args.image),
                           lang="jpn",
                           builder=pyocr.builders.WordBoxBuilder(tesseract_layout=6))

# draw result 
out = cv2.imread(args.image)
for d in res:
    print d.content
    print d.position
    cv2.rectangle(out, d.position[0], d.position[1], (0, 0, 255), 2)

cv2.imshow('image',out)
cv2.waitKey(0)
cv2.destroyAllWindows()

Screenshot from 2016-07-20 15:12:12.png

Lord of the machine training
((226, 12), (412, 37))
Tess
((255, 138), (278, 148))
Wota
((283, 137), (326, 148))
door
((397, 149), (406, 159))
Screaming next door
((411, 149), (430, 159))
Historical training
((477, 148), (523, 159))
rope
((165, 170), (199, 181))
Ship Day
((115, 202), (156, 212))
^~Genus~
((210, 196), (247, 220))
Customary betting history
((297, 202), (343, 213))
Ba "Mae
((390, 203), (438, 212))

The detected regions are decent, but the recognized words are still terrible.

LineBox

WordBox was word by word, but LineBox groups words on the same line.

I will change only part of the WordBox code: just switch WordBoxBuilder to LineBoxBuilder.

res = tool.image_to_string(Image.open(args.image),
                           lang="jpn",
                           builder=pyocr.builders.LineBoxBuilder(tesseract_layout=6))


result

Screenshot from 2016-07-20 15:34:57.png

Lord of the machine training
((226, 12), (412, 37))
Tess Wota
((255, 137), (326, 148))
Next to the door screaming
((397, 148), (523, 159))
rope
((165, 170), (199, 181))
Ship Day^~Genus~Customary betting history ba "Mae
((115, 196), (438, 220))

For this image, grouping by line doesn't add much, but it seems useful for multi-line text.
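As a side note, each line box also carries the word boxes it contains, so you can get both granularities from one call. A minimal sketch, assuming pyocr's LineBox objects expose a word_boxes attribute:

# res is the list returned by LineBoxBuilder above
for line in res:
    print(line.content)    # the whole line as a string
    print(line.position)   # bounding box of the line
    for word in line.word_boxes:
        # bounding box and text of each word within the line
        print("  %s %s" % (word.content, str(word.position)))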

About tesseract_layout

For each builder, tesseract_layout=6 was set. This number selects the page segmentation mode, i.e. the policy tesseract uses to analyze the layout of the image.

This person has summarized the values: http://tanaken-log.blogspot.jp/2012/08/imagemagick-tesseract.html (a small sketch for comparing a few modes follows the list below).

pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
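Since the best mode depends on the image, it can help to compare a few values side by side. A small sketch using the same sample image as above (the file name is just the example used in this article):

import pyocr
import pyocr.builders
from PIL import Image

tool = pyocr.get_available_tools()[0]
img = Image.open('mlct-14-638.jpg')

# Try a few page segmentation modes on the same image and compare the output
for layout in (3, 4, 6):
    res = tool.image_to_string(img,
                               lang="jpn",
                               builder=pyocr.builders.TextBuilder(tesseract_layout=layout))
    print("--- tesseract_layout=%d ---" % layout)
    print(res)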

About learning data

As you can see, the accuracy is not good with the existing Japanese data. If you create the training data yourself, the results become more reasonable.

http://hadashi-gensan.hatenablog.com/entry/2014/01/15/135316

Bonus Google Cloud Vision

If you use TEXT_DETECTION of the Google Cloud Vision API, the result looks like this.

Screenshot from 2016-07-21 11:28:43.png

machine
Learning
of
flow
test
data
Preliminary
Measurement
vessel
Learning
result
Before
processing
teacher
data
Raw
data
machine
Learning
Parameters
L
Reason
A

As expected, the accuracy is good. If you don't need to make too many requests and just want to process images easily, you should use the Vision API.
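For reference, here is a minimal sketch of calling TEXT_DETECTION over the REST API with an API key (the key and file name are placeholders, and the requests library is assumed to be installed):

import base64
import json
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
URL = "https://vision.googleapis.com/v1/images:annotate?key=" + API_KEY

with open("mlct-14-638.jpg", "rb") as f:
    content = base64.b64encode(f.read())

body = {
    "requests": [{
        "image": {"content": content.decode("utf-8")},
        "features": [{"type": "TEXT_DETECTION"}]
    }]
}

res = requests.post(URL, data=json.dumps(body),
                    headers={"Content-Type": "application/json"})
annotations = res.json()["responses"][0].get("textAnnotations", [])
if annotations:
    print(annotations[0]["description"])  # the full detected text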
