[PYTHON] Create a Japanese OCR environment with Anaconda (tesseract + pyocr)

Overview

I wanted to build an OCR environment using Anaconda alone. I did not know how difficult that would be, so I looked for the easiest way.

Environment

Windows 10, Anaconda, Python 3.6, Spyder 4.1.2

About tesseract and pyocr

After some investigation, tesseract + pyocr turned out to be a common way to do OCR from Python, so I decided to try it.

tesseract: an OCR (optical character recognition) engine currently being developed by Google. Versions 4.0 and later are based on an LSTM (machine learning) recognizer, so the latest version looks like the best choice for recognition accuracy.

pyocr: an OCR tool wrapper for Python. tesseract is one of the tools it supports.

Reference

Build an environment only with Anaconda and try Python + OCR https://qiita.com/anzanshi/items/9ee94affecd74be33159

I used it as a reference, but I got stuck a few times because my environment was different.

Installation of tesseract

There seem to be various ways to install it, but this time I will use Anaconda.

tesseract is available in the conda-forge repository: https://anaconda.org/conda-forge/tesseract

Install it as is (v4.1.1 as of April 14, 2020):
conda install -c conda-forge tesseract
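To confirm that the conda install actually put tesseract on the PATH of the active environment, a small check like the following may help. This is only a sketch: the helper names are made up for illustration, and it assumes `tesseract --version` prints a banner that starts with something like "tesseract 4.1.1".

```python
import re
import shutil
import subprocess


def parse_tesseract_version(banner):
    """Extract a (major, minor, patch) tuple from `tesseract --version` output."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", banner)
    if match is None:
        return None
    # A missing patch number is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def installed_tesseract_version():
    """Return the version of the tesseract found on PATH, or None if absent."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    # Merge stderr into stdout because some builds print the banner there.
    result = subprocess.run(
        [exe, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_tesseract_version(result.stdout)


if __name__ == "__main__":
    print("tesseract version:", installed_tesseract_version())
```

If this prints None, the environment where tesseract was installed is probably not the one currently activated.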

Install pyocr

pyocr comes from a channel called brianjmcguirk, which I had rarely seen before: https://anaconda.org/brianjmcguirk/pyocr

Install this one as is too (currently v0.5):
conda install -c brianjmcguirk pyocr

Try running the code on the official page

Following the article above, I checked the setup with the sample code from the official page.

from PIL import Image
import sys

import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.

Running it gives the following.

Execution result


Will use tool 'Tesseract (sh)'
Available languages: eng, osd
Will use lang 'eng'

As the output shows, only English is available at this point, so Japanese OCR is not possible yet.

Japanese OCR environment creation

Now, let's do OCR in Japanese

Download trained data

Download jpn.traineddata from the page below. Its location has changed since older articles were written, so it was a little hard to find. https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files.md

Note that the data differs depending on the tesseract version! (I got this wrong once ...)

Put it in the right place

This was also a little troublesome. In my environment, the file was picked up once I put it under /Anaconda3/envs/(environment name)/Library/bin/tessdata (eng.traineddata and osd.traineddata were already there).

There is also a tessdata directory directly under (environment name), but that one does not seem to be read.
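To see which tessdata folder actually contains which language files, a short listing script can save some guessing. This is a rough sketch: `list_traineddata` is a hypothetical helper, and the two candidate folders are simply the ones mentioned above, resolved relative to `sys.prefix` on the assumption that the script runs inside the conda environment in question.

```python
import sys
from pathlib import Path


def list_traineddata(tessdata_dir):
    """Return the language codes that have a .traineddata file in the folder."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    return sorted(p.stem for p in folder.glob("*.traineddata"))


# Candidate folders inside the active conda environment (assumed layout).
candidates = [
    Path(sys.prefix) / "Library" / "bin" / "tessdata",
    Path(sys.prefix) / "tessdata",
]
for folder in candidates:
    print(folder, "->", list_traineddata(folder))
```

The script only shows where the files are. Which folder tesseract actually reads still has to be confirmed through the list of available languages reported by pyocr.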

Re-execute

Run the code on the official page again

Execution result


Will use tool 'Tesseract (sh)'
Available languages: eng, jpn, osd
Will use lang 'eng'

"jpn" now appears in the list. Next, let's read some Japanese.
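One detail worth noting first: the sample code still reports 'eng' because it simply takes langs[0], and the list is not sorted by preference. A small helper can make the choice explicit. This is a sketch with a made-up function name, not part of pyocr.

```python
def choose_language(available, preferred="jpn", fallback="eng"):
    """Pick the preferred OCR language if installed, otherwise the fallback."""
    if preferred in available:
        return preferred
    if fallback in available:
        return fallback
    raise ValueError(
        "Neither %r nor %r is installed (found: %s)"
        % (preferred, fallback, ", ".join(available))
    )


# With the language list shown above this selects Japanese.
print(choose_language(["eng", "jpn", "osd"]))  # jpn
```

The returned value can then be passed as the lang argument of tool.image_to_string instead of hard-coding it.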

↓ Test image test.PNG

txt = tool.image_to_string(
    Image.open('test.png'),
    lang="jpn",
    builder=pyocr.builders.TextBuilder(tesseract_layout=6)
)
print( txt )

Execution result


raise TesseractError(status, errors)
pyocr.error.TesseractError: (1, b"Error, unknown command line argument '-psm'\r\n")

An unexpected error occurred ... Here again, an article by someone who hit a similar error was helpful: https://xkage.com/python-ocr.html

I fixed it by rewriting "-psm" to "--psm" in pyocr's tesseract.py and builders.py.
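The rewrite itself is a plain text substitution, because the tesseract version used here rejects the single-dash form, as the error message shows. The following sketch shows one way to express that substitution so that an already fixed "--psm" is left alone. The function name and the sample line are made up for illustration and are not copied from pyocr's source.

```python
import re

# Match "-psm" only when it is not already preceded by another dash.
_OLD_FLAG = re.compile(r"(?<!-)-psm\b")


def use_double_dash_psm(source_text):
    """Replace the old single-dash -psm flag with --psm in the given text."""
    return _OLD_FLAG.sub("--psm", source_text)


# Hypothetical line of the kind one might find when searching for "-psm".
sample = 'cmd += ["-psm", str(layout)]'
print(use_double_dash_psm(sample))  # cmd += ["--psm", str(layout)]
```

Because the pattern skips text that already uses two dashes, applying it twice gives the same result as applying it once. Editing files inside an installed package is still a workaround, so keeping a backup of the original files seems prudent.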

Run again

test.PNG

txt = tool.image_to_string(
    Image.open('test.png'),
    lang="jpn",
    builder=pyocr.builders.TextBuilder(tesseract_layout=6)
)
print( txt )

Execution result


Test test

It worked!

Summary

You can build the environment with Anaconda, but I got stuck quite a bit because there was less information than I expected. Anyway, now I'm going to play around with OCR a lot.

Supplement

There are also reports that pyocr cannot be used with Python 3.7. It seems safer to create the environment with 3.6.
