I was able to quickly extract text from an image file (scanned form data) using Tesseract, so I have organized the steps as a memorandum.
- Tesseract is an OCR engine that also supports Japanese.
- It is open source; the **license** (Related article 1.) is "**Apache License 2.0**", so it can be used for commercial purposes.
- From **Tesseract 4**, an OCR engine (**AI-OCR**) based on [**LSTM** (Long short-term memory)](https://qiita.com/t_Signull/items/21b82be280b46f467d1b), an extension of **RNNs** (Recurrent Neural Networks), is also included, and using it can be expected to improve extraction accuracy (I think).
- Counting the supported languages listed in "**tesseract/doc/tesseract.1.asc**", there were 117 (as of July 25, 2020).
- First, see "**tesseract-ocr/tesseract**".
- For a Windows environment, download the installer for the latest (v5) stable version (tesseract-ocr-w64-setup-v5.0.0.20190623.exe) from "Tesseract at UB Mannheim", which is linked from that page.
- For the subsequent steps, refer to "How to install Tesseract OCR on Windows" (Related article 2.).
- As test data for execution, I used the test image introduced in "[Excel Macro] Extract text in an image with VBA + OCR".
- Before running, refer to "How to execute OCR with Python" (Related article 3.) and create the Python code.
- The **PyOCR** package is required to use **Tesseract** from Python.
- Using this package, it is (likely) also possible to improve the **extraction accuracy of the text** (e.g., by removing garbage).
- The execution result is shown below.
```
C:\Users\xxx\work>python ocr_card.py test_data_3.png
It is possible to analyze the character string described on the image with a program and acquire only the text as a character string! This time, I would like to introduce a character recognition method that disguises itself as a friend and looks at it.
The character string described on the image
Analyzed programmatically and text
Can be obtained as a character string
Is possible!
This time, it's a fee.
Nori's Wakaroku looks down, a sentence
I would like to introduce a character recognition method.
I will.
```
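The `ocr_card.py` used above is not shown in the source article; a minimal sketch using PyOCR might look like the following. The `lang="jpn"` setting, the layout mode (`tesseract_layout=6`, i.e. "assume a single uniform block of text"), and the optional blank-line cleanup helper are my assumptions, not the article's exact code.

```python
import sys


def ocr_image(image_path, lang="jpn"):
    """Run Tesseract on one image via PyOCR and return the recognized text."""
    # Third-party packages: pip install pyocr pillow
    from PIL import Image
    import pyocr
    import pyocr.builders

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found; is Tesseract installed and on PATH?")
    tool = tools[0]  # Tesseract, if it was detected
    return tool.image_to_string(
        Image.open(image_path),
        lang=lang,  # Japanese requires jpn.traineddata to be installed
        builder=pyocr.builders.TextBuilder(tesseract_layout=6),
    )


def drop_blank_lines(text):
    """Optional post-processing: remove blank lines Tesseract sometimes emits."""
    return "\n".join(line for line in text.splitlines() if line.strip())


if __name__ == "__main__":
    print(ocr_image(sys.argv[1]))
```

Invoked as `python ocr_card.py test_data_3.png`, matching the runs above.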
- Furthermore, referring to "[PyOCR + Tesseract OCR] Reading a horse racing newspaper; accuracy improvement ♬" (Related article 4.), I enlarged the image data.
- As a result, at least the blank lines were removed.
```
C:\Users\xxx\work>python ocr_card.py test_data_3_mod.png
It is possible to analyze the character string described on the image with a program and acquire only the text as a character string! This time, I would like to introduce a talented character recognition method that pretends to be a Japanese-style charge and disguises it as a young illness.
The character string described on the image
Analyzed programmatically and text
Can be obtained as a character string
Is possible!
This time, the Japanese food fee is disguised as Jiàng.
Nori's young illness is a talented sentence
I would like to introduce a character recognition method.
I will.
```
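The enlargement step used to produce `test_data_3_mod.png` can be sketched with Pillow; the scale factor and the function name here are assumptions, not the article's actual code.

```python
# Third-party package: pip install pillow
from PIL import Image


def enlarge(src_path, dst_path, factor=2):
    """Upscale an image before OCR; larger glyphs tend to reduce misrecognition."""
    img = Image.open(src_path)
    # resize() uses bicubic resampling by default, which is fine for upscaling
    resized = img.resize((img.width * factor, img.height * factor))
    resized.save(dst_path)
    return resized.size


# Example: enlarge("test_data_3.png", "test_data_3_mod.png", factor=2)
```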
- In addition, to understand **Tesseract**, "Hachiya Odake's Memorandum Blog" (Related article 5.) is helpful.
- Also, to extract image data from a PDF file and save (convert) it, "How to convert PDF to an image file (JPEG, PNG) with Python" (Related article 6.) is helpful.
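The PDF-to-image step from the referenced article is commonly done with the `pdf2image` package (a wrapper around Poppler); the sketch below is my assumption of that approach, including the output file naming, not the referenced article's exact code.

```python
def page_name(stem, page_number):
    """Build the output file name for one PDF page, e.g. 'doc_1.png'."""
    return f"{stem}_{page_number}.png"


def pdf_to_pngs(pdf_path, out_stem):
    """Save each page of a PDF as a PNG and return the file names."""
    # Third-party package: pip install pdf2image (requires Poppler installed)
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi=300)  # a higher DPI helps OCR accuracy
    names = []
    for i, page in enumerate(pages, start=1):
        name = page_name(out_stem, i)
        page.save(name, "PNG")
        names.append(name)
    return names
```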
- With Mincho and Gothic fonts, the text in the image can be extracted correctly.
- With other fonts, misrecognition is conspicuous.
- From this, it seems likely that there is simply no training data for the other Japanese fonts, i.e., the engine is not trained on them.
  ⇒ **Therefore, training on the fonts used in the original image can be expected to reduce erroneous extraction.**
- In addition, enlarging the image removed the blank lines (erroneous extraction) seen in the first execution result.
  ⇒ **Therefore, enlarging the image data as much as possible can be expected to further reduce erroneous extraction.**
So how do you improve the accuracy?
As an amateur, the methods I can think of are as follows. However, whether they are feasible with the tool (here, Tesseract) needs to be investigated.

(1) For handwriting ⇒ prepare and train on data that reflects the characteristics (habits) of the handwritten characters in the original images.
- "Retraining Tesseract 4.1 on handwritten characters using LSTM" (Related article 7.)
- "[23 listings] OCR (optical character recognition) / handwriting recognition dataset summary" (Related article 8.)
(2) Prepare the fonts used in the original images as training data and train on them.
You can also expect an improvement in accuracy by trying the following three methods described in "[SikuliX] Three Ways to Improve OCR Japanese Reading Accuracy" (Related article 9.) (I think):

(1) Enlarge the image so that it is read at an appropriate font size
(2) Prepare images with as high a resolution as possible
(3) Set a blacklist and a whitelist
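For method (3), a whitelist can be passed to Tesseract from PyOCR; the sketch below assumes pyocr builders expose a `tesseract_configs` list whose items are appended to the tesseract command line, and that the whitelist variable is honored by your Tesseract version (the LSTM engine ignored whitelists in early Tesseract 4 releases). A pure-Python post-filter is included as a fallback that works regardless of engine support.

```python
def digits_only_builder():
    """A PyOCR TextBuilder that asks Tesseract to recognize only digits."""
    # Third-party package: pip install pyocr
    # Assumption: extra items in tesseract_configs are passed to the CLI as-is.
    import pyocr.builders

    builder = pyocr.builders.TextBuilder(tesseract_layout=6)
    builder.tesseract_configs += ["-c", "tessedit_char_whitelist=0123456789"]
    return builder


def apply_whitelist(text, whitelist):
    """Fallback: post-filter OCR output, keeping only whitelisted characters."""
    allowed = set(whitelist) | {"\n"}  # keep line structure intact
    return "".join(ch for ch in text if ch in allowed)
```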