I was able to quickly extract text from an image file (scanned form data) using Tesseract, so I have organized the steps as a memorandum.
- Tesseract is an OCR engine that also supports Japanese.
- It is open source; the **license** (Related article 1.) is "**Apache License 2.0**", so it can be used for commercial purposes.
- From **Tesseract 4**, an OCR engine (**AI-OCR**) based on [**LSTM** (Long short-term memory)](https://qiita.com/t_Signull/items/21b82be280b46f467d1b), an extension of **RNNs** (Recurrent Neural Networks), is also included, and using it can be expected to improve extraction accuracy (I think).
- Counting the supported languages listed in "**tesseract/doc/tesseract.1.asc**", there were 117 (as of July 25, 2020).
- First, see "**tesseract-ocr/tesseract**".
- For a Windows environment, download the installer for the latest (v5) stable version (tesseract-ocr-w64-setup-v5.0.0.20190623.exe) from "Tesseract at UB Mannheim", which is linked from that page.
- For the subsequent steps, refer to "How to install Tesseract OCR on Windows" (Related article 2.).
- As test data for execution, I used the test image introduced in "[Excel Macro] Extract text in an image with VBA + OCR".
- Before running, refer to "How to execute OCR with Python" (Related article 3.) and create the Python code.
- The **PyOCR** package is required to use **Tesseract** from Python.
- Using this package, it is (likely) also possible to improve the **extraction accuracy of the text** (e.g., by removing garbage).
- The execution result is shown below.
```
C:\Users\xxx\work>python ocr_card.py test_data_3.png
It is possible to analyze the character string described on the image with a program and acquire only the text as a character string! This time, I would like to introduce a character recognition method that disguises itself as a friend and looks at it.
The character string described on the image
Analyzed programmatically and text
Can be obtained as a character string
Is possible!
This time, it's a fee.
Nori's Wakaroku looks down, a sentence
I would like to introduce a character recognition method.
I will.
```
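The `ocr_card.py` used above is not shown in the source article; a minimal sketch using PyOCR might look like the following. The `lang="jpn"` setting, the layout mode (`tesseract_layout=6`, i.e. "assume a single uniform block of text"), and the optional blank-line cleanup helper are my assumptions, not the article's exact code.

```python
import sys


def ocr_image(image_path, lang="jpn"):
    """Run Tesseract on one image via PyOCR and return the recognized text."""
    # Third-party packages: pip install pyocr pillow
    from PIL import Image
    import pyocr
    import pyocr.builders

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found; is Tesseract installed and on PATH?")
    tool = tools[0]  # Tesseract, if it was detected
    return tool.image_to_string(
        Image.open(image_path),
        lang=lang,  # Japanese requires jpn.traineddata to be installed
        builder=pyocr.builders.TextBuilder(tesseract_layout=6),
    )


def drop_blank_lines(text):
    """Optional post-processing: remove blank lines Tesseract sometimes emits."""
    return "\n".join(line for line in text.splitlines() if line.strip())


if __name__ == "__main__":
    print(ocr_image(sys.argv[1]))
```

Invoked as `python ocr_card.py test_data_3.png`, matching the runs above.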
- Furthermore, referring to "[PyOCR + Tesseract OCR] Reading a horse racing newspaper; accuracy improvement ♬" (Related article 4.), I enlarged the image data.
- As a result, at least the blank lines were removed.
```
C:\Users\xxx\work>python ocr_card.py test_data_3_mod.png
It is possible to analyze the character string described on the image with a program and acquire only the text as a character string! This time, I would like to introduce a talented character recognition method that pretends to be a Japanese-style charge and disguises it as a young illness.
The character string described on the image
Analyzed programmatically and text
Can be obtained as a character string
Is possible!
This time, the Japanese food fee is disguised as Jiàng.
Nori's young illness is a talented sentence
I would like to introduce a character recognition method.
I will.
```
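The enlargement step used to produce `test_data_3_mod.png` can be sketched with Pillow; the scale factor and the function name here are assumptions, not the article's actual code.

```python
# Third-party package: pip install pillow
from PIL import Image


def enlarge(src_path, dst_path, factor=2):
    """Upscale an image before OCR; larger glyphs tend to reduce misrecognition."""
    img = Image.open(src_path)
    # resize() uses bicubic resampling by default, which is fine for upscaling
    resized = img.resize((img.width * factor, img.height * factor))
    resized.save(dst_path)
    return resized.size


# Example: enlarge("test_data_3.png", "test_data_3_mod.png", factor=2)
```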
- In addition, to understand **Tesseract**, "Hachiya Odake's Memorandum Blog" (Related article 5.) is helpful.
- Also, to extract image data from a PDF file and save (convert) it, "How to convert PDF to an image file (JPEG, PNG) with Python" (Related article 6.) is helpful.
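The PDF-to-image step from the referenced article is commonly done with the `pdf2image` package (a wrapper around Poppler); the sketch below is my assumption of that approach, including the output file naming, not the referenced article's exact code.

```python
def page_name(stem, page_number):
    """Build the output file name for one PDF page, e.g. 'doc_1.png'."""
    return f"{stem}_{page_number}.png"


def pdf_to_pngs(pdf_path, out_stem):
    """Save each page of a PDF as a PNG and return the file names."""
    # Third-party package: pip install pdf2image (requires Poppler installed)
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi=300)  # a higher DPI helps OCR accuracy
    names = []
    for i, page in enumerate(pages, start=1):
        name = page_name(out_stem, i)
        page.save(name, "PNG")
        names.append(name)
    return names
```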
- With Mincho and Gothic fonts, the text in the image can be extracted correctly.
- With other fonts, misrecognition is conspicuous.
- From this, it seems likely that there is simply no training data for the other Japanese fonts, i.e., the engine is not trained on them.
  ⇒ **Therefore, training on the fonts used in the original image can be expected to reduce erroneous extraction.**
- In addition, enlarging the image removed the blank lines (erroneous extraction) seen in the first execution result.
  ⇒ **Therefore, enlarging the image data as much as possible can be expected to further reduce erroneous extraction.**
So how do you improve the accuracy?
As an amateur, the methods I can think of are as follows. However, whether they are feasible with the tool (here, Tesseract) needs to be investigated.

(1) For handwriting ⇒ prepare and train on data that reflects the characteristics (habits) of the handwritten characters in the original images.
- "Retraining Tesseract 4.1 on handwritten characters using LSTM" (Related article 7.)
- "[23 listings] OCR (optical character recognition) / handwriting recognition dataset summary" (Related article 8.)
(2) Prepare the fonts used in the original images as training data and train on them.
You can also expect an improvement in accuracy by trying the following three methods described in "[SikuliX] Three Ways to Improve OCR Japanese Reading Accuracy" (Related article 9.) (I think):

(1) Enlarge the image so that it is read at an appropriate font size
(2) Prepare images with as high a resolution as possible
(3) Set a blacklist and a whitelist
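For method (3), a whitelist can be passed to Tesseract from PyOCR; the sketch below assumes pyocr builders expose a `tesseract_configs` list whose items are appended to the tesseract command line, and that the whitelist variable is honored by your Tesseract version (the LSTM engine ignored whitelists in early Tesseract 4 releases). A pure-Python post-filter is included as a fallback that works regardless of engine support.

```python
def digits_only_builder():
    """A PyOCR TextBuilder that asks Tesseract to recognize only digits."""
    # Third-party package: pip install pyocr
    # Assumption: extra items in tesseract_configs are passed to the CLI as-is.
    import pyocr.builders

    builder = pyocr.builders.TextBuilder(tesseract_layout=6)
    builder.tesseract_configs += ["-c", "tessedit_char_whitelist=0123456789"]
    return builder


def apply_whitelist(text, whitelist):
    """Fallback: post-filter OCR output, keeping only whitelisted characters."""
    allowed = set(whitelist) | {"\n"}  # keep line structure intact
    return "".join(ch for ch in text if ch in allowed)
```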