[Python] I tried extracting the text in an image file using the Tesseract OCR engine

Background

I was able to quickly extract text data from an image file (scanned form data) using Tesseract, so I organized the steps as a memorandum.

- Tesseract is an OCR engine that also supports Japanese.
- It is open source, its **license** (Related article 1.) is the **Apache License 2.0**, and it can be used commercially.
- Since **Tesseract 4**, an OCR engine (**AI-OCR**) based on [**LSTM** (Long short-term memory)](https://qiita.com/t_Signull/items/21b82be280b46f467d1b), an extension of **RNN** (Recurrent Neural Network), has also been included, so improved extraction accuracy can be expected (I think).
- Counting the supported languages listed in "**tesseract/doc/tesseract.1.asc**", there were 117 (as of July 25, 2020).

1. Introduction

- First, see "**tesseract-ocr/tesseract**".
- For a Windows environment, download the installer for the latest (v5) stable version (tesseract-ocr-w64-setup-v5.0.0.20190623.exe) from "Tesseract at UB Mannheim", which is linked from that page.
- For the subsequent steps, refer to "How to install Tesseract OCR on Windows" (Related article 2.).
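Once the installer has finished, one quick way to check that Tesseract (and its Japanese language data) is visible from Python is the **PyOCR** package used in the next section. This is only a minimal sketch, assuming PyOCR has been installed with pip and the Tesseract install directory is on the PATH:

```python
# check that Tesseract is found and list the installed language data
import pyocr

tools = pyocr.get_available_tools()
if not tools:
    raise SystemExit("Tesseract was not found (check that it is on the PATH)")

tool = tools[0]
print(tool.get_name())                 # e.g. "Tesseract (sh)"
print(tool.get_available_languages())  # e.g. ['eng', 'jpn', 'osd']
```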

2. Run

- As test data for the run, the test image data introduced in "[Excel Macro] Extract text in an image with VBA + OCR" was used.
- Before running, refer to "How to execute OCR with Python" (Related article 3.) and create the Python code (a minimal sketch is shown below).
- The **PyOCR** package is required in order to use **tesseract** from Python.
- Using this package can (likely) also improve the **extraction accuracy** of the text (i.e., remove garbage characters).
- The execution result is shown after the sketch.
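The actual script follows "How to execute OCR with Python" (Related article 3.); the following is only a minimal sketch of what an ocr_card.py might look like, assuming Pillow and PyOCR are installed and the jpn language data is available (the layout setting is an assumption):

```python
# ocr_card.py (minimal sketch) -- usage: python ocr_card.py <image file>
import sys

import pyocr
import pyocr.builders
from PIL import Image

tools = pyocr.get_available_tools()
if not tools:
    raise SystemExit("No OCR tool found (is Tesseract installed and on the PATH?)")
tool = tools[0]  # use the first tool found (Tesseract)

# read the image given on the command line and run OCR with the Japanese model
image = Image.open(sys.argv[1])
text = tool.image_to_string(
    image,
    lang="jpn",
    builder=pyocr.builders.TextBuilder(tesseract_layout=6),  # 6 = assume a single uniform block of text
)
print(text)
```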

```
C:\Users\xxx\work>python ocr_card.py test_data_3.png
It is possible to analyze the character string described on the image with a program and acquire only the text as a character string.!This time, I would like to introduce a character recognition method that disguises itself as a friend and looks at it.
The character string described on the image
Analyzed programmatically and text

Can be obtained as a character string

Is possible!

This time, it ’s a fee.
Nori's Wakaroku looks down, a sentence
I would like to introduce a character recognition method.
I will.
```

- Furthermore, referring to "[Pyocr + Tesseract OCR] Printing of horse racing newspaper; accuracy improvement ♬" (Related article 4.), the image data was enlarged (a minimal resize sketch is shown below).
- As a result, at least the blank lines were removed, as the following output shows.
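The enlargement itself can be done with Pillow. The following is a minimal sketch that produces test_data_3_mod.png; the 3x scale factor and the resampling filter are assumptions, not values taken from the referenced article:

```python
# enlarge the test image before OCR (minimal sketch)
from PIL import Image

SCALE = 3  # assumed scale factor; larger characters tend to reduce mis-extraction

image = Image.open("test_data_3.png")
enlarged = image.resize(
    (image.width * SCALE, image.height * SCALE),
    Image.LANCZOS,  # high-quality resampling filter
)
enlarged.save("test_data_3_mod.png")
```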

```
C:\Users\xxx\work>python ocr_card.py test_data_3_mod.png
It is possible to analyze the character string described on the image with a program and acquire only the text as a character string.!This time, I would like to introduce a talented character recognition method that pretends to be a Japanese-style charge and disguise it as a young illness.
The character string described on the image
Analyzed programmatically and text
Can be obtained as a character string
Is possible!

This time, the Japanese food fee is disguised as Jiàng.
Nori's young illness is a talented sentence
I would like to introduce a character recognition method.
I will.
```

- In addition, "Otake Hachiya's memorandum blog" (Related article 5.) is helpful for understanding **Tesseract**.
- Also, for extracting image data from a PDF file and saving (converting) it, "How to convert PDF to image file (JPEG, PNG) with Python" (Related article 6.) is helpful (a sketch of one possible approach is shown below).
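One common way to do the PDF-to-image conversion in Python is the pdf2image package (an assumption on my part; Related article 6. describes the actual method). A minimal sketch, which also requires poppler to be installed:

```python
# convert each page of a PDF into a PNG image (minimal sketch)
from pdf2image import convert_from_path  # requires poppler

pages = convert_from_path("input.pdf", dpi=300)  # higher dpi gives the OCR more pixels to work with
for i, page in enumerate(pages):
    page.save(f"page_{i + 1}.png", "PNG")
```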

3. Consideration

- With the Mincho and Gothic fonts, the text in the image could be extracted correctly.
- With other fonts, mis-extraction is conspicuous.
- From this, it seems likely that there is simply no training data for those other Japanese fonts. ⇒ **Therefore, training on the fonts used in the original image can be expected to reduce erroneous extraction.**
- In addition, enlarging the image removed the blank lines (erroneous extraction) seen in the first execution result. ⇒ **Therefore, enlarging the image data as much as possible can be expected to further reduce erroneous extraction.**

4. Accuracy improvement

So how do you improve the accuracy?

・Create your own training data and train on it

The methods an amateur can think of are as follows. However, it is necessary to investigate whether the tool (in this case, Tesseract) actually supports them.

(1) For handwriting ⇒ prepare training data that reflects the characteristics (habits) of the handwritten characters in the original images and train on it.
・"Let Tesseract 4.1 relearn handwriting using LSTM" (Related article 7.)
・"[23 listings] OCR (optical character recognition) / handwriting recognition dataset summary" (Related article 8.)

(2) Prepare the font used in the original image as training data and train on it.

・Methods that can be tried right away

Trying the following three approaches described in "[SikuliX] Three Ways to Improve OCR Japanese Reading Accuracy" (Related article 9.) can also be expected to improve accuracy (I think):

(1) Enlarge the image to an appropriate font size before reading
(2) Prepare images with as high a resolution as possible
(3) Set a blacklist and a whitelist (a command-line sketch of this is shown below)
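For (3), Tesseract itself provides the tessedit_char_whitelist / tessedit_char_blacklist configuration variables, which can be passed with the -c option of the tesseract command. The following is a minimal sketch; the file names and the digits-only whitelist are assumptions (e.g. for reading a purely numeric field), and note that whitelists were not reliably honored by the LSTM engine in early Tesseract 4.x releases. The recognized text is written to result.txt:

```
C:\Users\xxx\work>tesseract test_data_3_mod.png result -l jpn -c tessedit_char_whitelist=0123456789
```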

Related article

  1. Understanding OSS licenses (Do you know the difference between "利用 (riyō)" and "使用 (shiyō)", both of which translate as "use"?)
  2. How to install Tesseract OCR on Windows
  3. How to execute OCR in Python
  4. [Pyocr + Tesseract OCR] Printing of horse racing newspaper; accuracy improvement ♬
  5. Notes from Otake Hachiya Blog
  6. How to convert PDF to image file (JPEG, PNG) with Python
  7. Let Tesseract 4.1 relearn handwriting using LSTM
  8. [23 listings] OCR (optical character recognition) / handwriting recognition dataset summary
  9. [SikuliX] 3 ways to improve OCR Japanese reading accuracy
  10. Documentation of Tesseract OCR
  11. tesseract-ocr/tesseract
  12. Character recognition with Python and Tesseract OCR
  13. Let Tesseract 4.1 relearn Japanese using LSTM
  14. Try to read sentences written in oracle bone script with OCR
  15. Learning with the character recognition engine Tesseract OCR
  16. [Create training data for Tesseract with jTessBoxEditor](https://nekodeki.com/jtessboxeditor%E3%81%A7tesseract%E3%81%AE%E5%AD%A6%E7%BF%92%E3%83%87%E3%83%BC%E3%82%BF%E3%82%92%E4%BD%9C%E6%88%90%E3%81%99%E3%82%8B/#Tesseract)
  17. Try simple OCR with Tesseract + PyOCR
  18. Using Tesseract with PyOCR
  19. [Try OCR by resizing an image of 10 rows and 10 columns of evenly spaced characters without ruled lines (Python + Tesseract)](http://chuckischarles.hatenablog.com/entry/2018/11/14/000952)
  20. [How to use the tesseract command (Tesseract OCR 4.x)](https://blog.machine-powers.net/2018/08/02/learning-tesseract-command-utility/#%E3%83%9A%E3%83%BC%E3%82%B8%E3%82%BB%E3%82%B0%E3%83%A1%E3%83%B3%E3%83%86%E3%83%BC%E3%82%B7%E3%83%A7%E3%83%B3%E3%83%A2%E3%83%BC%E3%83%89-psm)
  21. I tried playing with the PSM option of tesseract
  22. Basic usage of Tesseract 4 written in Python. How to run OCR from API and CLI
