[Python] Get the numbers in the graph image with OCR

Purpose

I want to calculate the medal difference (the net number of medals won or lost) from the graph images on a pachislot data site.

To do that, I need the medal count printed on the graph image, so I read that number with OCR.

20200713_p-bbnippori_126.png

This is the kind of graph image I mean.

What I want to get is the number shown in the upper left (2410 in this image).

What to prepare

・ Tesseract (4.0 or later)
・ PyOCR

Installation steps are omitted here. Reference links are posted at the bottom of the page, so please use those.

Try OCR

First, I run OCR on the graph image as is.

from PIL import Image
import pyocr
import pyocr.builders
import sys

file_path = 'File Path'
# Load the available OCR tools
tools = pyocr.get_available_tools()
# Exit if no OCR tool (e.g. Tesseract) is found
if len(tools) == 0:
    print('No OCR tool found. Please install Tesseract.')
    sys.exit(1)
tool = tools[0]
#Image loading
img_org = Image.open(file_path)
# OCR
max_medals = tool.image_to_string(img_org, lang='jpn', builder=pyocr.builders.DigitBuilder(tesseract_layout=6))
print(f'max_medals:{max_medals}')

Execution result


-

I couldn't get any numbers.

After some research, I found that OCR of digits seems to be more accurate with the English trained data, so I changed the language setting to English.

Change language setting to English

from PIL import Image
import pyocr
import pyocr.builders
import sys

file_path = 'File Path'
# Load the available OCR tools
tools = pyocr.get_available_tools()
# Exit if no OCR tool (e.g. Tesseract) is found
if len(tools) == 0:
    print('No OCR tool found. Please install Tesseract.')
    sys.exit(1)
tool = tools[0]
#Image loading
img_org = Image.open(file_path)
# OCR
max_medals = tool.image_to_string(img_org, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=6))
print(f'max_medals:{max_medals}')

Execution result


2410 1300160019.00

This time, I was able to read some of the numbers shown in the image.

However, it also read parts I do not need, so I changed the code to crop only the region I want to read before running OCR.

OCR after cropping the region to read

from PIL import Image
import pyocr
import pyocr.builders
import sys

file_path = 'File Path'
# Load the available OCR tools
tools = pyocr.get_available_tools()
# Exit if no OCR tool (e.g. Tesseract) is found
if len(tools) == 0:
    print('No OCR tool found. Please install Tesseract.')
    sys.exit(1)
tool = tools[0]
#Image loading
img_org = Image.open(file_path)
#Cut out the number notation part
max_medals_img = img_org.crop((0, 0, 45, 15))
# OCR
max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=6))
print(f'max_medals:{max_medals}')

Execution result


max_medals:2410

It went well!
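For reference, the tuple passed to crop is a box given as (left, upper, right, lower) in pixels, with the origin at the top-left corner of the image. The small helper below is only an illustration (the name box_from_size is my own, not part of Pillow); it makes the size of the cropped region explicit.

```python
def box_from_size(left, upper, width, height):
    """Return a (left, upper, right, lower) box for PIL's Image.crop()."""
    return (left, upper, left + width, upper + height)


# The crop used above covers a 45 x 15 pixel region at the top-left corner.
label_box = box_from_size(0, 0, 45, 15)
print(label_box)  # (0, 0, 45, 15)
```

The result can be passed straight to img_org.crop(label_box).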

Improving accuracy

Since the previous code worked well, I increased the number of graph images to read and tried again.

from PIL import Image
import pyocr
import pyocr.builders
import os
import sys
from glob import glob

dir_path = 'File storage directory'
# Load the OCR tool once, before the loop
tools = pyocr.get_available_tools()
# Exit if no OCR tool (e.g. Tesseract) is found
if len(tools) == 0:
    print('No OCR tool found. Please install Tesseract.')
    sys.exit(1)
tool = tools[0]
# Build the list of PNG files to read
file_list = glob(os.path.join(dir_path, '*.png'))
for file_path in file_list:
    #Image loading
    img_org = Image.open(file_path)
    #Cut out the number notation part
    max_medals_img = img_org.crop((0, 0, 45, 15))
    # OCR
    max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=6))
    print(f'max_medals:{max_medals}')

Execution result


max_medals:2410
max_medals:
max_medals:490
max_medals:2717
max_medals:689
max_medals:504
max_medals:1013
max_medals:
max_medals:862
max_medals:979
max_medals:835
max_medals:1683
max_medals:1587
max_medals:1010
max_medals:7
max_medals:1586
max_medals:1653
max_medals:413
max_medals:1167
max_medals:527

Some images were not read correctly.

I had previously run OCR on images in a different format without any errors, and those images had the following format:

"Background color: white, text color: black"

So I tried inverting the background color and the text color.

Invert background color and text color

from PIL import Image, ImageOps
import pyocr
import pyocr.builders
import os
import sys
from glob import glob

dir_path = 'File storage directory'
# Load the OCR tool once, before the loop
tools = pyocr.get_available_tools()
# Exit if no OCR tool (e.g. Tesseract) is found
if len(tools) == 0:
    print('No OCR tool found. Please install Tesseract.')
    sys.exit(1)
tool = tools[0]
# Build the list of PNG files to read
file_list = glob(os.path.join(dir_path, '*.png'))
for file_path in file_list:
    #Image loading
    img_org = Image.open(file_path)
    #Cut out the number notation part
    max_medals_img = img_org.crop((0, 0, 45, 15))
    #Invert background color and text color (convert from white text to black text)
    max_medals_img = ImageOps.invert(max_medals_img.convert('RGB'))
    # OCR
    max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=6))
    print(f'max_medals:{max_medals}')

Execution result


max_medals:2410
max_medals:440
max_medals:490
max_medals:2717
max_medals:689
max_medals:504
max_medals:1013
max_medals:791
max_medals:862
max_medals:979
max_medals:835
max_medals:1683
max_medals:1587
max_medals:1010
max_medals:1132
max_medals:1586
max_medals:1653
max_medals:413
max_medals:1167
max_medals:527

The images that could not be read before are now recognized correctly.
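The inversion in the code is applied to every image unconditionally, which is fine here because all of the graph images use light text on a dark background. If a set mixed both styles, one option would be to invert only when the crop looks dark. The following is a minimal sketch of that idea, not something the code in this article does: the function name is_dark_background and the threshold of 128 are assumptions of mine.

```python
def is_dark_background(gray_pixels, threshold=128):
    """Guess whether an image has a dark background.

    gray_pixels: iterable of 0-255 grayscale values.
    threshold: assumed cut-off; a mean brightness below it counts as dark.
    """
    pixels = list(gray_pixels)
    if not pixels:
        raise ValueError('no pixels given')
    return sum(pixels) / len(pixels) < threshold


# With Pillow, the values could come from the cropped label, for example:
#     gray_pixels = max_medals_img.convert('L').getdata()
#     if is_dark_background(gray_pixels):
#         max_medals_img = ImageOps.invert(max_medals_img.convert('RGB'))
mostly_black = [0] * 90 + [255] * 10   # white digits on a black background
mostly_white = [255] * 90 + [0] * 10   # black digits on a white background
print(is_dark_background(mostly_black))  # True
print(is_dark_background(mostly_white))  # False
```

The check works on plain numbers, so it can be tried without any image files.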

I then ran the inversion code above on a larger sample of images ...

Execution result


max_medals:1908.
max_medals:
max_medals:1000-
max_medals:10

There are still rare cases where characters that are not in the image are mixed into the result, the number of digits is wrong, or the number is not recognized at all.

(7 out of 10,000)
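When the sample grows to thousands of images, it helps to count the suspicious results automatically instead of reading the output by eye. The sketch below is one way to do that; the function name classify_reading and the three labels are my own choices for illustration. Note that a reading with the wrong number of digits (such as the "10" above) still counts as "ok" here, because that kind of error cannot be detected without knowing the true value.

```python
import re
from collections import Counter


def classify_reading(text):
    """Label one OCR result as 'ok', 'empty', or 'extra_chars'."""
    text = text.strip()
    if not text:
        return 'empty'
    if re.fullmatch(r'[0-9]+', text):
        return 'ok'
    return 'extra_chars'


# The four problem outputs shown above
readings = ['1908.', '', '1000-', '10']
summary = Counter(classify_reading(r) for r in readings)
print(summary)
```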

To improve accuracy further, I changed the OCR mode.

Change the mode from 6 to 8 (treat the image as a single word)

from PIL import Image, ImageOps
import pyocr
import pyocr.builders
import os
import sys
from glob import glob

dir_path = 'File storage directory'
# Load the OCR tool once, before the loop
tools = pyocr.get_available_tools()
# Exit if no OCR tool (e.g. Tesseract) is found
if len(tools) == 0:
    print('No OCR tool found. Please install Tesseract.')
    sys.exit(1)
tool = tools[0]
# Build the list of PNG files to read
file_list = glob(os.path.join(dir_path, '*.png'))
for file_path in file_list:
    #Image loading
    img_org = Image.open(file_path)
    #Cut out the number notation part
    max_medals_img = img_org.crop((0, 0, 45, 15))
    #Invert background color and text color (convert from white text to black text)
    max_medals_img = ImageOps.invert(max_medals_img.convert('RGB'))
    # OCR
    max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=8))
    print(f'max_medals:{max_medals}')

I changed to the mode that treats the whole image as a single word. This mode should be the best fit, because OCR runs after cropping the image down to just the number. In practice it was indeed more accurate.

(Failures dropped to about 4 out of 10,000.)
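The tesseract_layout value corresponds to Tesseract's page segmentation mode. To see how the modes differ on a given crop, the same image can be run through several of them side by side. The sketch below shows one way to arrange that. The helper compare_layouts and the stand-in recognizer are hypothetical, and the values the stand-in returns are made up so that the code runs without Tesseract; only the mode descriptions come from `tesseract --help-psm`.

```python
# Page segmentation modes as described by `tesseract --help-psm`
PSM_DESCRIPTIONS = {
    6: 'Assume a single uniform block of text.',
    7: 'Treat the image as a single text line.',
    8: 'Treat the image as a single word.',
}


def compare_layouts(recognize, image, layouts=(6, 7, 8)):
    """Run recognize(image, layout) for each layout and collect the results."""
    return {layout: recognize(image, layout) for layout in layouts}


# With pyocr, recognize could be wired up like this (not run here):
#     def recognize(image, layout):
#         builder = pyocr.builders.DigitBuilder(tesseract_layout=layout)
#         return tool.image_to_string(image, lang='eng', builder=builder)

# Stand-in recognizer with made-up results, so the sketch runs without Tesseract
def fake_recognize(image, layout):
    return '1908.' if layout == 6 else '1908'


for layout, text in compare_layouts(fake_recognize, image=None).items():
    print(layout, PSM_DESCRIPTIONS[layout], repr(text))
```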

However, some images were still not recognized correctly, so as a stopgap I added code that removes every character other than digits.

Exclude non-numeric characters

import re
from PIL import Image, ImageOps
import pyocr
import pyocr.builders
import os
import sys
from glob import glob

dir_path = 'File storage directory'
# Load the OCR tool once, before the loop
tools = pyocr.get_available_tools()
# Exit if no OCR tool (e.g. Tesseract) is found
if len(tools) == 0:
    print('No OCR tool found. Please install Tesseract.')
    sys.exit(1)
tool = tools[0]
# Build the list of PNG files to read
file_list = glob(os.path.join(dir_path, '*.png'))
for file_path in file_list:
    #Image loading
    img_org = Image.open(file_path)
    #Cut out the number notation part
    max_medals_img = img_org.crop((0, 0, 45, 15))
    #Invert background color and text color (convert from white text to black text)
    max_medals_img = ImageOps.invert(max_medals_img.convert('RGB'))
    # OCR
    max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=8))
    #Remove non-numeric characters
    max_medals = re.sub(r'\D', '', max_medals)
    print(f'max_medals:{max_medals}')

This avoids cases where characters that are not in the image, such as "-" and ".", are mixed into the result.

However, in rare cases the number itself was still not recognized, or the number of digits was wrong.
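One detail to watch when using the cleaned string in later calculations: if nothing was recognized, the result is an empty string, and int('') raises a ValueError. A minimal sketch that returns None in that case is shown below (the function name to_medal_count is only for illustration).

```python
import re


def to_medal_count(raw_text):
    """Strip non-digits from an OCR result; return an int, or None if nothing is left."""
    digits = re.sub(r'\D', '', raw_text)
    return int(digits) if digits else None


print(to_medal_count('1908.'))  # 1908
print(to_medal_count('1000-'))  # 1000
print(to_medal_count(''))       # None
```

Returning None keeps unreadable images distinguishable from a genuine reading of 0.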

I wondered what to do and came up with various improvement ideas, such as:

Read the numbers in both the upper left and lower left of the image
↓
Compare the two
↓
Adopt the one that looks correct

I thought through several patterns of logic like this, but the code became long and complicated, so I stepped back and reconsidered.

**"If I can improve the OCR recognition accuracy in the first place, I don't need to write troublesome code."**

Having arrived at this rather obvious idea, I tried various crop sizes for the number label in the OCR preprocessing.

Final code

import re
from PIL import Image, ImageOps
import pyocr
import pyocr.builders
import os
import sys
from glob import glob

dir_path = 'File storage directory'
# Load the OCR tool once, before the loop
tools = pyocr.get_available_tools()
# Exit if no OCR tool (e.g. Tesseract) is found
if len(tools) == 0:
    print('No OCR tool found. Please install Tesseract.')
    sys.exit(1)
tool = tools[0]
# Build the list of PNG files to read
file_list = glob(os.path.join(dir_path, '*.png'))
for file_path in file_list:
    #Image loading
    img_org = Image.open(file_path)
    #Cut out the number notation part
    max_medals_img = img_org.crop((0, 0, 44, 14))
    #Invert background color and text color (convert from white text to black text)
    max_medals_img = ImageOps.invert(max_medals_img.convert('RGB'))
    # OCR
    max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=8))
    #Remove non-numeric characters
    max_medals = re.sub(r'\D', '', max_medals)
    print(f'max_medals:{max_medals}')

After trying various crop sizes, this code reached a 100% recognition rate on the graph images I had!

In the end, finding the best crop size worked better than devising extra logic (lol)
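The search for a good crop size can also be automated if the correct number is known for at least some of the images. The article does not say how the sizes were compared, so the following is only a sketch under that assumption. The helper names, the list of candidate boxes, and the stand-in recognizer are all hypothetical; in real use, recognize would crop the image to the box, invert it, and run the OCR call from the final code.

```python
def recognition_rate(samples, recognize, box):
    """Fraction of samples whose OCR result for this box equals the expected text."""
    hits = sum(1 for image, expected in samples if recognize(image, box) == expected)
    return hits / len(samples)


def best_crop_box(samples, recognize, candidate_boxes):
    """Return (box, rate) for the candidate box with the highest recognition rate."""
    rates = {box: recognition_rate(samples, recognize, box) for box in candidate_boxes}
    best = max(rates, key=rates.get)
    return best, rates[best]


# Candidate boxes around the sizes tried in this article
candidates = [(0, 0, w, h) for w in (44, 45, 46) for h in (14, 15, 16)]

# Stand-in recognizer: pretends that only the 44 x 14 box reads every image correctly.
def fake_recognize(image, box):
    return image['label'] if box == (0, 0, 44, 14) else ''


# Each sample is (image, expected text); the images here are placeholders.
samples = [({'label': '2410'}, '2410'), ({'label': '490'}, '490')]
print(best_crop_box(samples, fake_recognize, candidates))  # ((0, 0, 44, 14), 1.0)
```

The samples list must not be empty, since the rate is computed by dividing by its length.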

Conclusion

When characters are not recognized well, work through the following:

**Suspect the image first > Review the settings > Adjust by adding extra logic**

I think you are less likely to get stuck if you tackle things in this order of priority.

Granted, it only had to recognize digits this time, but OCR is remarkably accurate.

Reference links

Character recognition with Python and Tesseract OCR
How to run OCR in Python
