[PYTHON] The story of wanting to locate characters with OCR on AWS Lambda

What I wanted to do

-- I want to detect the position of characters using OCR
-- I don't use it very often, so I want to run it on Lambda
-- I want to use it from the web

So I went ahead and built it

Character position identification with OCR on AWS Lambda

The repository is here

What is tesseract?

-- Software that performs OCR
-- Installable with brew on a Mac (v3.04)
-- Besides returning the recognized characters, it can also output their positions in hOCR (HTML) or TSV format <- important

How do you run it on Lambda?

-- See StackOverflow ...
-- On Lambda, it works if you upload the standalone binary together with the .so files it needs
-- subprocess (Python's library for running command lines) also works

In other words ... !!

By the way, if you don't build on Amazon Linux, Pillow (PIL) dies on Lambda with a complaint about the ELF header.
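
As a rough illustration of the "bundled binary plus .so files plus subprocess" idea, the sketch below prepends a library directory to LD_LIBRARY_PATH and runs a binary shipped in the package. The helper names `bundled_env` and `run_bundled` are made up for this example and are not part of the repository.

```python
import os
import subprocess


def bundled_env(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir first on LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + existing if existing else "")
    return env


def run_bundled(binary, args, lib_dir):
    """Run a binary shipped in the package and return its combined output (bytes)."""
    return subprocess.check_output(
        [binary] + list(args),
        env=bundled_env(lib_dir),
        stderr=subprocess.STDOUT,
    )
```

Passing the arguments as a list, rather than one shell string, avoids quoting problems when a path contains spaces.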

How to detect the position of the character string?

-- Typical OCR usage just returns the recognized characters as text
-- According to the docs, v3.05 supports TSV output
-- A normal install gives you Tesseract v3.04
-- To use v3.05 you have to build it by hand, following StackOverflow

This was painful.

So here is how to set it up.

Installation

Everything can be done as ec2-user.

Installation of required packages

sudo yum install -y gcc gcc-c++ make
sudo yum install -y autoconf automake
sudo yum install -y libtool
sudo yum install -y libjpeg-devel libpng-devel libtiff-devel zlib-devel
sudo yum install -y git

nvm installation

On Amazon Linux, the version of node installed through yum is too old and makes various things difficult (described later), so install nvm. Note that if you build on anything other than Amazon Linux you will get errors on Lambda, so stick with it.

$ curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.0/install.sh | bash
$ source ~/.bashrc 

$ nvm install v6.9.4  
$ nvm alias default v6.9.4  

#Check version
$ npm -v
$ node -v

Install Leptonica

Leptonica is an open-source image analysis library that tesseract needs in order to run. tesseract v3.05 cannot be used unless you upgrade Leptonica here.

$ cd ~
$ mkdir leptonica
$ cd leptonica

$ wget http://www.leptonica.com/source/leptonica-1.74.tar.gz

# extract (the names must match the archive downloaded above)
$ tar -zxvf leptonica-1.74.tar.gz
$ cd leptonica-1.74

# build
$ ./configure
$ make
$ sudo make install

Install Tesseract

v3.05 is still in development, so there is no release for it (that is, no zip to download). Clone the repository and build it instead.

$ cd ~
$ git clone https://github.com/tesseract-ocr/tesseract.git
$ cd tesseract/
$ git checkout -b 3.05 origin/3.05

# initialize
$ ./autogen.sh

# build
$ ./configure
$ make
$ sudo make install
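
After the build it may be worth confirming that the installed binary really is 3.05 or newer, since TSV output depends on it. The following is a minimal sketch that parses the text printed by `tesseract --version`; the function names are hypothetical, and the exact output layout is an assumption (only the leading "tesseract X.YY" is relied on).

```python
import re
import subprocess


def parse_tesseract_version(version_output):
    """Extract (major, minor) from the text printed by `tesseract --version`."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no version number found in: %r" % version_output)
    return int(match.group(1)), int(match.group(2))


def supports_tsv(version):
    """TSV output needs 3.05 or newer, per the notes above."""
    return version >= (3, 5)


def installed_tesseract_version(binary="tesseract"):
    # Some releases print the version on stderr, so capture both streams.
    output = subprocess.check_output([binary, "--version"], stderr=subprocess.STDOUT)
    return parse_tesseract_version(output.decode("utf-8", "replace"))
```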

Packaging for Lambda

$ cd ~
$ mkdir package
$ cd package

# Copy libraries
$ cp /usr/local/bin/tesseract .
$ mkdir lib
$ cd lib
$ cp /usr/local/lib/libtesseract.so.3 .
$ cp /usr/local/lib/liblept.so.5 .
$ cp /lib64/librt.so.1 .
$ cp /lib64/libz.so.1 .
$ cp /usr/lib64/libpng12.so.0 .
$ cp /usr/lib64/libjpeg.so.62 .
$ cp /usr/lib64/libtiff.so.5 .
$ cp /lib64/libpthread.so.0 .
$ cp /usr/lib64/libstdc++.so.6 .
$ cp /lib64/libm.so.6 .
$ cp /lib64/libgcc_s.so.1 .
$ cp /lib64/libc.so.6 .
$ cp /lib64/ld-linux-x86-64.so.2 .
$ cp /usr/lib64/libjbig.so.2.0 .

# Get trained data
$ cd ..
$ mkdir tessdata
$ cd tessdata
$ wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
$ wget https://github.com/tesseract-ocr/tessdata/raw/master/osd.traineddata

# Make config file (tesseract looks for it under tessdata/configs)
$ mkdir configs
$ echo 'tessedit_create_tsv 1' > configs/tsv

$ cd ../..
$ zip -r package.zip package

Now you can use it by bundling the contents of package into your Lambda deployment package!
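
The list of .so files copied above depends on how the binary was built. One way to double-check it, sketched here under the assumption that `ldd` is available on the build machine, is to parse its output for the resolved library paths. `parse_ldd` and `list_dependencies` are illustrative names, not something from the repository.

```python
import subprocess


def parse_ldd(ldd_output):
    """Return the absolute library paths listed in `ldd` output, in order."""
    paths = []
    for line in ldd_output.splitlines():
        line = line.strip()
        if "=>" in line:
            # e.g. "liblept.so.5 => /usr/local/lib/liblept.so.5 (0x...)"
            target = line.split("=>", 1)[1].strip()
        else:
            # e.g. "/lib64/ld-linux-x86-64.so.2 (0x...)"
            target = line
        path = target.split(" (", 1)[0].strip()
        # Skips the vdso entry and "not found" lines, which have no real path.
        if path.startswith("/"):
            paths.append(path)
    return paths


def list_dependencies(binary):
    """Run ldd on a binary and return the shared libraries it resolves to."""
    output = subprocess.check_output(["ldd", binary])
    return parse_ldd(output.decode("utf-8", "replace"))
```

Calling `list_dependencies("/usr/local/bin/tesseract")` on the build instance should give the candidates to copy into lib.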

Trying it out (lol)

Sorry for laughing.

image.png

This is the result for the image above:

level	page_num	block_num	par_num	line_num	word_num	left	top	width	height	conf	text
1	1	0	0	0	0	0	0	1080	1920	-1	
2	1	1	0	0	0	29	11	1025	50	-1	
3	1	1	1	0	0	29	11	1025	50	-1	
4	1	1	1	1	0	29	11	1025	50	-1	
5	1	1	1	1	1	29	11	548	50	60	GnAflQflAA
5	1	1	1	1	2	640	15	167	43	58	X-IIZII"
5	1	1	1	1	3	899	14	155	44	89	l11:57
2	1	2	0	0	0	0	0	1080	76	-1	
3	1	2	1	0	0	0	0	1080	76	-1	
4	1	2	1	1	0	0	0	1080	76	-1	
5	1	2	1	1	1	0	0	1080	76	95	 
2	1	3	0	0	0	192	829	197	66	-1	
3	1	3	1	0	0	192	829	197	66	-1	
4	1	3	1	1	0	192	829	197	66	-1	
5	1	3	1	1	1	192	851	93	44	87	00
5	1	3	1	1	2	336	829	53	66	71	la
2	1	4	0	0	0	122	992	718	109	-1	
3	1	4	1	0	0	122	992	718	109	-1	
4	1	4	1	1	0	122	992	718	47	-1	
5	1	4	1	1	1	122	995	88	44	89	Sign
5	1	4	1	1	2	229	995	31	34	94	in
5	1	4	1	1	3	276	997	40	32	86	to
5	1	4	1	1	4	332	997	64	42	89	get
5	1	4	1	1	5	410	993	66	36	86	the
5	1	4	1	1	6	493	997	104	32	84	most
5	1	4	1	1	7	613	997	66	32	86	out
5	1	4	1	1	8	695	992	41	37	91	of
5	1	4	1	1	9	749	1003	91	36	93	your
4	1	4	1	2	0	122	1065	144	36	-1	
5	1	4	1	2	1	122	1065	144	36	87	device.
2	1	5	0	0	0	124	1269	312	46	-1	
3	1	5	1	0	0	124	1269	312	46	-1	
4	1	5	1	1	0	124	1269	312	46	-1	
5	1	5	1	1	1	124	1269	111	36	87	Email
5	1	5	1	1	2	253	1279	40	26	92	or
5	1	5	1	1	3	310	1269	126	46	89	phone
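
To actually locate a word from output like this, the TSV can be read back and filtered to the word-level rows (level 5). The sketch below assumes Python 3 and the column names shown in the header line above; `parse_words` and `find_word` are names invented for this example.

```python
import csv
import io


def parse_words(tsv_text, min_conf=0):
    """Return word-level rows (level 5) with their bounding boxes."""
    # QUOTE_NONE keeps stray quote characters in the recognized text literal.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if row.get("level") != "5" or not text:
            continue
        conf = float(row["conf"])
        if conf < min_conf:
            continue
        left, top = int(row["left"]), int(row["top"])
        width, height = int(row["width"]), int(row["height"])
        words.append({
            "text": text,
            "conf": conf,
            "left": left,
            "top": top,
            "right": left + width,
            "bottom": top + height,
        })
    return words


def find_word(words, target):
    """Return every box whose text matches target exactly."""
    return [w for w in words if w["text"] == target]
```

Filtering with `min_conf` is one simple way to drop noise such as the garbled status-bar text in the first block.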

The source looks like this:

import requirements

from PIL import Image
import sys
import pyocr
import pyocr.builders

import urllib
import os
import subprocess
import base64
import json
import boto3

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
LIB_DIR = os.path.join(SCRIPT_DIR, 'lib')
LANG_DIR = os.path.join(SCRIPT_DIR, 'tessdata')

def response(code, body):
    return {
        'statusCode': code,
        'headers': {
            'Access-Control-Allow-Origin': '*',
        },
        'body': json.dumps(body),
    }

def handler(event, context):
    # Decode the uploaded image, run tesseract on it, and print the TSV result
    try:
        tools = pyocr.get_available_tools()
        if len(tools) == 0:
            print("No OCR tool found")
            sys.exit(1)
        tool = tools[0]
        print("Will use tool '%s'" % (tool.get_name()))

        request = event['body']

        result_filepath = '/tmp/result'
        img_filepath = '/tmp/image.png'
        with open(img_filepath, 'wb') as fh:
            fh.write(base64.b64decode(request['template']))

        command = 'LD_LIBRARY_PATH={} TESSDATA_PREFIX={} {}/tesseract {} {} -l eng --oem 0  tsv'.format(
            LIB_DIR,
            SCRIPT_DIR,
            SCRIPT_DIR,
            img_filepath,
            result_filepath
        )
        print(command)

        try:
            output = subprocess.check_output(
                command,
                shell=True,
                stderr=subprocess.STDOUT
            )
            print(output)

            with open(result_filepath + '.tsv', 'rb') as fh:
                print(fh.read())
        except subprocess.CalledProcessError as e:
            return "except:: " + e.output

    except Exception as e:
        print(e)
        raise e
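
For trying the handler locally, it can help to build an event in the shape it reads, namely a base64-encoded image under event['body']['template']. This is only a sketch assuming Python 3; `make_event`, `decode_image`, and `event_json` are hypothetical helpers, and the real event shape depends on how the API Gateway integration is configured.

```python
import base64
import json


def make_event(image_bytes):
    """Build an event with the image base64-encoded under body.template."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"body": {"template": encoded}}


def decode_image(event):
    """Recover the raw image bytes the same way the handler does."""
    return base64.b64decode(event["body"]["template"])


def event_json(image_bytes):
    """Serialize the event, e.g. to save as a test payload file."""
    return json.dumps(make_event(image_bytes))
```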

After that, feel free to adapt the serverless.yml on GitHub to your needs.
