[LINUX] Compile Tesseract for Tess4J to transcribe text from images on CentOS

Requirements

VirtualBox (after all, Java is wonderful.)
Disk image of CentOS 6.3 (ISO)

Virtual environment

Procedure

Set up the VM

The installer first asks whether you want to test the installation media; skip the test. (When I run it, for some reason the installer does not move on.) After that, proceed with the defaults and choose the Desktop installation.

Network settings

At first the network is not connected, so configure it. Open a terminal and switch to the superuser.

su - 

Edit the following file with vi. (Press i for insert mode, Esc to leave it, and type :wq! to save and quit.)

vi /etc/sysconfig/network-scripts/ifcfg-eth0

#Changed from no to yes
ONBOOT=yes
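
If you would rather not edit the file by hand, the same change can be scripted. The following is only a minimal sketch: `set_onboot_yes` is a hypothetical helper name that does not come from the original steps, and it assumes the file uses plain KEY=value lines.

```shell
# Hypothetical helper (not from the original post): force ONBOOT=yes
# in an ifcfg-style file passed as the first argument.
set_onboot_yes() {
    cfg_file="$1"
    tmp_file="$cfg_file.tmp"
    if grep -q '^ONBOOT=' "$cfg_file"; then
        # Rewrite the existing ONBOOT line, whatever its current value,
        # then write the result back into the original file.
        sed 's/^ONBOOT=.*/ONBOOT=yes/' "$cfg_file" > "$tmp_file" &&
            cat "$tmp_file" > "$cfg_file" &&
            rm -f "$tmp_file"
    else
        # No ONBOOT line yet: append one.
        echo 'ONBOOT=yes' >> "$cfg_file"
    fi
}
```

As the superuser you would call it as set_onboot_yes /etc/sysconfig/network-scripts/ifcfg-eth0 and then restart the network service as described next.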

Restart the service.

service network restart

Open Firefox and check that it can reach the internet; if it can, you are done. (Depending on the hardware, you may need to install a driver.)

Other updates

Open a terminal and switch to the superuser.

su - 

Update the installed packages.

yum update

Upgrade Java to 1.8 (use the devel package)

#The devel package also provides javac.
yum install java-1.8.0-openjdk-devel

Eclipse Neon

Install Eclipse Neon. (Newer releases fail with a version error in the drawing library.)
https://www.eclipse.org/downloads/packages/release/neon/3

This article assumes you are using Tess4J in a Maven project.

Install development tools and compile tesseract

Development tools (gcc 4.7 or later is required)

#Development tools
yum -y groupinstall "development tools"
#Image libraries
yum -y install libpng-devel libtiff-devel libjpeg-devel
#Software Collections repository (needed to install devtoolset)
yum -y install centos-release-scl
#Compiler (gcc/g++ from devtoolset-7)
yum -y install devtoolset-7-gcc-c++
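
To see whether the gcc that is currently active meets the 4.7 requirement, a small check like the one below can help. It is a sketch, not part of the original steps: `version_ge` and `check_gcc` are hypothetical helper names, and the comparison assumes GNU sort with the -V (version sort) option.

```shell
# Hypothetical helper: succeed when version $1 is >= version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Hypothetical helper: report whether the gcc on PATH is new enough.
check_gcc() {
    required="4.7"
    if ! command -v gcc >/dev/null 2>&1; then
        echo "gcc not found"
        return 1
    fi
    found=$(gcc -dumpversion)
    if version_ge "$found" "$required"; then
        echo "gcc $found is new enough"
    else
        echo "gcc $found is older than $required"
        return 1
    fi
}
```

Running check_gcc before and after enabling devtoolset-7 shows which compiler the current terminal actually uses.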

Enable the installed environment (devtoolset-7)

This must be run in every new terminal unless you add it to a shell startup file.

source /opt/rh/devtoolset-7/enable
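
One way to avoid retyping this in every terminal is to append the line to a shell startup file once. The sketch below assumes bash and ~/.bashrc as the startup file; `persist_devtoolset` is a hypothetical helper name, not something from the original steps.

```shell
# Hypothetical helper: append the devtoolset-7 enable line to a shell
# startup file (default: ~/.bashrc) unless it is already there.
persist_devtoolset() {
    rc_file="${1:-$HOME/.bashrc}"
    enable_line='source /opt/rh/devtoolset-7/enable'
    # Make sure the file exists so grep does not fail on a missing file.
    touch "$rc_file"
    # -x matches the whole line, -F treats the pattern as a fixed string.
    grep -qxF "$enable_line" "$rc_file" || echo "$enable_line" >> "$rc_file"
}
```

Calling persist_devtoolset with no argument targets ~/.bashrc, and calling it again does not add a second copy of the line. New terminals then start with devtoolset-7 already enabled.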

Install the tools required for compilation (autoconf-archive)

cd /usr/src/
wget http://ftpmirror.gnu.org/autoconf-archive/autoconf-archive-2019.01.06.tar.xz
tar xvvfJ autoconf-archive-2019.01.06.tar.xz
cd autoconf-archive-2019.01.06/
./configure --prefix=/usr
make
make install

Compile and install Leptonica, the image processing library that Tesseract uses.

cd /usr/src/
wget http://leptonica.org/source/leptonica-1.77.0.tar.gz
tar xvvfz leptonica-1.77.0.tar.gz
cd leptonica-1.77.0/
./configure --prefix=/usr/local/
make
make install

Compile and install Tesseract

This time I use version 4.1.1-rc2.

cd /usr/src/
wget https://github.com/tesseract-ocr/tesseract/archive/4.1.1-rc2.tar.gz
#For some reason the downloaded archive name lacks the "tesseract" prefix (only for this version).
tar xvvfz 4.1.1-rc2.tar.gz
cd tesseract-4.1.1-rc2
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure --prefix=/usr/local/ --with-extra-libraries=/usr/local/lib/ --disable-openmp
make install

When the build succeeds, everything is in /usr/local/lib/. Create a "linux-x86-64" folder anywhere you like and copy the files into it.

cp file file... dir

Copy this folder directly under the src/main/resources folder of the Eclipse project that uses Tesseract.
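
The copy step can be scripted roughly as follows. This is a sketch under assumptions: `collect_native_libs` is a hypothetical helper name, and it copies only files whose names match *.so*, on the assumption that the shared libraries are what is needed. Symbolic links are copied as regular files (cp -L), because links do not survive being packed into a jar.

```shell
# Hypothetical helper: copy shared libraries from a library directory
# into <resources_dir>/linux-x86-64.
#   $1 = source directory (e.g. /usr/local/lib)
#   $2 = resources directory of the project (e.g. .../src/main/resources)
collect_native_libs() {
    lib_src="$1"
    lib_dest="$2/linux-x86-64"
    mkdir -p "$lib_dest"
    for lib_file in "$lib_src"/*.so*; do
        # Skip the unexpanded pattern and anything that is not a file.
        [ -f "$lib_file" ] || continue
        # -L follows symbolic links and copies the real file contents.
        cp -L "$lib_file" "$lib_dest/"
    done
}
```

For the layout described here, the call would look like collect_native_libs /usr/local/lib /path/to/project/src/main/resources, where the project path is a placeholder for your own.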

Launch Eclipse from the terminal

First, set the locale; for some reason this is required. (https://github.com/nguyenq/tess4j/issues/105)

export LC_ALL=C

In the same terminal, type the path to the Eclipse executable to start it.
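
If you prefer not to change the locale of the whole terminal session, the variable can be set for a single command instead. The wrapper below is a sketch; `run_in_c_locale` is a hypothetical name, and it uses the standard env command.

```shell
# Hypothetical wrapper: run any command with LC_ALL=C set only for
# that command, leaving the calling shell's locale untouched.
run_in_c_locale() {
    env LC_ALL=C "$@"
}
```

For example, run_in_c_locale /path/to/eclipse/eclipse would start Eclipse with the C locale, where the path is a placeholder for wherever you unpacked Eclipse.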

Software release

Once your Java program compiles, the release should include the jar, the tessdata folder, and the compiled "linux-x86-64" folder.
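
Before handing the release over, a quick check that all three pieces are present can save a round trip. The following is a sketch only: `check_release_dir` is a hypothetical helper, and it assumes the jar, tessdata and linux-x86-64 sit side by side in one release directory.

```shell
# Hypothetical check: does a release directory contain a jar,
# a tessdata/ folder and a linux-x86-64/ folder?
check_release_dir() {
    release_dir="$1"
    missing=0
    # ls fails when the *.jar pattern matches nothing.
    if ! ls "$release_dir"/*.jar >/dev/null 2>&1; then
        echo "missing: a .jar file"
        missing=1
    fi
    for sub_dir in tessdata linux-x86-64; do
        if [ ! -d "$release_dir/$sub_dir" ]; then
            echo "missing: $sub_dir/"
            missing=1
        fi
    done
    return "$missing"
}
```

The function prints one line per missing piece and returns a non-zero status if anything is absent, so it can be used in a packaging script.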

When testing in a terminal in a similar environment, run export LC_ALL=C first, then run your command.

That's all.

Where I got stuck

-Add --disable-openmp to ./configure to avoid OpenMP linker errors. (https://github.com/tesseract-ocr/tesseract/issues/2323)
-Create a linux-x86-64 folder, put all the compiled files in it, and copy it to the resources folder instead of directly under the project folder.
-Every time you compile the Tesseract library (with the method above), you need "source /opt/rh/devtoolset-7/enable" in the terminal.
-After running "export LC_ALL=C" (you can also record it in a configuration file), start Eclipse from that same terminal.

Reference

Visionary Imaging Services, Inc. Tatsuaki Kobayashi
