[PYTHON] Convert from pdf to txt 2 [pyocr]

Introduction

Last time, I tried to convert a pdf to txt with pdfminer, but it did not work because of a problem with the target pdf. This time I solve that problem with pyocr.

Previous article https://qiita.com/ptxyasu/items/4180035bd0ccd789c858

Purpose

Extract text from pdf.

What was used

This time I extract the text with pyocr. Since pyocr reads text from images, the pdf is first converted to images with pdf2image, and the text is then extracted from those images.

[pyocr] https://gitlab.gnome.org/World/OpenPaperwork/pyocr
[pdf2image] https://github.com/Belval/pdf2image

I installed tesseract and pyocr by following these articles: https://qiita.com/nabechi6011/items/3a367ca94dbd208efcc7 https://github.com/tesseract-ocr/tesseract/wiki
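
Before running the program, it can help to confirm that the external commands are actually installed. The following is a minimal sketch, assuming that pyocr's tesseract tool calls the tesseract command and that pdf2image calls poppler's pdftoppm command, both looked up on PATH. missing_commands is a hypothetical helper name, not part of either library.

```python
import shutil


def missing_commands(commands=("tesseract", "pdftoppm")):
    """Return the commands from `commands` that are not found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("tesseract and pdftoppm were found")
```

This only checks PATH, so an installation that is configured in another way (for example a poppler directory passed explicitly to pdf2image) may be reported as missing even though it works.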

Process flow

  1. Enter the target pdf file name at the "Please input pdf name" prompt
  2. Convert the pdf file to images
  3. Create a result directory for the output
  4. Extract text from each image with pyocr
  5. Write the result to a txt file in result

The program is shown at the end of the article.
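
Steps 2 to 5 can be sketched as a single function. This is only an illustration under stated assumptions: pdf_to_txt, to_images, image_to_text, and result_dir are hypothetical names, and the two converters are passed in as arguments so that the flow can be tried without pdf2image or pyocr installed.

```python
import os


def pdf_to_txt(pdf_path, to_images, image_to_text, result_dir="result"):
    """Run steps 2-5 of the flow and return the path of the txt file.

    to_images(pdf_path) must return one image per page, and
    image_to_text(image) must return the text recognized on that image.
    """
    images = to_images(pdf_path)                       # step 2: pdf -> images
    os.makedirs(result_dir, exist_ok=True)             # step 3: result directory
    name = os.path.splitext(os.path.basename(pdf_path))[0]
    out_path = os.path.join(result_dir, name + ".txt")
    with open(out_path, "w", encoding="utf-8") as f:   # step 5: write the txt file
        for image in images:
            f.write(image_to_text(image) + "\n")       # step 4: OCR one page
    return out_path
```

In real use, to_images would be pdf2image's convert_from_path and image_to_text a small wrapper around the pyocr tool's image_to_string. Unlike the program at the end of the article, this sketch overwrites the txt file on each run instead of appending to it.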

Result

Input: object1.pdf 1.PNG

Output

object1.txt


8 CLASSES AND OBJECT-ORIENTED PROGRAMMING

We now turn our attention to our last major topic related to writing programs in
Python: using classes to organize programs around modules and data
abstractions.

Classes can be used in many different ways. In this book we emphasize using
them in the context of object-oriented programming. The key to object-
oriented programming is thinking about objects as collections of both data and
the methods that operate on that data.

The ideas underlying object-oriented programming are about forty years old, and
have been widely accepted and practiced over the last twenty years or so, In the
mid-1970s people began to write articles explaining the benefits of this approach
to programming. About the same time, the programming languages SmallTalk
(at Xerox PARC) and CLU (at MIT) provided linguistic support for the ideas. But
it wasn’t until the arrival of C++ and Java that it really took off in practice.

We have been implicitly relying on object-oriented programming throughout
most of this book. Back in Section 2.1.1 we said “Objects are the core things
that Python programs manipulate. Every object has a type that defines the
kinds of things that programs can do with objects of that type.” Since Chapter
5, we have relied heavily upon built-in types such as list and dict and the
methods associated with those types. But just as the designers of a
programming language can build in only a small fraction of the useful functions,
they can only build in only a small fraction of the useful types. We have already
looked at a mechanism that allows programmers to define new functions; we
now look at a mechanism that allows programmers to define new types.

 

8.1

Abstract Data Types and Classes

The notion of an abstract data type is quite simple. An abstract data type is a
set of objects and the operations on those objects. These are bound together so
that one can pass an object from one part of a program to another, and in doing
so provide access not only to the data attributes of the object but also to
operations that make it easy to manipulate that data.

The specifications of those operations define an interface between the abstract
data type and the rest of the program. The interface defines the behavior of the
operations—what they do, but not how they do it. The interface thus provides
an abstraction barrier that isolates the rest of the program from the data
structures, algorithms, and code involved in providing a realization of the type
abstraction.

Programming is about managing complexity in a way that facilitates change.
There are two powerful mechanisms available for accomplishing this:
decomposition and abstraction. Decomposition creates structure in a program,
and abstraction suppresses detail. The key is to suppress the appropriate

In the 10th line, the period in "years or so. In the" was recognized as a comma ("years or so, In the"), but everything else was correct. That is close to perfect.

What I could not do with pdfminer last time can be done with pyocr (tesseract)!

Because the text was extracted from images this time, I think the same approach also works for pdfs created by scanning printed matter. For now I plan to paste the extracted text into Google Translate. Next time, I want to use googletrans to translate it into Japanese programmatically.
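
A check like this can also be done by program when a correct copy of the text is available. The following is a minimal sketch, assuming both texts have the same number of lines. differing_lines and similarity are hypothetical helper names, and the sample strings are taken from the 10th line of the output above.

```python
import difflib


def differing_lines(expected, actual):
    """Return (line_number, expected_line, actual_line) for each line that differs."""
    pairs = zip(expected.splitlines(), actual.splitlines())
    return [(n, e, a) for n, (e, a) in enumerate(pairs, start=1) if e != a]


def similarity(expected, actual):
    """Return a character-level similarity between 0.0 and 1.0."""
    return difflib.SequenceMatcher(None, expected, actual).ratio()


if __name__ == "__main__":
    expected = "practiced over the last twenty years or so. In the"
    actual = "practiced over the last twenty years or so, In the"
    for n, e, a in differing_lines(expected, actual):
        print("line %d: expected %r, got %r" % (n, e, a))
    print("similarity: %.3f" % similarity(expected, actual))
```

If the OCR result drops or adds whole lines, the line-by-line comparison no longer lines up, and a diff such as difflib.unified_diff would be a better fit.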

Program

The program is published on GitHub: https://github.com/ptxyasu/pdf2text

Run the following pdf2text_pyocr.py. It converts the input pdf file to images with convert_from_path, then passes the images to pyocr_read one at a time.

pdf2text_pyocr.py


  
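import os

from pdf2image import convert_from_path
from pyocr_read import pyocr_read

path = input("Please input pdf name\n")
# Render every page of the pdf as an image.
images = convert_from_path(path)

# Use the file name without directory or extension as the output name.
# splitext also handles names that contain more than one ".".
name, _ = os.path.splitext(os.path.basename(path))
pdf2read = pyocr_read(name)

# Run OCR on each page in order.
for image in images:
    pdf2read.oneshot_read(image)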

The following pyocr_read.py is called from pdf2text_pyocr.py above. __init__() selects the pyocr tool, creates a directory to store the results, and asks for the language to recognize. Any language listed under "Available languages" can be selected, for example eng for English or jpn for Japanese. oneshot_read() then extracts the text from each image received from pdf2text_pyocr.py with pyocr and appends the output to a file.

pyocr_read.py


import os
import sys

import pyocr
import pyocr.builders


class pyocr_read(object):
    def __init__(self, path):
        self.path = path
        # Use the first OCR tool that pyocr finds (e.g. tesseract).
        tools = pyocr.get_available_tools()
        if len(tools) == 0:
            print("No OCR tool found")
            sys.exit(1)
        self.tool = tools[0]

        # Show the installed languages and let the user pick one.
        langs = self.tool.get_available_languages()
        print("Available languages: %s" % ", ".join(langs))
        self.lang = input("Please input language you want to recognize : ")

        # Create the output directory if it does not exist yet.
        os.makedirs("./result", exist_ok=True)

    def oneshot_read(self, img):
        # Recognize the text on one page image.
        txt = self.tool.image_to_string(img, lang=self.lang, builder=pyocr.builders.TextBuilder())
        print(txt)
        # Append, so that successive pages end up in the same file.
        with open("./result/" + self.path + ".txt", mode="a", encoding="utf-8") as file:
            file.write(txt + "\n")
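
The language typed at the prompt is handed to image_to_string without being checked against the "Available languages" list. As a minimal sketch of one way to guard against a typo, the hypothetical helper choose_language below (not part of pyocr) validates the input against that list and falls back to a default, here assumed to be eng.

```python
def choose_language(requested, available, default="eng"):
    """Return `requested` if the OCR tool offers it, otherwise `default`.

    `available` is the list returned by the tool's get_available_languages().
    Raises ValueError if neither the request nor the default is available.
    """
    requested = requested.strip()
    if requested in available:
        return requested
    if default in available:
        return default
    raise ValueError(
        "Neither %r nor %r is available (choices: %s)"
        % (requested, default, ", ".join(available))
    )
```

In pyocr_read.__init__ this could be used as self.lang = choose_language(input(...), langs). Whether a silent fallback or a repeated prompt is preferable depends on how the script is used.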
