[PYTHON] Extract Japanese text from PDF with PDFMiner

What I want to do

I had a large number of PDFs (10,000 or more!) and wanted a quick way to check their contents for problems, for example whether each file name matches the text inside the file.

Environment

Python 2.7, Windows 7 64-bit

Library to use

I use PDFMiner. The official website shows examples of running it from the command prompt, but for some reason I could not find any information on how to import it and use it as a library, so I struggled a little.

Installation

Extract the archive downloaded from the official site and run the following commands inside the pdfminer-20140328 folder. These are the commands for Windows.

mkdir pdfminer\cmap
python tools\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt
python tools\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt
python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt
python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
python setup.py install

If you do not install PDFMiner with this procedure, all Japanese text is output as placeholders such as (cid:0000).
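One way to notice a failed CMap installation is to measure how much of the extracted text consists of such placeholders. The following is only a minimal sketch and not part of PDFMiner: the helper names find_cids and cid_ratio are made up for illustration, and it assumes the placeholders look like (cid:1234).

```python
import re

# PDFMiner writes "(cid:NNNN)" when it cannot map a character ID to Unicode.
# Optional whitespace after the colon is tolerated just in case.
CID_PATTERN = re.compile(r"\(cid:\s*(\d+)\)")


def find_cids(text):
    """Return the character IDs of all unresolved placeholders, in order."""
    return [int(m) for m in CID_PATTERN.findall(text)]


def cid_ratio(text):
    """Return the fraction of output units that are placeholders (0.0 to 1.0).

    Each placeholder counts as one unit, as does every other
    non-whitespace character.
    """
    cids = len(find_cids(text))
    others = len("".join(CID_PATTERN.sub("", text).split()))
    total = cids + others
    return float(cids) / total if total else 0.0
```

A ratio close to 1.0 suggests that the CMap step was skipped. A handful of isolated placeholders more likely points to individual characters that could not be mapped.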

Actually extract the text

pdf2txt.py


# -*- coding: utf-8 -*-

import re
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

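# Runs of half-width and full-width (U+3000) spaces.
space = re.compile(u"[ \u3000]+")

def convert_pdf_to_txt(path, txtname, buf=True):
    # buf=True:  extract into memory, strip spaces and print the text.
    # buf=False: write the extracted text to the file txtname.
    rsrcmgr = PDFResourceManager()
    if buf:
        outfp = StringIO()
    else:
        outfp = open(txtname, 'w')
    codec = 'utf-8'
    laparams = LAParams()
    laparams.detect_vertical = True  # also detect vertically written text
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)

    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    fp.close()
    device.close()
    if buf:
        # getvalue() holds UTF-8 bytes: decode so U+3000 can match, then re-encode.
        text = space.sub(u"", outfp.getvalue().decode(codec)).encode(codec)
        print text
    outfp.close()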


convert_pdf_to_txt("TEST.pdf", "test.txt")

Prepared PDF (image: TEST.gif)

Result

An old pond
A frog jumps in
The sound of water

Cocoa*Chocolate

Cocoa is the solid content left after separating cocoa butter from cocoa mass,
or an abbreviation for the cocoa powder made from it. It is also used as an
abbreviation for the beverage made by dissolving cocoa powder. As shown in the history below,
until cocoa butter was separated from cocoa mass, the word cocoa was not used;
there was only pasty chocolate that was neither solid nor liquid.

laparams.detect_vertical seems to be an important parameter. For PDFs with vertical text or a complex layout, if it is not set to True, the Japanese text is split up character by character and the structure comes out jumbled. re.sub then removes the spaces that get in the way. After that, checking the contents is just a matter of using the in operator on the string held in memory. If you pass buf=False as an argument, the text is written to the file given as txtname instead.
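The check described above can be sketched for a whole folder as follows. This is an illustration under assumptions, not the code I actually ran: the names iter_pdf_paths, name_matches_text and find_mismatches are hypothetical, and extract_text stands for any function that takes a PDF path and returns its text as a unicode string (convert_pdf_to_txt would have to return the text instead of printing it to be used this way).

```python
import os
import re

# Runs of half-width and full-width (U+3000) spaces.
SPACES = re.compile(u"[ \u3000]+")


def iter_pdf_paths(root):
    """Yield the path of every .pdf file under root (recursive)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)


def name_matches_text(path, text):
    """True if the file name (without extension) occurs in the text."""
    stem = os.path.splitext(os.path.basename(path))[0]
    # Spaces are stripped on both sides so layout spacing does not matter.
    return SPACES.sub(u"", stem) in SPACES.sub(u"", text)


def find_mismatches(root, extract_text):
    """Return the PDFs under root whose name does not appear in their text.

    extract_text is a caller-supplied function: path -> unicode text.
    """
    return [p for p in iter_pdf_paths(root)
            if not name_matches_text(p, extract_text(p))]
```

On Python 2.7, pass root as a unicode string (for example u"C:\\pdfs") so that os.walk returns unicode file names and the in comparison does not mix byte strings with unicode.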

Digression

Characters that are rendered as variant glyphs, such as 辻 and 逗, could not be converted properly. They come out as placeholders such as (cid:7711). I do not understand CID fonts well yet, so I need to study them.

By the way, when I tried text extraction with Ghostscript, text set in fonts that are not installed on Windows was forcibly output as Shift_JIS character codes in &#; form, so the characters were garbled (I dealt with this by forcibly converting them using a Shift_JIS to Unicode correspondence table). PyPDF2 could not convert Japanese well either. (The official documentation also says "This will be refined in the future.")
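The forced conversion can be sketched roughly as follows. This is a guess at the shape of the workaround rather than the script I used: it assumes the references look like &#NNNNN; with the Shift_JIS code written as a decimal number, it uses Python's built-in cp932 codec in place of a hand-made correspondence table, and the name fix_sjis_refs is made up for illustration.

```python
import re
import struct

# Assumed form of the garbled output: "&#NNNNN;" where NNNNN is the decimal
# value of a Shift_JIS code (for example 33440 == 0x82A0).
NUMERIC_REF = re.compile(r"&#(\d+);")


def _decode_ref(match):
    code = int(match.group(1))
    try:
        if code <= 0xFF:
            raw = struct.pack("B", code)     # single-byte code
        else:
            raw = struct.pack(">H", code)    # lead byte, then trail byte
        return raw.decode("cp932")
    except (struct.error, UnicodeDecodeError):
        return match.group(0)                # leave unknown codes untouched


def fix_sjis_refs(text):
    """Replace Shift_JIS numeric references in text with Unicode characters."""
    return NUMERIC_REF.sub(_decode_ref, text)
```

Codes that do not decode as cp932 are left as they are, so the remaining references can still be found and inspected by hand.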

The site I referred to

http://stackoverflow.com/questions/26748788/extraction-of-text-from-pdf-with-pdfminer-gives-multiple-copies
