[PYTHON] Display character area from pdf pdf2xml-viewer

Anyway, it was easy, so make a note. However, only when the characters are properly registered in the pdf. Not possible for photos.

スクリーンショット 2017-03-15 14.52.21.png

Source https://github.com/WZBSocialScienceCenter/pdf2xml-viewer

Convert to xml

pdftohtml -c -hidden -xml input.pdf output.xml

display

python -m SimpleHTTPServer 8080

http://127.0.0.1:8080/ Connect to and specify the file to Load

Detailed algorithm

Extract the scanned page image and generate XML with OCR text in PDF pdftohtml Display textboxes and scan pages with pdf2xml-viewer Load the XML that describes the page and textbox Detects straight lines on scanned pages, finds and corrects page skews and rotations Discover a cluster of vertical lines to identify columns in a table Analyze the distribution of the y coordinate of the text box to find the row position in the table Create a grid of columns and lines Match the textbox to the grid and extract the table data to export it as an Excel and CSV file


Also try pdf2htmlEX

It didn't work, such as cutting out a photo. I found an article that pdf2htmlEX is more accurate than pdftohtml, so I will also try pdf2htmlEX https://github.com/coolwanglu/pdf2htmlEX

brew install pdf2htmlEX
brew install ttfautohint
brew install xpdf

If there is no unicode conversion correspondence table, an error may occur. In the first place, should I convert it to something other than uni and use it? http://d.hatena.ne.jp/jeneshicc/20091122 http://www.atmarkit.co.jp/flinux/rensai/linuxtips/736pdffont.html to unicode http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf

Recommended Posts

Display character area from pdf pdf2xml-viewer
OCR from PDF in Python
BLAST result-like character string display