This turned out to be easy, so I am making a note of it. It only works when the text is actually embedded in the PDF as characters. It does not work for photos or scanned images.
Source: https://github.com/WZBSocialScienceCenter/pdf2xml-viewer
pdftohtml -c -hidden -xml input.pdf output.xml
python -m SimpleHTTPServer 8080
(On Python 3, the equivalent is: python3 -m http.server 8080)
Open http://127.0.0.1:8080/ in a browser and specify the XML file to load.
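The XML can also be read directly, without the viewer. The following is a minimal sketch, assuming the usual pdftohtml -xml layout (page elements containing text elements with top, left, width and height attributes). The sample XML string and the function name read_textboxes are made up for illustration and are not part of pdf2xml-viewer.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape that pdftohtml -xml typically produces.
SAMPLE_XML = """<pdf2xml>
  <page number="1" top="0" left="0" height="1263" width="892">
    <text top="100" left="50" width="80" height="12" font="0">Name</text>
    <text top="100" left="200" width="60" height="12" font="0"><b>Price</b></text>
    <text top="130" left="50" width="70" height="12" font="0">Apple</text>
  </page>
</pdf2xml>"""


def read_textboxes(xml_text):
    """Return one dict per text box with its page number, geometry and text."""
    root = ET.fromstring(xml_text)
    boxes = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for t in page.iter("text"):
            boxes.append({
                "page": page_no,
                "top": int(t.get("top")),
                "left": int(t.get("left")),
                "width": int(t.get("width")),
                "height": int(t.get("height")),
                # itertext() also collects text nested inside <b>, <i> or <a>.
                "text": "".join(t.itertext()).strip(),
            })
    return boxes


boxes = read_textboxes(SAMPLE_XML)
for b in boxes:
    print(b["page"], b["left"], b["top"], b["text"])
```

For a real file, ET.parse("output.xml").getroot() can replace ET.fromstring in the same loop.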
The overall workflow for extracting table data from a scanned PDF:
1. Extract the scanned page images and generate XML containing the PDF's OCR text with pdftohtml.
2. Display the text boxes and scanned pages with pdf2xml-viewer.
3. Load the XML that describes the pages and text boxes.
4. Detect straight lines on the scanned pages, then find and correct page skew and rotation.
5. Find clusters of vertical lines to identify the columns of the table.
6. Analyze the distribution of the text boxes' y coordinates to find the row positions of the table.
7. Create a grid of columns and rows.
8. Match the text boxes to the grid, extract the table data, and export it as Excel and CSV files.
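The clustering and grid-matching steps can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the implementation used by any particular tool: columns are estimated from the left coordinates of the text boxes instead of from detected vertical lines, and the clustering is a plain gap threshold. The function names, the max_gap value and the sample boxes are all hypothetical.

```python
import csv
import io


def cluster_1d(values, max_gap):
    """Group sorted values; a gap larger than max_gap starts a new cluster.

    Returns the mean position of each cluster, in ascending order.
    """
    centers = []
    group = []
    for v in sorted(values):
        if group and v - group[-1] > max_gap:
            centers.append(sum(group) / len(group))
            group = []
        group.append(v)
    if group:
        centers.append(sum(group) / len(group))
    return centers


def nearest_index(centers, value):
    """Index of the cluster center closest to value."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


def boxes_to_grid(boxes, max_gap=5):
    """Place each text box into a row/column grid based on its top/left."""
    row_pos = cluster_1d([b["top"] for b in boxes], max_gap)
    col_pos = cluster_1d([b["left"] for b in boxes], max_gap)
    grid = [["" for _ in col_pos] for _ in row_pos]
    for b in boxes:
        r = nearest_index(row_pos, b["top"])
        c = nearest_index(col_pos, b["left"])
        # Several boxes may land in one cell; join them with a space.
        grid[r][c] = (grid[r][c] + " " + b["text"]).strip()
    return grid


# Hypothetical text boxes with slightly jittered coordinates.
boxes = [
    {"top": 100, "left": 50, "text": "Name"},
    {"top": 101, "left": 200, "text": "Price"},
    {"top": 130, "left": 51, "text": "Apple"},
    {"top": 129, "left": 202, "text": "120"},
]
grid = boxes_to_grid(boxes)

# Export the grid as CSV text.
buf = io.StringIO()
csv.writer(buf).writerows(grid)
print(buf.getvalue())
```

A fixed gap threshold is the simplest choice and needs tuning per document. Real pages with skew, merged cells or multi-line cells would need more than this.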
It did not work well for some things, such as cutting out photos. I found an article saying that pdf2htmlEX is more accurate than pdftohtml, so I will try pdf2htmlEX as well.
https://github.com/coolwanglu/pdf2htmlEX
brew install pdf2htmlEX
brew install ttfautohint
brew install xpdf
If a font has no Unicode mapping table (ToUnicode), an error may occur. Should I be converting to something other than Unicode in the first place?
http://d.hatena.ne.jp/jeneshicc/20091122
http://www.atmarkit.co.jp/flinux/rensai/linuxtips/736pdffont.html
ToUnicode reference: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf
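One way to check in advance which fonts lack a ToUnicode map is the pdffonts command that comes with xpdf and poppler. Its "uni" column shows whether a font has an explicit ToUnicode map. The sketch below parses that output. The sample output and the function name fonts_without_tounicode are made up for illustration, and the parsing assumes font names contain no spaces.

```python
# Hypothetical pdffonts output; real output depends on the PDF and tool version.
SAMPLE_OUTPUT = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ABCDEF+Helvetica                     Type 1C           WinAnsi          yes yes no      12  0
GHIJKL+MS-Mincho                     CID TrueType      Identity-H       yes yes yes     15  0
"""


def fonts_without_tounicode(pdffonts_output):
    """Return the names of fonts whose 'uni' column is 'no'."""
    missing = []
    # Skip the header line and the dashed separator line.
    for line in pdffonts_output.splitlines()[2:]:
        tokens = line.split()
        if len(tokens) < 6:
            continue  # blank or malformed line
        # Count from the right: the last two tokens are the object ID,
        # preceded by the uni, sub and emb flags.
        uni = tokens[-3]
        if uni == "no":
            missing.append(tokens[0])
    return missing


missing = fonts_without_tounicode(SAMPLE_OUTPUT)
print(missing)
```

To run it against a real file, the output of subprocess.run(["pdffonts", "input.pdf"], capture_output=True, text=True).stdout can be passed in place of the sample.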