[PYTHON] Convert a large number of PDF files to text files using pdfminer

Today's menu

I want to read a large number of English PDF files, but I don't know enough English words in the first place. So the plan for now is to convert the PDF files to text files, list the words they contain, and memorize the most frequently used words starting from the top of the list. Surely that will let me read faster! Or so I have decided to believe.

That's why I decided to throw a lot of English PDF files into a pot, boil them, and convert them to text files. It feels like boiling a huge amount of soba for a wanko soba tournament.

Countertop environment

macOS, Python 3.6, Anaconda

Foodstuff

A large number of pdf files that are difficult to digest

Kitchenware

pdfminer (see the reference URLs at the end for how to install it), os, re

It seems that pdfminer gives better results than PyPDF2.

What to expect as a cooking failure

Please note that this script (probably) does not support Japanese text.

Today's pot

PdfToTextConverter.py


#! python3
# PdfToTextConverter.py
#Read the contents of a PDF file and output it as a text file

import os
import re
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

pdf_folder_path = os.getcwd()                                      #Get the path of the current folder
text_folder_path = os.path.join(pdf_folder_path, 'text_folder')    #os.path.join picks the right separator on both macOS and Windows

os.makedirs(text_folder_path, exist_ok=True)
pdf_file_name = os.listdir(pdf_folder_path)

#Returns True if name is a PDF file (ends with .pdf, in upper or lower case), otherwise returns False.
#This post was quoted and partially changed → http://qiita.com/korkewriya/items/72de38fc506ab37b4f2d
def pdf_checker(name):
    pdf_regex = re.compile(r'.+\.pdf$', re.IGNORECASE)    #$ anchors the match to the end of the name
    return bool(pdf_regex.search(str(name)))

#Convert a PDF file to text
#buf=True: collect the text in memory, then save it under text_folder_path
#buf=False: write the text directly to txtname
def convert_pdf_to_txt(path, txtname, buf=True):
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    if buf:
        outfp = StringIO()
    else:
        outfp = open(txtname, 'w', encoding=codec)    #file() does not exist in Python 3, so use open()
    laparams = LAParams()
    laparams.detect_vertical = True
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)

    #The with block closes the PDF file even if a page fails to parse
    with open(path, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    device.close()
    if buf:
        text = outfp.getvalue()
        new_text_path = os.path.join(text_folder_path, path + '.txt')
        with open(new_text_path, 'w', encoding=codec) as make_new_text_file:
            make_new_text_file.write(text)
    outfp.close()

#Go through the file names in the folder and convert each PDF file
for name in pdf_file_name:
    if pdf_checker(name):                          #Only names ending in .pdf proceed to conversion
        convert_pdf_to_txt(name, name + '.txt')    #Anything else is skipped

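The step that picks out the PDF files can also be written without a regular expression. The following is only a minimal sketch using the standard pathlib module, not part of the original script; the helper names find_pdfs and text_path_for are made up for this illustration. It does no conversion itself, it only shows which files would be selected and where each text file would go.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files directly inside folder, sorted by name.

    Hypothetical helper: a file counts as a PDF when its last suffix is
    .pdf in any letter case. Directories are skipped.
    """
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def text_path_for(pdf_path, text_folder):
    """Map foo.pdf to <text_folder>/foo.pdf.txt, the naming used by the script."""
    return Path(text_folder) / (Path(pdf_path).name + ".txt")
```

For example, looping over find_pdfs(os.getcwd()) and passing each result to text_path_for(pdf, 'text_folder') gives the same input and output pairs that the loop in the script works through.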

Finished product

A large number of text files that are likely to cause stomach upset

Next cooking

Move the text files to a bowl and extract about 500 of the most frequently used words, then memorize what they mean. (I don't know whether this actually helps with reading English quickly.)
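As a taste of that next dish, here is a rough sketch of how the counting might look with collections.Counter from the standard library. It is an assumption about the approach, not the finished recipe: the function names count_words and top_words and the simple "runs of letters" definition of a word are made up for this example, and stop words such as "the" are not filtered out.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a "word" is a run of ASCII letters, optionally with one apostrophe part (don't)
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_words(text, counter=None):
    """Add the lowercased words found in text to counter and return it."""
    counter = Counter() if counter is None else counter
    counter.update(WORD_RE.findall(text.lower()))
    return counter


def top_words(folder, n=500):
    """Count words over every .txt file in folder and return the n most common."""
    counter = Counter()
    for txt in sorted(Path(folder).glob("*.txt")):
        # errors="ignore" skips bytes that are not valid UTF-8 instead of stopping
        count_words(txt.read_text(encoding="utf-8", errors="ignore"), counter)
    return counter.most_common(n)
```

Calling top_words('text_folder') would return a list of (word, count) pairs, most frequent first, for the text files produced above.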

References, reference URL

http://qiita.com/korkewriya/items/72de38fc506ab37b4f2d → The part that converts a pdf file to a text file is quoted (partially modified) from this article.

https://kusanohitoshi.blogspot.jp/2017/01/python3cstringiostringio.html → See this page for how to deal with the StringIO import error.

"Let Python do the boring things" (the Japanese edition of "Automate the Boring Stuff with Python") → How to use the os module

http://www.unixuser.org/%7Eeuske/python/pdfminer/index.html → pdfminer page

https://github.com/conda-forge/pdfminer-feedstock https://conda-forge.org/feedstocks → See these pages for how to install pdfminer in the Anaconda environment.
