[PYTHON] Convert PDF attached to email to text format

I receive email with a PDF attached from a certain company.

A mail service that can search inside PDFs, such as Gmail, would also work, but I did not want to hand the mail to an outside service, and I wanted to take some action of my own when a specific keyword turned up, so I tried Python's pyPdf.

First, the mail setup. Set the mail client you normally use to "Leave mail on the server", then fetch the mail that remains on the server with fetchmail and procmail. I used the versions in the Cygwin environment and wrote the following configuration files.

.fetchmailrc


defaults
  fetchall
  keep
  mda "/usr/bin/procmail"

poll pop.xxx.com
  protocol pop3
  port 110
  username "XXXXX"
  password "XXXXX"

.procmailrc


MAILDIR=$HOME/Mail/
DEFAULT=/dev/null

:0 H
* ^From:.*[email protected]
/var/spool/mail/t.uehara/

With this, mail from test@example is stored under /var/spool/mail, and all other mail goes to /dev/null (that is, it is discarded). Because keep is set on the fetchmail side, the messages stay on the server either way. The storage location can be anywhere, but /var/spool/mail makes it convenient to read the mail with Mutt and similar tools, so I used that.
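
To make the recipe's logic concrete, the decision procmail takes for each message can be sketched in Python. This is only an illustration: the function name route and the address sender@example.com are hypothetical stand-ins, and the real filtering is done by procmail, not by this code.

```python
import re
from email import message_from_string

# Hypothetical sender pattern, standing in for the address in .procmailrc.
# procmail matches case-insensitively unless the D flag is given.
FROM_PATTERN = re.compile(r"sender@example\.com", re.IGNORECASE)


def route(raw_message, pattern=FROM_PATTERN):
    """Return where a message would go under a rule like the one above."""
    msg = message_from_string(raw_message)
    # The "H" flag tells procmail to match against the headers only.
    if pattern.search(msg.get("From", "")):
        return "/var/spool/mail/t.uehara/"  # matching mail is stored
    return "/dev/null"  # DEFAULT: everything else is discarded


wanted = "From: Sender <sender@example.com>\nSubject: report\n\nbody\n"
other = "From: someone@example.org\nSubject: hello\n\nbody\n"
print(route(wanted))
print(route(other))
```

The first message is routed to the Maildir and the second to /dev/null, which mirrors the two outcomes described above.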

Next, download pyPdf and install it with python setup.py install.

http://pybrary.net/pyPdf/
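
pyPdf targets Python 2 and is no longer maintained; its line of development continues in the pypdf package. As a rough sketch, assuming Python 3 with pypdf installed (pip install pypdf), the kind of text extraction pyPdf is used for in this article might look like the following. The function names normalize and pdf_to_text are made up for this example.

```python
def normalize(text):
    """Replace non-breaking spaces and collapse runs of whitespace."""
    return " ".join(text.replace("\xa0", " ").split())


def pdf_to_text(path):
    """Extract the text of every page of the PDF at path."""
    # Imported here so that normalize() stays usable without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return normalize("\n".join(pages))


print(normalize("Invoice\xa0No.  42\n\nTotal:   100"))
```

The quality of the extracted text depends heavily on how the PDF was produced, so the result is best treated as searchable raw text rather than a faithful copy of the layout.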

Borrowing from the pyPdf sample program and from the email package's example of writing out attachments, I wrote a script that saves each PDF attachment to a suitable folder and generates, next to it, a file with the text extracted from that PDF.

pdfmail.py


import os
import sys

import email
import mailbox
import mimetypes

import pyPdf

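def pdfmail(msgfile):
    # Parse one message file from the Maildir.
    fp = open(msgfile)
    msg = email.message_from_file(fp)
    fp.close()

    # Make sure the output folder exists.
    if not os.path.isdir('pdf'):
        os.makedirs('pdf')

    counter = 1
    for part in msg.walk():
        # multipart/* parts are only containers, so skip them.
        if part.get_content_maintype() == 'multipart':
            continue
        fname = part.get_filename()
        if not fname:
            # No filename given: build one from the MIME type.
            # (get_type() no longer exists in Python 2.7.)
            ext = mimetypes.guess_extension(part.get_content_type())
            if not ext:
                ext = '.bin'
            fname = 'part-%03d%s' % (counter, ext)
        counter += 1

        # Only PDF attachments are of interest.
        if '.pdf' in fname.lower():

            print fname

            # Drop any directory component from the sender-supplied name.
            pdfpath = os.path.join('pdf', os.path.basename(fname))

            # Save the decoded attachment.
            fp = open(pdfpath, 'wb')
            fp.write(part.get_payload(decode=True))
            fp.close()

            # Save the extracted text next to it. Non-ASCII characters
            # are written as XML character references (&#...;).
            c = getPDFContent(pdfpath).encode("ascii", "xmlcharrefreplace")
            fp = open(pdfpath + ".txt", 'wb')
            fp.write(c)
            fp.close()


def getPDFContent(path):
    # Concatenate the text of every page, then collapse whitespace.
    content = ""
    f = open(path, "rb")
    pdf = pyPdf.PdfFileReader(f)
    for i in range(0, pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + "\n"
    f.close()
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content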

if __name__ == '__main__':

    maildir = '/var/spool/mail/t.uehara'
    m = mailbox.Maildir(maildir)
    for key in m.keys():
        pdfmail(maildir+'/new/'+key)
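
The script above is written for Python 2. For reference, the attachment-saving part could be sketched on Python 3 roughly as follows, using only the standard library email package. The function name save_pdf_attachments and the sample message are illustrative and not part of the original script.

```python
import os
import tempfile
from email import message_from_bytes
from email.message import EmailMessage


def save_pdf_attachments(raw_bytes, outdir):
    """Write every PDF attachment in the message to outdir; return the paths."""
    msg = message_from_bytes(raw_bytes)
    saved = []
    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue  # containers carry no payload of their own
        fname = part.get_filename()
        if not fname or not fname.lower().endswith(".pdf"):
            continue
        # basename() guards against path components in the supplied name.
        path = os.path.join(outdir, os.path.basename(fname))
        with open(path, "wb") as fp:
            fp.write(part.get_payload(decode=True))
        saved.append(path)
    return saved


# Build a small test message instead of reading a real mailbox.
sample = EmailMessage()
sample["From"] = "sender@example.com"
sample["Subject"] = "monthly report"
sample.set_content("See the attached file.")
sample.add_attachment(b"%PDF-1.4 dummy", maintype="application",
                      subtype="pdf", filename="report.pdf")

outdir = tempfile.mkdtemp()
paths = save_pdf_attachments(sample.as_bytes(), outdir)
print([os.path.basename(p) for p in paths])
```

Unlike the original, this sketch skips parts that have no filename instead of inventing one, since only named PDF attachments matter here.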

Finally, create a batch file like the following and run it periodically with the Windows Task Scheduler.

pdfmail.bat


C:\cygwin\bin\bash --login -i -c "fetchmail"
C:\cygwin\bin\python2.7.exe pdfmail.py
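
With the text files in place, reacting to a keyword comes down to scanning them. Here is a minimal sketch on Python 3, assuming the .txt files sit in the pdf folder written by pdfmail.py. The function name find_keywords and the sample keywords are placeholders.

```python
import glob
import html
import os
import tempfile


def find_keywords(textdir, keywords):
    """Map each .txt file under textdir to the keywords it contains."""
    hits = {}
    for path in sorted(glob.glob(os.path.join(textdir, "*.txt"))):
        with open(path, encoding="ascii") as fp:
            # Undo the xmlcharrefreplace encoding used when the file was written.
            text = html.unescape(fp.read())
        found = [kw for kw in keywords if kw in text]
        if found:
            hits[os.path.basename(path)] = found
    return hits


# Demonstration with a temporary folder standing in for pdf/.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "report.pdf.txt"), "w", encoding="ascii") as fp:
    fp.write("Payment overdue: &#35531;&#27714;&#26360; attached")
print(find_keywords(demo, ["overdue", "\u8acb\u6c42\u66f8", "refund"]))
```

Whatever should happen on a hit (a notification, a copy to another folder, and so on) can then be driven from the returned dictionary.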
