[Python] Convert PDF text to CSV page by page (2/24 postscript)

Reference: Extract Japanese text from PDF with PDFMiner

This is almost exactly the method from the reference above. I haven't done anything particularly novel.

What to use

A library called PDFMiner. It installs with a single pip command.

pip install pdfminer.six

The reference site includes a step for Japanese support, but installing with pip alone was enough for Japanese text to be detected properly.

CSV to make

- The "Update date and time" column holds the date and time the CSV was created.
- The "Sentence" column holds the text extracted from the PDF.
- The "page number" column holds the PDF page number.
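To make the target layout concrete, here is a minimal sketch that uses only the standard library. The sample page texts, the fixed timestamp, and the helper name `build_rows` are made up for illustration; the column names follow the ones used in the script in this post.

```python
import csv
import datetime
import io

COLUMNS = ["Update date and time", "Sentence", "page number"]

def build_rows(page_texts, now):
    """Return one row per page; page numbers start at 1."""
    return [
        {
            "Update date and time": now.isoformat(sep=" "),
            "Sentence": text,
            "page number": number,
        }
        for number, text in enumerate(page_texts, start=1)
    ]

# Hypothetical sample: the text of a two-page PDF and an arbitrary timestamp.
sample_pages = ["Text of page one", "Text of page two"]
rows = build_rows(sample_pages, datetime.datetime(2000, 1, 1, 9, 0, 0))

# Write to an in-memory buffer so the sketch leaves no file behind.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Every row carries the same timestamp, because it records when the CSV was created rather than anything about the individual page.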

What was made

About 90% of this source comes from the reference site.

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

import re,datetime
import pandas as pd

class converter(object):
  def pdf_to_csv(self,p_d_f):
    df = pd.DataFrame(columns=["Update date and time","Sentence","page number"])
  
    #PDF text extraction from here
    cnt = 1
    #Half-width space and, presumably, the full-width space (U+3000) used in Japanese text
    space = re.compile("[ \u3000]+")
    fp = open(p_d_f, 'rb')
        
    for page in PDFPage.get_pages(fp):
      #Re-initialize the extraction objects for every page
      rsrcmgr = PDFResourceManager()
      outfp = StringIO()
      codec = 'utf-8'
      laparams = LAParams()
      laparams.detect_vertical = True
      device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
      interpreter = PDFPageInterpreter(rsrcmgr, device)
            
      interpreter.process_page(page)
      text = re.sub(space, "", outfp.getvalue())

      df.loc[cnt,["Sentence","page number"]] = [text,cnt]
      cnt += 1

      #Close this page's converter and buffer before moving on
      device.close()
      outfp.close()

    fp.close()
         
    now = datetime.datetime.now()
    df["Update date and time"] = now

    csv_path = p_d_f.replace('.pdf', '.csv')
    df.to_csv(csv_path, encoding='CP932', index=False)

if __name__ == "__main__":
  p_d_f = "Somehow.pdf"
  con=converter()
  con.pdf_to_csv(p_d_f)  #Writes Somehow.csv; the method returns nothing
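The whitespace removal step in the script can be tried on its own. The following is a small sketch, assuming the pattern is meant to cover both the half-width space and the full-width space (U+3000); the helper name `strip_spaces` and the sample string are made up for illustration.

```python
import re

# Assumption: the pattern matches runs of half-width and full-width (U+3000) spaces.
SPACE = re.compile("[ \u3000]+")

def strip_spaces(text):
    """Remove every run of half-width or full-width spaces; newlines are kept."""
    return SPACE.sub("", text)

# Hypothetical sample mixing both kinds of space and a line break.
sample = "PDF\u3000から  テキスト を抽出\n2行目"
cleaned = strip_spaces(sample)
print(cleaned)
```

Because every space is removed, English words in the text are joined together as well. That is usually acceptable for Japanese text, but worth keeping in mind for mixed-language PDFs.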

The difference from the reference site is that the buffer (outfp) holding the text extracted from the PDF is re-initialized for every page, before that page's text is put into the data frame. If the buffer is created only once, as on the reference site, the text of all pages keeps accumulating in it. Once the data is in a data frame, the rest is straightforward, so adding a few small columns should be quick.
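The effect of where the buffer is created can be seen without PDFMiner at all. This is a minimal sketch using only `io.StringIO`; the strings in `pages` are hypothetical stand-ins for the text that would be extracted from each page.

```python
from io import StringIO

# Hypothetical stand-ins for the text extracted from three pages.
pages = ["page 1 text", "page 2 text", "page 3 text"]

# One shared buffer: each read returns everything written so far.
shared = StringIO()
accumulated = []
for text in pages:
    shared.write(text)
    accumulated.append(shared.getvalue())
shared.close()

# A fresh buffer per page: each read returns only that page's text.
per_page = []
for text in pages:
    buf = StringIO()
    buf.write(text)
    per_page.append(buf.getvalue())
    buf.close()

print(accumulated[2])  # text of all three pages, concatenated
print(per_page[2])     # text of the third page only
```

With the shared buffer, the row for the last page would contain the whole document, which is why the script creates a new buffer inside the page loop.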

Perhaps because it is so easy, a search did not immediately turn up a ready-made way to do the CSV conversion, so I am writing this down as a note.

2/24 postscript

For no particular reason, this now has a follow-up post: [Python] Continued-Convert PDF text to CSV page by page
