[PYTHON] Translating English PDFs into Japanese

I want to translate PDFs in bulk

When you want to translate a PDF, the built-in document translation in Google Translate comes to mind first, but it has a file size limit, which makes it unusable for some PDFs. In the previous article, I called the Google Translate API to translate English text into Japanese. If the text you want to read is in a PDF, you can extract it with Google Docs or Adobe Acrobat, but that takes a lot of manual steps. It turns out that Python can handle this part too, using a library called PDFMiner, so the whole job can be scripted. I relied heavily on the following article: [PDFMiner] Extracting text from PDF

The scripts I created are here: https://github.com/KanikaniYou/translate_pdf

They extract the text of every PDF in a given folder, translate it, and write the result to text files.

The text extracted by PDFMiner also contains strings you don't need (for example, runs of a repeated symbol such as "......", which appear in tables of contents). For that reason, the extracted text is first saved as intermediate files, so that you can remove the unnecessary parts by hand before translating.

The overall flow of the translation work is as follows.

  0. Get a Google Translate API key by referring to "Quickstart | Google Cloud Translation API Documentation | Google Cloud Platform"
  1. Extract text from the PDFs and save it as text files (pdf_to_txt.py)
  2. Format the text and save it as new text files with the line breaks removed (let_translatable.py)
  3. Translate the English text into Japanese and save it as text files (translate_en_jp.py)

As mentioned above, by looking through the text files by hand after step 1, you can check what PDFMiner extracted and keep only the parts you need, so that you don't call the Google Translate API more than necessary.

Environment

Any Linux system with Python 3 should work. My environment is Cloud9 on Ubuntu 18.04.

pip install pdfminer.six

By the way, PDFMiner is very useful, but garbled characters tend to occur when extracting Japanese and similar text. This time I am only extracting English, so it should rarely be a problem.

Known bug related to Japanese extraction in PDFMiner: Still have issues with CID Characters #39
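
When PDFMiner cannot map a glyph to a Unicode character, it writes a marker of the form "(cid:123)" into the extracted text. As a rough check for garbled output, you could count those markers in a text file after extraction. This is only a sketch; the function name `count_cid_markers` is my own and is not part of the repository.

```python
import re

# Marker that PDFMiner emits for a glyph it could not map to Unicode.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def count_cid_markers(text):
    """Return how many (cid:N) markers appear in the extracted text."""
    return len(CID_PATTERN.findall(text))


sample = "Annual report (cid:12)(cid:345) 2019"
print(count_cid_markers(sample))
```

A file with many markers is probably not worth sending to the translation API as is.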

git clone https://github.com/KanikaniYou/translate_pdf
cd translate_pdf

The file structure is shown below. (For the sake of explanation, I have already placed the 10 PDFs I want to translate in pdf_source.)

.
├── eng_txt
├── eng_txt_split
├── jpn_txt
├── let_translatable.py
├── pdf_source
│   ├── report_1.pdf
│   ├── report_10.pdf
│   ├── report_2.pdf
│   ├── report_3.pdf
│   ├── report_4.pdf
│   ├── report_5.pdf
│   ├── report_6.pdf
│   ├── report_7.pdf
│   ├── report_8.pdf
│   └── report_9.pdf
├── pdf_to_txt.py
└── translate_en_jp.py
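
The scripts read from and write to the folders pdf_source, eng_txt_split, eng_txt, and jpn_txt, so these must exist before you run anything. If any of them is missing in your checkout (Git does not track empty directories), a small helper along these lines could create them. It is a sketch and not part of the repository; `ensure_work_dirs` is a name I made up.

```python
import os

# Folder names taken from the file structure above.
WORK_DIRS = ("pdf_source", "eng_txt_split", "eng_txt", "jpn_txt")


def ensure_work_dirs(base_dir="."):
    """Create any missing working folder under base_dir and return the names created."""
    created = []
    for name in WORK_DIRS:
        path = os.path.join(base_dir, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(name)
    return created
```

Calling `ensure_work_dirs()` from the translate_pdf folder before step 1 is enough; calling it again does nothing.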

1. Extract text

pdf_to_txt.py


import os
import re

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTContainer, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

# Shared PDFMiner objects, reused for every page of every PDF.
laparams = LAParams(detect_vertical=True)
resource_manager = PDFResourceManager()
device = PDFPageAggregator(resource_manager, laparams=laparams)
interpreter = PDFPageInterpreter(resource_manager, device)


def find_textboxes_recursively(layout_obj):
    """Return every LTTextBox found in layout_obj and its children."""
    if isinstance(layout_obj, LTTextBox):
        return [layout_obj]

    if isinstance(layout_obj, LTContainer):
        boxes = []
        for child in layout_obj:
            boxes.extend(find_textboxes_recursively(child))
        return boxes
    return []


def pdf_read_controller(filepath):
    """Extract the text of one PDF; return "no-text" if the file cannot be read."""
    try:
        text_in_pdf = ""

        with open(filepath, 'rb') as f:
            for page in PDFPage.get_pages(f):
                try:
                    interpreter.process_page(page)
                    layout = device.get_result()

                    # Sort the boxes top to bottom, then left to right.
                    boxes = find_textboxes_recursively(layout)
                    boxes.sort(key=lambda b: (-b.y1, b.x0))

                    text_in_page = ""
                    for box in boxes:
                        text_in_box = box.get_text().strip()
                        # Collapse runs of spaces into a single space.
                        text_in_box = re.sub(r' {2,}', " ", text_in_box)
                        # End each box with a line break so that words from
                        # neighbouring boxes are not glued together.
                        text_in_page += text_in_box + "\n"
                    text_in_pdf += text_in_page
                except Exception as e:
                    # Report the problem and carry on with the next page.
                    print(e)

        return text_in_pdf

    except Exception as e:
        print(e)
        print("error: " + filepath)
        return "no-text"


def make_txtfile(folder_path, file_name, text='error'):
    """Write text to folder_path/file_name unless extraction failed."""
    if text != "no-text":
        with open(os.path.join(folder_path, file_name), mode='w', encoding='utf-8') as f:
            f.write(text)


if __name__ == '__main__':
    for file_name in os.listdir("pdf_source"):
        if file_name.endswith(".pdf"):
            print(file_name)
            text_in_pdf = pdf_read_controller("pdf_source/" + file_name)
            # report_1.pdf -> report_1.txt
            txt_name = os.path.splitext(file_name)[0] + ".txt"
            make_txtfile("eng_txt_split", txt_name, text_in_pdf)

This reads every PDF file in the pdf_source folder and writes its text to a text file in the eng_txt_split folder.
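
The script sorts the text boxes of each page with the key `(-b.y1, b.x0)`. PDF coordinates start at the bottom-left corner of the page, so a larger `y1` (the top edge of a box) means the box sits higher up; negating it puts the boxes in top-to-bottom order, and `x0` breaks ties from left to right. The sketch below shows this with a stand-in `Box` type of my own instead of PDFMiner's `LTTextBox`, and the coordinates are made up.

```python
from collections import namedtuple

# Stand-in for pdfminer's LTTextBox: only the attributes used by the sort key.
Box = namedtuple("Box", ["text", "x0", "y1"])


def reading_order(boxes):
    """Sort boxes top to bottom, then left to right (PDF y grows upward)."""
    return sorted(boxes, key=lambda b: (-b.y1, b.x0))


boxes = [
    Box("footer", x0=50, y1=40),
    Box("right column", x0=300, y1=700),
    Box("title", x0=50, y1=780),
    Box("left column", x0=50, y1=700),
]
ordered = [b.text for b in reading_order(boxes)]
print(ordered)
```

Note that on a two-column page whose boxes start at different heights, this ordering can interleave the two columns, which is one more reason to check the extracted text by hand.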

 $ python pdf_to_txt.py
report_3.pdf
report_7.pdf
report_2.pdf
report_1.pdf
unpack requires a buffer of 10 bytes
unpack requires a buffer of 8 bytes
report_5.pdf
report_9.pdf
report_8.pdf
unpack requires a buffer of 6 bytes
unpack requires a buffer of 6 bytes
unpack requires a buffer of 4 bytes
report_4.pdf
report_6.pdf
report_10.pdf

Some errors are printed, but the script ignores them and still creates the text files. PDF is a complicated format. A similar error: [struct.error: unpack requires a string argument of length 16](https://stackoverflow.com/questions/40158637/struct-error-unpack-requires-a-string-argument-of-length-16)
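
Messages of this kind come from Python's `struct` module, which raises `struct.error` when the binary data it is asked to unpack is shorter than the format requires. The sketch below only reproduces that kind of error with the standard library; I have not confirmed which part of each PDF triggers it inside PDFMiner. `try_unpack` is a helper name of my own.

```python
import struct


def try_unpack(fmt, data):
    """Return the unpacked tuple, or the error message if data does not fit fmt."""
    try:
        return struct.unpack(fmt, data)
    except struct.error as e:
        return str(e)


# Five big-endian unsigned shorts need 10 bytes; only 4 are supplied here.
print(try_unpack(">5H", b"\x00\x01\x00\x02"))
# Supplying all 10 bytes succeeds.
print(try_unpack(">5H", bytes(10)))
```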

1-2. Visually check the text (optional)

For example, some of the extracted text contained parts that do not need translating, such as the table-of-contents strings mentioned earlier. To avoid calling the Google Translate API needlessly, delete the parts you don't need.
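
Part of this cleanup could be automated. As a minimal sketch, assuming the unwanted lines are the ones containing a long run of one repeated symbol (such as the dots in a table of contents), the function below drops those lines. The name `strip_leader_lines`, the set of symbols, and the threshold of five repeats are my own choices, so review the result by hand before translating.

```python
import re

# Five or more of the same symbol in a row, optionally separated by single
# spaces, e.g. "........" or ". . . . . .".
LEADER_RUN = re.compile(r"([.\-_·•])(?:\s?\1){4,}")


def strip_leader_lines(text, pattern=LEADER_RUN):
    """Drop every line that contains a long run of one repeated symbol."""
    kept = [line for line in text.splitlines() if not pattern.search(line)]
    return "\n".join(kept)


sample = (
    "1 Introduction ........ 3\n"
    "This is body text. It stays.\n"
    "2 Methods . . . . . . 7\n"
    "Wait... what?"
)
print(strip_leader_lines(sample))
```

An ordinary three-dot ellipsis is kept, because it is shorter than the threshold.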

2. Text formatting

let_translatable.py


import os

if __name__ == '__main__':
    for file_name in os.listdir("eng_txt_split"):
        if file_name.endswith(".txt"):
            print(file_name)
            with open("eng_txt_split/" + file_name, encoding='utf-8') as f:
                # Join the lines with a single space so that the last word of
                # one line is not glued to the first word of the next.
                text = " ".join(line.strip() for line in f if line.strip())

            path_w = "eng_txt/" + file_name
            with open(path_w, mode='w', encoding='utf-8') as f:
                f.write(text)

The text that PDFMiner produces is full of line breaks, and Google Translate does not seem to translate it well as is. So this script creates new text files with the line breaks removed and writes them to the eng_txt folder.
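
One thing that plain line joining does not handle is a word hyphenated across a line break. As a sketch of one way to deal with it (not what let_translatable.py does), the function below first re-joins such words and then turns every remaining line break into a single space. `unwrap_lines` is a name of my own.

```python
import re


def unwrap_lines(text):
    """Join hard-wrapped lines into a single line of text."""
    # "transla-\ntion" -> "translation": a hyphen at the end of a line,
    # between two lowercase letters, is treated as a line-wrap hyphen.
    text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)
    # Every remaining run of whitespace, line breaks included, becomes one space.
    return re.sub(r"\s+", " ", text).strip()


print(unwrap_lines("PDF transla-\ntion is\nhard.\n"))
```

The trade-off is that a genuine compound that happens to break at its hyphen (for example "well-" at the end of a line and "known" on the next) loses the hyphen.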

$ python let_translatable.py
report_4.txt
report_10.txt
report_2.txt
report_6.txt
report_9.txt
report_5.txt
report_8.txt
report_7.txt
report_3.txt
report_1.txt

3. Translate from English to Japanese!

Finally, the resulting text is translated. For the details, please refer to the article mentioned above.

translate_en_jp.py


import json
import os
import time

import requests

API_key = '<Enter your API key here>'
API_URL = 'https://www.googleapis.com/language/translate/v2'


def post_text(text):
    """Send one chunk of English text to the API and return the raw JSON response."""
    item_data = {
        'target': 'ja',
        'source': 'en',
        'format': 'text',  # treat the input as plain text; the API default is HTML
        'q': text
    }
    response = requests.post(API_URL, params={'key': API_key}, data=item_data)
    return response.text


def jsonConversion(jsonStr):
    """Pull the translated string out of the API's JSON response."""
    data = json.loads(jsonStr)
    return data["data"]["translations"][0]["translatedText"]


def split_text(text):
    """Translate text in chunks of roughly 1,000 characters, split at periods."""
    # Drop empty fragments so that no empty request is sent to the API.
    sen_list = [sen.strip() for sen in text.split('.') if sen.strip()]

    translated = []
    to_google_sen = ""
    for sen in sen_list:
        to_google_sen += sen + '. '
        if len(to_google_sen) > 1000:
            translated.append(jsonConversion(post_text(to_google_sen)))
            time.sleep(1)  # pause between requests
            to_google_sen = ""

    # Send whatever is left over after the last full chunk.
    if to_google_sen:
        translated.append(jsonConversion(post_text(to_google_sen)))
        time.sleep(1)
    return '\n'.join(translated)


if __name__ == '__main__':
    for file_name in os.listdir("eng_txt"):
        print("source: " + file_name)
        with open("eng_txt/" + file_name, encoding='utf-8') as f:
            s = f.read()
        new_text = split_text(s)
        path_w = "jpn_txt/" + file_name
        with open(path_w, mode='w', encoding='utf-8') as f:
            f.write(new_text)

$ python translate_en_jp.py
source: report_4.txt
source: report_10.txt
source: report_2.txt
source: report_6.txt
source: report_9.txt
source: report_5.txt
source: report_8.txt
source: report_7.txt
source: report_3.txt
source: report_1.txt

Long text will take some time.
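
The reason is that translate_en_jp.py sends the text in chunks of roughly 1,000 characters and waits one second after each request. It builds the chunks by splitting on every ".", which also cuts decimals such as "3.14" in two. As an alternative sketch (not the repository's code), the chunker below splits only where sentence punctuation is followed by whitespace. The names `chunk_sentences` and `max_chars` are my own.

```python
import re


def chunk_sentences(text, max_chars=1000):
    """Group sentences into chunks of at most max_chars characters where possible.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after ., ! or ? only when whitespace follows, so "3.14" stays intact.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = sentence if not current else current + " " + sentence
        if current and len(candidate) > max_chars:
            # The chunk is full: store it and start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_sentences("Pi is 3.14. It is irrational! Is it useful? Yes.", max_chars=30))
```

Each returned chunk could then be passed to `post_text` in place of the string built inside `split_text`.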

Deliverables

The translated text files are written to the jpn_txt folder.

Now English PDFs are nothing to be afraid of! However, the text produced this way carries no layout information, and sentences that run across a page boundary may not be translated well. Ideally the scripts would handle that too, but it looks quite difficult. I hope this is useful when you want to read a lot of PDFs in Japanese.
