When you want to translate a PDF, the built-in document translation in Google Translate is the first thing that comes to mind, but it has a file size limit, so it does not work for every PDF. In a previous article, I called the Google Translate API to translate English text into Japanese. If the text you want to read is in a PDF, you can extract it with Google Docs or Adobe Acrobat, but that takes many manual steps. It turns out that Python can handle this whole series of tasks as well, using a library called PDFMiner. I relied heavily on the following article: [PDF Miner] Extracting text from PDF
The script I created is here: https://github.com/KanikaniYou/translate_pdf
It extracts the text from every PDF in a given folder, translates it, and writes the result out as text files.
The text extracted by PDFMiner includes unwanted strings (for example, runs of a repeated symbol such as "......", which appear in tables of contents). For that reason, the extracted text is saved as intermediate files so that I can remove the unnecessary parts by hand.
The overall flow of the translation work is as follows.
0. Get a Google Translate API key by referring to Quickstart | Google Cloud Translation API Documentation | Google Cloud Platform.
1. Extract the text from the PDFs with pdf_to_txt.py.
2. Remove the line breaks from the extracted text with let_translatable.py.
3. Translate the text with translate_en_jp.py.
As mentioned above, by looking through the text files by hand after step 1, you can check what PDFMiner extracted and keep only the parts you need, so that you do not call the Google Translate API wastefully.
Any Linux system with Python 3 should work. My environment is Cloud9 on Ubuntu 18.04.
pip install pdfminer.six
By the way, PDFMiner is very useful, but it seems prone to garbled characters when extracting Japanese and similar text. This time I am extracting English, so I do not expect that problem to come up often.
A known bug related to Japanese extraction in PDFMiner: Still have issues with CID Characters #39
git clone https://github.com/KanikaniYou/translate_pdf
cd translate_pdf
The file structure is as follows. (For the sake of explanation, I have already placed the 10 PDFs that I want to translate.)
├── eng_txt
├── eng_txt_split
├── jpn_txt
├── let_translatable.py
├── pdf_source
│ ├── report_1.pdf
│ ├── report_10.pdf
│ ├── report_2.pdf
│ ├── report_3.pdf
│ ├── report_4.pdf
│ ├── report_5.pdf
│ ├── report_6.pdf
│ ├── report_7.pdf
│ ├── report_8.pdf
│ └── report_9.pdf
├── pdf_to_txt.py
└── translate_en_jp.py
import sys
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTContainer, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
import os
import re

def find_textboxes_recursively(layout_obj):
    # Collect every text box below a layout object.
    if isinstance(layout_obj, LTTextBox):
        return [layout_obj]
    if isinstance(layout_obj, LTContainer):
        boxes = []
        for child in layout_obj:
            boxes.extend(find_textboxes_recursively(child))
        return boxes
    return []

def pdf_read_controller(filepath):
    # Return the text of one PDF, or "no-text" if the file cannot be read.
    try:
        text_in_pdf = ""
        with open(filepath, 'rb') as f:
            for page in PDFPage.get_pages(f):
                try:
                    interpreter.process_page(page)
                    layout = device.get_result()
                    boxes = find_textboxes_recursively(layout)
                    # Sort top to bottom, then left to right.
                    boxes.sort(key=lambda b:(-b.y1, b.x0))
                    text_in_page = ""
                    for box in boxes:
                        text_in_box = ""
                        text_in_box += box.get_text().strip().strip(" ")
                        # Collapse runs of spaces into one space.
                        text_in_box = re.sub(r' {2,}', " ", text_in_box)
                        text_in_page += text_in_box
                    text_in_pdf += text_in_page
                except Exception as e:
                    # A page that fails to parse is reported and skipped.
                    print(e)
        return text_in_pdf
    except Exception as e:
        print(e)
        print("error: " + filepath)
        return "no-text"

def make_txtfile(folder_path,file_name,text='error'):
    if text != "no-text":
        with open(folder_path+"/"+file_name, mode='w') as f:
            f.write(text)

laparams = LAParams(detect_vertical=True)
resource_manager = PDFResourceManager()
device = PDFPageAggregator(resource_manager, laparams=laparams)
interpreter = PDFPageInterpreter(resource_manager, device)

if __name__ == '__main__':
    for file_name in os.listdir("pdf_source"):
        if file_name.endswith(".pdf"):
            text_in_page = pdf_read_controller("pdf_source/" + file_name)
            # Output folder assumed to be eng_txt_split, which let_translatable.py reads.
            txt_name = os.path.splitext(file_name)[0] + ".txt"
            make_txtfile("eng_txt_split", txt_name, text_in_page)
This reads every PDF file in the pdf_source folder and writes its text out as a text file.
$ python pdf_to_txt.py
unpack requires a buffer of 10 bytes
unpack requires a buffer of 8 bytes
unpack requires a buffer of 6 bytes
unpack requires a buffer of 6 bytes
unpack requires a buffer of 4 bytes
Some errors are printed, but the script ignores them and still creates the text files. PDF is a confusing format. A similar error: [struct.error: unpack requires a string argument of length 16](https://stackoverflow.com/questions/40158637/struct-error-unpack-requires-a-string-argument-of-length-16)
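The sort key used in pdf_read_controller, `(-b.y1, b.x0)`, orders text boxes from top to bottom and then from left to right, because PDF coordinates grow upward from the bottom of the page. Here is a minimal sketch of that ordering. The `Box` tuple and `reading_order` function are hypothetical stand-ins for illustration, not part of PDFMiner or of the script.

```python
from collections import namedtuple

# Hypothetical stand-in for LTTextBox: only the attributes used by the sort key.
Box = namedtuple("Box", ["text", "x0", "y1"])

def reading_order(boxes):
    """Sort boxes top to bottom, then left to right.

    y1 is the top edge of a box. PDF y coordinates grow upward,
    so a larger y1 means the box sits higher on the page.
    """
    return sorted(boxes, key=lambda b: (-b.y1, b.x0))

boxes = [
    Box("footer", x0=50, y1=40),
    Box("right column", x0=300, y1=700),
    Box("title", x0=50, y1=780),
    Box("left column", x0=50, y1=700),
]
ordered_texts = [b.text for b in reading_order(boxes)]
print(ordered_texts)
```

Note that this ordering knows nothing about columns, so on a multi-column page the boxes of different columns can end up interleaved.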
For example, some of the extracted text contained unwanted parts such as the runs of dots from a table of contents. I don't want to call the Google Translate API in vain, so delete the parts you don't need.
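Part of this cleanup could be automated before the manual check. The following is only a sketch, not something the script does: it drops lines that contain a "dot leader", which I assume here to mean a run of four or more dots, optionally separated by spaces. The function name and the threshold are my own choices.

```python
import re

# Assumption: a dot leader is a run of 4 or more dots, optionally separated by spaces.
DOT_LEADER = re.compile(r"(?:\.\s?){4,}")

def drop_dot_leader_lines(text):
    """Return the text without the lines that contain a dot leader."""
    kept = [line for line in text.splitlines() if not DOT_LEADER.search(line)]
    return "\n".join(kept)

sample = (
    "1 Introduction ........ 3\n"
    "This report covers... the results.\n"
    "2 Method . . . . . 7"
)
cleaned = drop_dot_leader_lines(sample)
print(cleaned)
```

An ordinary ellipsis of three dots is kept. A filter like this will not catch every kind of noise, so looking through the files by hand is still worthwhile.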
import os

if __name__ == '__main__':
    for file_name in os.listdir("eng_txt_split"):
        if file_name.endswith(".txt"):
            text = ""
            with open("eng_txt_split/"+file_name) as f:
                l = f.readlines()
                for line in l:
                    # Drop the line break at the end of each line.
                    text += str(line).rstrip('\n')
            path_w = "eng_txt/" + file_name
            with open(path_w, mode='w') as f:
                f.write(text)
The text that PDFMiner produces is full of line breaks, and Google Translate does not seem to translate it well as is. So this script creates new text files with the line breaks removed and writes them to the eng_txt folder.
$ python let_translatable.py
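let_translatable.py concatenates the lines with nothing between them, so if an extracted line does not end with a space, its last word gets glued to the first word of the next line. Here is a sketch of an alternative that joins lines with a single space and re-joins words hyphenated across a line break. The `join_lines` function is hypothetical and changes the script's behavior, so treat it as an option rather than a drop-in fix.

```python
import re

def join_lines(text):
    """Join hard-wrapped lines into one line of text.

    Assumptions: a trailing hyphen marks a word split across lines,
    and every other line break can be replaced by one space.
    """
    # Re-join words hyphenated across a line break: "transla-\ntion" -> "translation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace the remaining line breaks (and the spaces around them) with one space.
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()

sample = "Machine transla-\ntion works best\non whole sentences.\n"
joined = join_lines(sample)
print(joined)
```

The hyphen rule also removes the hyphen of a genuine compound that happens to break at the end of a line, so it trades one kind of error for another.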
Finally, the resulting text is translated. For details on how the translation call works, please refer to the previous article mentioned above.
import requests
import json
import os
import re
import time

API_key = '<Enter your API key here>'

def post_text(text):
    # Send one chunk of English text to the Translation API and return the raw JSON.
    url_items = 'https://www.googleapis.com/language/translate/v2'
    item_data = {
        'target': 'ja',
        'source': 'en',
        'q': text
    }
    response = requests.post('{}?key={}'.format(url_items, API_key), data=item_data)
    return response.text

def jsonConversion(jsonStr):
    # Pull the translated string out of the JSON response.
    data = json.loads(jsonStr)
    return data["data"]["translations"][0]["translatedText"]

def split_text(text):
    # Translate the text in chunks of a little over 1000 characters, cut at sentence ends.
    sen_list = text.split('.')
    to_google_sen = ""
    from_google = ""
    for index, sen in enumerate(sen_list[:-1]):
        to_google_sen += sen + '. '
        if len(to_google_sen)>1000:
            from_google += jsonConversion(post_text(to_google_sen)) +'\n'
            to_google_sen = ""
        # After the last sentence, send whatever is left (skip if nothing is left).
        if index == len(sen_list)-2 and to_google_sen:
            from_google += jsonConversion(post_text(to_google_sen))
    return from_google

if __name__ == '__main__':
    for file_name in os.listdir("eng_txt"):
        print("source: " + file_name)
        with open("eng_txt/"+file_name) as f:
            s = f.read()
        new_text = split_text(s)
        path_w = "jpn_txt/" + file_name
        with open(path_w, mode='w') as f:
            f.write(new_text)
$ python translate_en_jp.py
source: report_4.txt
source: report_10.txt
source: report_2.txt
source: report_6.txt
source: report_9.txt
source: report_5.txt
source: report_8.txt
source: report_7.txt
source: report_3.txt
source: report_1.txt
Long text will take some time.
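The time grows with the number of requests, because split_text sends the text in chunks of a little over 1000 characters. The chunking can be sketched separately from the API call, which makes it possible to check how many requests a file will need before spending any quota. The `chunk_sentences` function is a hypothetical rewrite for illustration, and unlike split_text it strips the spaces around each sentence.

```python
def chunk_sentences(text, limit=1000):
    """Group sentences into chunks, closing a chunk once it exceeds `limit` characters.

    Sentences are found by splitting on '.', as in split_text.
    Like that function, any text after the final '.' is ignored.
    """
    chunks = []
    current = ""
    for sentence in text.split(".")[:-1]:
        current += sentence.strip() + ". "
        if len(current) > limit:
            chunks.append(current)
            current = ""
    if current:  # flush the last, shorter chunk
        chunks.append(current)
    return chunks

text = "One. Two. Three. Four."
chunks = chunk_sentences(text, limit=10)
print(chunks)
print("requests needed:", len(chunks))
```

Splitting on '.' also cuts inside decimals such as 3.14 and abbreviations such as "e.g.", so the chunks are only approximately sentence-aligned.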
The translated text file will be in the jpn_txt folder.
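If a translated file comes out empty or the script stops with a KeyError, the response probably did not have the shape that jsonConversion expects. Below is a sketch of a more defensive parser. It assumes that a failed request returns a JSON object with an "error" key containing a "message", and that the translated text may contain HTML entities such as `&#39;`. Both are my understanding of the v2 API rather than something the script relies on, and the sample response is hand-written, not real API output.

```python
import html
import json

def extract_translation(json_str):
    """Return the first translated text from a Translation API v2 style response.

    Raises RuntimeError if the response carries an "error" object instead.
    """
    data = json.loads(json_str)
    if "error" in data:
        raise RuntimeError(data["error"].get("message", "translation request failed"))
    translated = data["data"]["translations"][0]["translatedText"]
    # Decode HTML entities such as &#39; that can appear in the translated text.
    return html.unescape(translated)

# Hand-written sample in the shape that jsonConversion expects.
sample = '{"data": {"translations": [{"translatedText": "It&#39;s done"}]}}'
result = extract_translation(sample)
print(result)
```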
Now you don't have to be afraid of English PDFs! However, the text produced this way has no concept of layout, and text that continues across a page break may not be translated well. Ideally the script would handle that too, but it seems quite difficult. I hope this is useful when you want to read a lot of PDFs in Japanese.