An improved version with easier-to-read output is available in the follow-up post: Continued [Python] Let's automatically translate English PDFs (and not only PDFs) with DeepL or Google Translate into a text file, or rather, into HTML.
Reading papers in English is hard, isn't it? Let's have them translated; it makes them much easier to get through.
The trouble with translating PDFs is that PDF files are awkward to handle. Even if you rely on a library to extract the text automatically, it often fails or scrambles the order of the sentences. So this time the text goes through the clipboard instead.
The flow is as follows:
Open the PDF file in Chrome (or similar), select everything with Ctrl + A, and copy it
↓
Run the program
↓
The text is split into chunks that end at a period and stay under the character limit (5,000 characters)
↓
Each chunk is sent to the translation site
↓
The result is retrieved
↓
Output
That's all there is to it.
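To make the splitting step concrete, here is a minimal sketch with a tiny limit so the behaviour is easy to see. The function name `split_at_periods` and the `limit` parameter are made up for this illustration; the script's own version is `parse_merge`, which uses a limit of 4900 characters.

```python
def split_at_periods(text, limit):
    """Group period-terminated sentences into chunks of at most `limit` characters.

    A single sentence longer than `limit` is kept whole rather than cut mid-sentence.
    """
    chunks, current = [], ""
    for piece in text.split("."):
        if not piece.strip():
            continue  # nothing between two periods, or after the final one
        sentence = piece.strip() + ". "
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > limit:
            chunks.append(current.strip())
            current = ""
        current += sentence
    if current:
        chunks.append(current.strip())
    return chunks


demo = "First sentence. Second one. Third."
for chunk in split_at_periods(demo, limit=30):
    print(repr(chunk))
# 'First sentence. Second one.'
# 'Third.'
```

Cutting only at periods keeps each request a run of whole sentences, which tends to matter for translation quality more than filling every request to the limit.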
pip install selenium
pip install pyperclip
Download ChromeDriver (the version that matches your installed Chrome) and put it in the same directory as the script.
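Before launching the browser it can help to confirm that the driver file is actually there. The following is only a sketch: the helper name `find_chromedriver` is hypothetical, and it assumes the driver is named `chromedriver.exe` (Windows) or `chromedriver` (macOS/Linux) and sits in the current working directory.

```python
from pathlib import Path


def find_chromedriver(directory="."):
    """Return the path of a ChromeDriver binary in `directory`, or None if there is none."""
    for name in ("chromedriver.exe", "chromedriver"):
        candidate = Path(directory) / name
        if candidate.is_file():
            return str(candidate)
    return None


path = find_chromedriver()
print(path if path else "ChromeDriver was not found in the current directory")
```

The returned path could be used in place of a hard-coded driver path, so the same script runs on systems where the file has no `.exe` suffix.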
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
import pyperclip as ppc

# ChromeDriver is expected in the same directory as this script.
DRIVER_PATH = 'chromedriver.exe'

options = Options()
options.add_argument('--disable-gpu')
options.add_argument('--disable-extensions')
options.add_argument('--proxy-server="direct://"')
options.add_argument('--proxy-bypass-list=*')
options.add_argument('--start-maximized')


def parse_merge(text, n=4900):
    """Split text into chunks of at most about n characters, cutting only at periods."""
    sentences = []
    sentence = ""
    # Join all lines with spaces, then split at every period.
    pieces = " ".join(text.splitlines()).split(".")
    for idx, piece in enumerate(pieces):
        # Restore the period that split() removed (the last piece had none).
        part = piece + "." if idx < len(pieces) - 1 else piece
        # Start a new chunk when this piece would push the current one past n.
        if sentence and len(sentence) + len(part) > n:
            sentences.append(sentence)
            sentence = ""
        sentence += part
    if sentence.strip():
        sentences.append(sentence)
    return sentences


def TranslateFromClipboard(tool, write, filename, isPrint):
    # Selenium 4 style: the driver path is passed through Service.
    driver = webdriver.Chrome(service=Service(executable_path=DRIVER_PATH),
                              options=options)
    url = ('https://www.deepl.com/ja/translator' if tool == "DeepL" else
           'https://translate.google.co.jp/?hl=ja&tab=TT&authuser=0#view=home&op=translate&sl=auto&tl=ja')
    driver.get(url)
    transSentence = ""
    # The selectors below depend on each site's page markup.
    if tool == "DeepL":
        textarea = driver.find_element(
            By.CSS_SELECTOR,
            '.lmt__textarea.lmt__source_textarea.lmt__textarea_base_style')
    elif tool == "GT":
        textarea = driver.find_element(By.ID, 'source')
    for sentence in parse_merge(ppc.paste()):
        # Paste the chunk via the clipboard, then restore the original clipboard text.
        cbText = ppc.paste()
        ppc.copy(sentence)
        textarea.send_keys(Keys.CONTROL, "v")
        ppc.copy(cbText)
        # Poll once per second until a translation appears.
        transtext = ""
        while transtext == "":
            if tool == "DeepL":
                transtext = driver.find_element(
                    By.CSS_SELECTOR,
                    '.lmt__textarea.lmt__target_textarea.lmt__textarea_base_style'
                ).get_property("value")
            elif tool == "GT":
                try:
                    transtext = driver.find_element(
                        By.CSS_SELECTOR, '.tlid-translation.translation').text
                except NoSuchElementException:
                    pass  # the result element is not on the page yet
            time.sleep(1)
        if isPrint:
            print(transtext)
        transSentence += transtext
        # Clear the input box for the next chunk.
        textarea.send_keys(Keys.CONTROL, "a")
        textarea.send_keys(Keys.BACKSPACE)
    driver.quit()
    if write:
        with open(filename, "w", encoding='UTF-8') as f:
            # One sentence per line, breaking at the Japanese full stop.
            for sentence in transSentence.split("。"):
                if sentence.strip():
                    f.write(sentence + "。\n")


if __name__ == "__main__":
    # [tool, write to file?, output file name, print progress?]
    args = ["DeepL", False, "translated_text.txt", True]
    if input("1. DeepL 2. GoogleTranslate ") == "2":
        args[0] = "GT"
    if input("Do you want to write the translation result to a file? y/N ").lower() == "y":
        args[1] = True
        filename = input(
            "Enter a name for the output file (default is 'translated_text.txt') ")
        if filename:
            args[2] = filename
    if input("Would you like to see the translation progress here? Y/n ").lower() == "n":
        args[3] = False
    input("Press Enter when ready")
    TranslateFromClipboard(*args)
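The wait loop in `TranslateFromClipboard` polls until a translation appears and never gives up, so a changed page layout or a network problem would leave it spinning forever. One possible refinement, shown here only as a sketch, is a polling helper with a timeout. The name `wait_for_text` and its parameters are hypothetical, and `read_text` stands for any function that returns the current content of the result box.

```python
import time


def wait_for_text(read_text, timeout=60.0, interval=1.0):
    """Call read_text() repeatedly until it returns a non-empty string.

    Raises TimeoutError if nothing has appeared after about `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        text = read_text()
        if text:
            return text
        if time.monotonic() >= deadline:
            raise TimeoutError("no translation appeared within the time limit")
        time.sleep(interval)


# Stand-in for the browser: the result is empty twice, then the translation arrives.
replies = iter(["", "", "こんにちは"])
print(wait_for_text(lambda: next(replies), timeout=5, interval=0.1))
```

In the script, `read_text` would be a small function wrapping the `find_element(...)` lookup for the chosen site.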
When writing the text file, the output is simply broken into lines at each full stop (。) for now, so adjust that as you see fit.
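If you want to change how lines are broken, it may be easier to pull that step into a small helper. The sketch below is one way to do it, not part of the original script: the name `break_at_full_stops` is hypothetical, and `mark` is assumed to be a single character.

```python
import re


def break_at_full_stops(text, mark="。"):
    """Put each sentence on its own line, breaking after every `mark`.

    Text after the last mark is kept as it is, without a mark being added to it.
    """
    # One match = a run of non-mark characters plus the mark that ends it, if any.
    pattern = "[^{0}]+{0}?".format(re.escape(mark))
    sentences = [s.strip() for s in re.findall(pattern, text)]
    return "".join(s + "\n" for s in sentences if s)


print(break_at_full_stops("これは論文です。結果は良好でした。"), end="")
# これは論文です。
# 結果は良好でした。
```

Passing `mark="."` gives the same behaviour for English output.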
DeepL appears to offer document translation as a paid feature, and Google Translate can translate documents directly. This script is for those who would rather get the benefit for free, or who are stuck because parts of a document were left untranslated or the mathematical formulas came out garbled. Give it a try; it may help you make progress.