"Continued [Python] Let's automatically translate English PDF (but not limited to) with DeepL or Google Translate to make it a text file, no HTML." https://qiita.com/Cartelet/items/a00d4cec8216d04f9274
I tried the script from this article.
The article presents several versions; this time I used the "text decomposition enhanced version" code added on 8/11.
This version can split the text into paragraphs to some extent without going through Word. Since it translates (roughly) one paragraph at a time, it is much faster than translating one sentence at a time.
I saved this code as pdftrans.py.
I ran it as follows:
$ python3 pdftrans.py
and got this error:
Traceback (most recent call last):
File "pdftrans.py", line 1, in <module>
from selenium import webdriver
ModuleNotFoundError: No module named 'selenium'
so I installed selenium:
$ sudo pip3 install selenium
I needed pyperclip as well, so I installed it too:
$ sudo pip3 install pyperclip
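To avoid finding missing modules one traceback at a time, a small check like the one below could be run first. This is only a sketch: the list contains just the two modules I hit here (selenium and pyperclip), and `missing_modules` is a helper name I made up for illustration, not something from pdftrans.py.

```python
import importlib.util

# Third-party modules that pdftrans.py failed on in this walkthrough.
REQUIRED = ["selenium", "pyperclip"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
        print("Try: pip3 install " + " ".join(missing))
    else:
        print("All required modules are importable.")
```

The check only looks the modules up and does not import them, so it is cheap to run before starting the real script.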
After installing both, running the script again gave:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/common/service.py", line 76, in start
stdin=PIPE)
File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'chromedriver.exe': 'chromedriver.exe'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "pdftrans.py", line 165, in <module>
TranslateFromClipboard(*args)
File "pdftrans.py", line 75, in TranslateFromClipboard
chrome_options=options)
File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/chrome/webdriver.py", line 73, in __init__
self.service.start()
File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/common/service.py", line 83, in start
os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver.exe' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home
Because of this error, I downloaded chromedriver_linux64.zip from https://sites.google.com/a/chromium.org/chromedriver/home and unpacked it.
$ ls -alh
total 16M
drwxrwxr-x 2 nanbuwks nanbuwks 4.0K August 17 21:07 .
drwxrwxr-x 91 nanbuwks nanbuwks 56K August 17 21:06 ..
-rwxr-xr-x 1 nanbuwks nanbuwks 11M May 29 06:05 chromedriver
-rw-rw-r-- 1 nanbuwks nanbuwks 5.1M August 17 21:06 chromedriver_linux64.zip
-rw-r--r-- 1 nanbuwks nanbuwks 8.0K August 17 16:53 pdftrans.py
I placed chromedriver in the same directory as pdftrans.py.
Then I changed line 9 of pdftrans.py from
DRIVER_PATH = 'chromedriver.exe'
to
DRIVER_PATH = './chromedriver'
For the time being, I decided to always run pdftrans.py from the directory that contains the script, which saves me from setting PATH.
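A relative path like './chromedriver' only works when the current directory is the script's directory. A minimal sketch of a more forgiving lookup is shown below; `find_chromedriver` is a hypothetical helper of mine, not part of pdftrans.py. It looks next to the script first and then falls back to PATH.

```python
import os
import shutil
from pathlib import Path


def find_chromedriver(script_dir, name="chromedriver"):
    """Return a usable chromedriver path, or None if nothing is found.

    Looks in script_dir first, then falls back to searching PATH.
    """
    candidate = Path(script_dir) / name
    if candidate.is_file() and os.access(str(candidate), os.X_OK):
        return str(candidate.resolve())
    return shutil.which(name)  # None when it is not on PATH either


if __name__ == "__main__":
    # Hypothetical usage: the result could stand in for the hard-coded DRIVER_PATH.
    DRIVER_PATH = find_chromedriver(Path(__file__).resolve().parent)
    print(DRIVER_PATH or "chromedriver not found")
```

With something like this the script could be started from any directory, at the cost of a few extra lines.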
Running it again produced a different error:
Traceback (most recent call last):
File "pdftrans.py", line 165, in <module>
TranslateFromClipboard(*args)
File "pdftrans.py", line 75, in TranslateFromClipboard
chrome_options=options)
File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
desired_capabilities=desired_capabilities)
File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 84
I had been using Chrome version 83, so I updated Chrome to the latest version, which was 84.0.4147.125 (Official Build) (64-bit).
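The error comes down to the major versions of Chrome and ChromeDriver not matching. The sketch below compares them from version strings. It assumes the strings contain a four-part version number such as "84.0.4147.125"; the helper names are mine, and the ChromeDriver string in the example is an illustrative value, not one taken from my environment.

```python
import re


def major_version(version_text):
    """Extract the major version from text such as 'ChromeDriver 84.0.4147.30'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_text)
    if match is None:
        raise ValueError("no version number found in {!r}".format(version_text))
    return int(match.group(1))


def versions_match(chrome_text, driver_text):
    """True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_text) == major_version(driver_text)


if __name__ == "__main__":
    chrome = "Google Chrome 84.0.4147.125"  # version reported after my update
    driver = "ChromeDriver 84.0.4147.30"    # illustrative value (assumption)
    print(versions_match(chrome, driver))
```

Only the major version is compared here, since that is what the error message complains about.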
After the update, the script started normally:
$ python3 pdftrans.py
1.English → Japanese 2.Japanese → English 1
1. DeepL 2.GoogleTranslate 1
Do you want to export the translation result? Y/n y
1. txt 2. HTML 3. both 2
Enter a name for the output file (default is 'translated_text.html')
Please enter the title (of the paper) zigbeebdb
Would you like to see the translation progress here? Y/n n
Press Enter when ready
1/900 0% done
2/900 0% done
3/900 0% done
4/900 0% done
・
・
・
- This time I tried an 87-page PDF.
- In the example above, I opened the PDF in Google Chrome and copied its text to the clipboard, and it was processed as 900 sentences. Later, when I opened the same PDF in evince and copied it, it was processed as 1249 sentences.
- Processing 1249 sentences takes about 30 minutes.
- With DeepL it timed out at about 1/20 of the whole, so I ran it with Google Translate.
- If I was going to use Google Translate anyway, it might have been better to try the high-speed version from 8/16.
The article's 8/16 postscript describes that version: it is sped up by opening many Chrome instances with multithreading, and it exports HTML only. It also warns that with DeepL, opening too many instances at once gets you rate-limited and translation stops.
- I thought a long run would be fine as long as I could do other work in the meantime, but because the script takes control of the clipboard, working in parallel is tricky.
- The usage is explained in the article that the referenced article builds on, but I tried to use the script without reading it properly, and it did not go well. If you read it properly and copy the PDF text to the clipboard beforehand, it works fine.
"[Python] Let's automatically translate English PDF (but not limited to) with DeepL or Google Translate to make it a text file." https://qiita.com/Cartelet/items/c56477033cda17a2a28a
With 30 running in parallel, the same 87-page, 1249-sentence document as above was translated in about 10 minutes, but the load average got very high.
It seems best to tune the degree of parallelism to the environment so that the overhead does not eat up the gains.
In fact, when I set it to 10 in parallel, processing took about 6 minutes.
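The idea of capping the number of parallel workers can be sketched as follows. This is not the referenced script's code: `translate_all` and the stand-in translator are hypothetical names I chose, and a real worker would drive a browser session instead of upper-casing text. The point is only that `max_workers` is the knob to tune per environment.

```python
from concurrent.futures import ThreadPoolExecutor


def translate_all(chunks, translate, max_workers=10):
    """Run translate on every chunk with at most max_workers at once.

    Results come back in the same order as the input chunks.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(translate, chunks))


if __name__ == "__main__":
    # Stand-in translator (assumption); a real one would call the translation site.
    def fake_translate(text):
        return text.upper()

    paragraphs = ["first paragraph", "second paragraph", "third paragraph"]
    print(translate_all(paragraphs, fake_translate, max_workers=2))
```

In my case 10 was faster than 30, so it is worth timing a few values instead of assuming that more workers is better.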