[Python] Trying to translate an English PDF, Part 1

"Continued [Python] Let's automatically translate English PDF (but not limited to) with DeepL or Google Translate to make it a text file, no HTML." https://qiita.com/Cartelet/items/a00d4cec8216d04f9274

I tried the script from the article above.

Environment

The article presents several versions of the script. This time I used the "text decomposition enhanced version code" added on 8/11.

According to the article, this version can split the text into paragraphs to some extent without going through Word. Because the text is translated (roughly) one paragraph at a time, translation is much faster than going one sentence at a time.

I saved this code as pdftrans.py.

Library settings, etc.


$ python3 pdftrans.py 

Running the script as above produced:


Traceback (most recent call last):
  File "pdftrans.py", line 1, in <module>
    from selenium import webdriver
ModuleNotFoundError: No module named 'selenium'

Because of this error, I installed selenium:


$ sudo pip3 install selenium

I needed pyperclip as well, so I installed it the same way:


$ sudo pip3 install pyperclip

With that, both libraries were installed.
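The two missing modules surfaced one run at a time. As a minimal sketch (not part of pdftrans.py), a check like the following could report every missing module up front. The list REQUIRED_MODULES and the helper missing_modules are names made up for illustration; selenium and pyperclip are the only dependencies known from the steps above.

```python
import importlib.util

# Modules the script turned out to need in this walkthrough (assumed list).
REQUIRED_MODULES = ["selenium", "pyperclip"]


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
        print("Install with: pip3 install " + " ".join(missing))
    else:
        print("All required modules are available.")
```

This only checks that the modules can be located. It says nothing about the browser driver, which is a separate requirement.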


Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/common/service.py", line 76, in start
    stdin=PIPE)
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'chromedriver.exe': 'chromedriver.exe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pdftrans.py", line 165, in <module>
    TranslateFromClipboard(*args)
  File "pdftrans.py", line 75, in TranslateFromClipboard
    chrome_options=options)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/chrome/webdriver.py", line 73, in __init__
    self.service.start()
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/common/service.py", line 83, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver.exe' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

The error says that chromedriver needs to be in PATH, so I downloaded chromedriver_linux64.zip from https://sites.google.com/a/chromium.org/chromedriver/home and unpacked it.


$ ls -alh
total 16M
drwxrwxr-x  2 nanbuwks nanbuwks 4.0K Aug 17 21:07 .
drwxrwxr-x 91 nanbuwks nanbuwks  56K Aug 17 21:06 ..
-rwxr-xr-x  1 nanbuwks nanbuwks  11M May 29 06:05 chromedriver
-rw-rw-r--  1 nanbuwks nanbuwks 5.1M Aug 17 21:06 chromedriver_linux64.zip
-rw-r--r--  1 nanbuwks nanbuwks 8.0K Aug 17 16:53 pdftrans.py

I placed chromedriver in the same directory as pdftrans.py.

Next, I edited line 9 of pdftrans.py, which originally read:


DRIVER_PATH = 'chromedriver.exe'

I changed this line as follows:


DRIVER_PATH = './chromedriver'

For now, I decided to always run pdftrans.py from the directory that contains the script, which avoids having to set PATH.
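Running from the script's directory works because './chromedriver' is resolved against the current working directory. One alternative, sketched here under the assumption that the driver always sits next to the script, is to build the path from the script's own location. The helpers resolve_driver_path and check_driver are hypothetical and do not appear in pdftrans.py.

```python
import os
from pathlib import Path


def resolve_driver_path(script_file, driver_name="chromedriver"):
    """Return the absolute path of `driver_name` located next to `script_file`."""
    return Path(script_file).resolve().parent / driver_name


def check_driver(path):
    """Return an error message if `path` is unusable, or None if it looks fine."""
    if not path.is_file():
        return f"{path} does not exist"
    if not os.access(path, os.X_OK):
        return f"{path} is not executable (try: chmod +x {path})"
    return None
```

With these helpers, line 9 of pdftrans.py could read DRIVER_PATH = str(resolve_driver_path(__file__)), which should keep working no matter which directory the script is started from.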


Traceback (most recent call last):
  File "pdftrans.py", line 165, in <module>
    TranslateFromClipboard(*args)
  File "pdftrans.py", line 75, in TranslateFromClipboard
    chrome_options=options)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
    desired_capabilities=desired_capabilities)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 84

This error says that this ChromeDriver only supports Chrome version 84. I had been using Chrome version 83, so I updated Chrome to the latest version, 84.0.4147.125 (Official Build) (64-bit).
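ChromeDriver builds of this kind are tied to a Chrome major version, which is what the error reports. As a rough sketch, the mismatch can be caught before launching by comparing the major numbers in the two version strings, for example as printed by google-chrome --version and chromedriver --version (the exact command names depend on the installation). The function names are illustrative, and any driver version string passed in is up to the reader, since the exact driver version is not stated above.

```python
import re


def major_version(version_text):
    """Extract the major version from text such as 'Google Chrome 84.0.4147.125'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def versions_compatible(chrome_text, driver_text):
    """Return True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_text) == major_version(driver_text)
```

For the Chrome build mentioned above, major_version("84.0.4147.125 (Official Build) (64-bit)") yields 84, which matches the version named in the error message.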

Run


$ python3 pdftrans.py 
1.English → Japanese 2.Japanese → English 1
1. DeepL 2.GoogleTranslate  1
Do you want to export the translation result? Y/n  y
1. txt 2. HTML 3. both    2
Enter a name for the output file (default is'translated_text.html')  
Please enter the title (of the paper) zigbeebdb
Would you like to see the translation progress here? Y/n  n
Press Enter when ready

1/900  0% done


2/900  0% done


3/900  0% done


4/900  0% done

・
・
・

Result

[Image: screenshot of the translation result]

Impressions after trying it

- This time, I tried an 87-page PDF.
- In the example above, I opened the PDF in Google Chrome and copied it to the clipboard, and it was processed as 900 sentences. Later, when I opened the PDF in evince and copied it to the clipboard, it was processed as 1249 sentences.
- Processing 1249 sentences takes about 30 minutes.
- With DeepL, roughly 1/20 of the whole timed out and appears to have been run through Google Translate instead.
- If Google Translate is going to be used anyway, it might have been better to try the high-speed version added on 8/16.

The article's 8/16 postscript says: Sped up by opening a large number of Chrome instances with multithreading. This version exports HTML only. Also, in the case of DeepL, please note that if you open too many instances, you will be restricted and translation will stop.

- I thought a long run would be fine as long as I could do other work in the meantime, but because the script takes over the clipboard, working in parallel is tricky.
- The usage is described in the original article that the referenced article builds on, but I tried the script without reading it properly, which was a mistake. If you read it properly and copy the PDF's text to the clipboard beforehand, it works fine!
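Since the script works on whatever text is in the clipboard, the way a PDF viewer lays out copied text can change the result, which is one plausible reason for the 900 versus 1249 sentence counts above. The following is a naive sketch of cleaning and splitting copied text. It is not the article's algorithm; the function names are made up, and the splitting rule (a period, exclamation mark, or question mark followed by whitespace) is an assumption.

```python
import re


def normalize_pdf_text(raw):
    """Undo hard line wraps in copied PDF text and return a list of paragraphs."""
    text = raw.replace("\r\n", "\n")
    # Re-join words hyphenated across a line break: "transla-\ntion" -> "translation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines separate paragraphs; single newlines become spaces.
    paragraphs = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


def split_sentences(paragraph):
    """Naively split a paragraph after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def clipboard_sentences():
    """Read the clipboard and return a flat list of sentences."""
    import pyperclip  # third-party; imported here so the helpers above work without it

    paragraphs = normalize_pdf_text(pyperclip.paste())
    return [s for p in paragraphs for s in split_sentences(p)]
```

A rule this simple will mis-split abbreviations such as "e.g." and will treat every blank line as a paragraph break, so the sentence count depends directly on where the viewer puts line breaks.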

"[Python] Let's automatically translate English PDF (but not limited to) with DeepL or Google Translate to make it a text file." https://qiita.com/Cartelet/items/c56477033cda17a2a28a

Postscript: trying the high-speed version from 8/16

With 30 parallel instances, the same 87-page, 1249-sentence document as above could be translated in about 10 minutes, but the load average became very high.
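To put the figures in perspective, here is a small back-of-the-envelope calculation using only the numbers reported in this post: about 30 minutes one sentence at a time, and about 10 minutes with 30 parallel instances. The helper speedup is a hypothetical name, and the comparison is rough, since the two runs did not necessarily use the same translation service.

```python
def speedup(baseline_minutes, parallel_minutes, workers):
    """Return (speedup, efficiency) of a parallel run relative to a baseline run."""
    factor = baseline_minutes / parallel_minutes
    return factor, factor / workers


if __name__ == "__main__":
    # Figures from this post: ~30 min sequentially, ~10 min with 30 instances.
    factor, efficiency = speedup(30, 10, 30)
    print(f"speedup: {factor:.1f}x, efficiency per instance: {efficiency:.0%}")
```

That works out to roughly a 3x speedup at about 10% efficiency per instance, which suggests that most of the 30 instances were waiting on something other than translation.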

It seems worth tuning the number of parallel instances to the environment so that overhead does not eat up the gains.

In fact, when I set it to 10 parallel instances, processing took only about 6 minutes.
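The number of simultaneous instances is the knob being tuned here. Below is a minimal sketch of a bounded worker pool, assuming each sentence can be translated independently. fake_translate is a stand-in that only upper-cases text, and translate_all and max_workers are illustrative names; the article's code drives real Chrome instances and is not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_translate(sentence):
    """Stand-in for one browser-driven translation; just upper-cases the text."""
    return sentence.upper()


def translate_all(sentences, translate=fake_translate, max_workers=10):
    """Translate sentences with at most `max_workers` running at once, keeping order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(translate, sentences))


if __name__ == "__main__":
    print(translate_all(["first sentence.", "second sentence."], max_workers=2))
```

Keeping max_workers as a parameter makes it easy to compare settings such as 10 and 30 on a given machine rather than hard-coding one value.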
