[Python] Nicely format text copied from a PDF that is full of line feed codes

Introduction

I originally wrote this function for use in two earlier articles, "[Python] Let's automatically translate English PDF (but not limited to) with DeepL or Google Translate to make it a text file" and its sequel "Continued [Python] Let's automatically translate English PDF (but not limited to) with DeepL or Google Translate to make a text file, no HTML".

It seems useful on its own, so I am introducing it separately here.

Problems with text copied from PDF

I don't have detailed knowledge of the PDF format, but it seems that text in a PDF is stored in small fragments, and text copied from a PDF contains a line feed code wherever a line ends on the displayed page.

For example, if the PDF displays the following (here \\\ marks a line break on the page):

ABC.\\\DEF.\\\GHI.

then the copied text is

ABC.{\r\n}DEF.{\r\n}GHI.

and so on. (The example above is for Windows, where the line feed code is \r\n.)
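As a small illustration (not part of the original code), Python's str.splitlines() treats \r\n, \n and \r alike, so Windows-style text like the example above splits into its visual lines without special handling. The variable name copied and the sample string are hypothetical.

```python
# Text as it might arrive from the clipboard on Windows (hypothetical sample).
copied = "ABC.\r\nDEF.\r\nGHI."

# splitlines() recognises \r\n, \n and \r as line boundaries.
lines = copied.splitlines()
print(lines)  # ['ABC.', 'DEF.', 'GHI.']
```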

So my first thought was simply to delete the line feed codes and join the sentences:

ABC.DEF.GHI.

In this example each sentence ends with a period, so the sentences can still be told apart after joining.


However, this alone does not solve everything.

What about the following cases?

1. Introduction\\\ABCDEF.\\\GHIJKL.\\\MNOPQR.

If you simply delete the line feed codes, you get

1. IntroductionABCDEF.GHIJKL.MNOPQR.

Now the first line, which has no period, can no longer be distinguished from the second.
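The two cases above can be reproduced with a minimal sketch of the naive approach. The helper name naive_join is hypothetical and is not the function introduced later in this article.

```python
def naive_join(text):
    """Delete every line break and join the lines with nothing in between."""
    return "".join(text.splitlines())


# Every line ends with a period, so the sentence boundaries survive.
print(naive_join("ABC.\r\nDEF.\r\nGHI."))
# ABC.DEF.GHI.

# The heading has no period, so it fuses with the first sentence.
print(naive_join("1. Introduction\r\nABCDEF.\r\nGHIJKL.\r\nMNOPQR."))
# 1. IntroductionABCDEF.GHIJKL.MNOPQR.
```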

So the problem is how to infer and separate parts such as headings, which do not necessarily end with a delimiter like a period, from copied text whose only hints are the sentences themselves and the line feed codes. That is the crux.

What I did

  1. Split on line feed codes
  2. Remove blank lines
  3. Guess whether a line is body text or a heading from the difference in character count between the line of interest and the next line
  4. Check whether the first letter of the next line is lowercase
  5. If a line is written entirely in capital letters, treat it as a heading
  6. If a line starts with a number (Arabic or Roman numeral) followed by a period, treat it as a heading
  7. Even if the character counts of the line of interest and the next line differ greatly, treat them as one continuous sentence if the next line is shorter and ends with a period (or similar punctuation mark)
  8. Treat lines as one continuous sentence until every open parenthesis is closed

These are the rules I adopted. They are quite simple, but the resulting function can split most text into any of the following units:

・ Heading ・ Paragraph ・ Sentence
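To make rules 5, 6 and 8 from the list above concrete, here is a simplified sketch of each as a standalone check. The function names and the simplified patterns are assumptions for illustration; the actual function in the next section combines these ideas with the line-length comparison in one loop and uses more permissive patterns.

```python
import re


def is_all_caps(line):
    """Rule 5: the line consists only of capital letters and whitespace."""
    return re.fullmatch(r"[A-Z\s]+", line) is not None


def starts_with_number(line):
    """Rule 6 (simplified): Arabic or Roman numeral followed by a period."""
    return re.match(r"(\d{1,2}|[IVX]+)\.", line) is not None


def brackets_open(lines):
    """Rule 8: True while more '(' than ')' have been seen so far."""
    joined = "".join(lines)
    return joined.count("(") > joined.count(")")


print(is_all_caps("USING THE PYTHON INTERPRETER"))  # True
print(starts_with_number("1. Introduction"))        # True
print(starts_with_number("Introduction"))           # False
print(brackets_open(["(E.g., a popular alternative"]))                # True
print(brackets_open(["(E.g., a popular alternative", "location.)"]))  # False
```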

Code

import re
import unicodedata


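def len_(text):
    """Return the display width of text: full-width (F), wide (W) and
    ambiguous (A) East Asian characters count as 2, all others as 1."""
    cnt = 0
    for t in text:
        if unicodedata.east_asian_width(t) in "FWA":
            cnt += 2
        else:
            cnt += 1
    return cnt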


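def textParser(text, n=30, bracketDetect=True):
    """Split text copied from a PDF into headings, paragraphs and sentences.

    n: threshold on the display-width difference between adjacent lines.
    bracketDetect: if True, keep joining lines while a bracket is still open.
    """
    text = text.splitlines()
    sentences = []
    t = ""  # lines accumulated so far for the current unit
    bra_cnt = ket_cnt = bra_cnt_jp = ket_cnt_jp = 0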
    for i in range(len(text)):
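        if not bool(re.search(r"\S", text[i])): continue  # skip blank lines
        if bracketDetect:
            # Running counts of opening and closing brackets. The full-width
            # and Japanese bracket characters are a reconstruction (assumed):
            # the originals were lost in conversion, leaving an invalid pattern.
            bra_cnt += len(re.findall(r"[\(（]", text[i]))
            ket_cnt += len(re.findall(r"[\)）]", text[i]))
            bra_cnt_jp += len(re.findall(r"[「『]", text[i]))
            ket_cnt_jp += len(re.findall(r"[」』]", text[i]))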
        if i != len(text) - 1:
            if bool(re.fullmatch(r"[A-Z\s]+", text[i])):
                if t != "": sentences.append(t)
                t = ""
                sentences.append(text[i])
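            # Heading: starts with a number or bullet, or is a short capitalised
            # line much shorter or longer than the next line. The full-width
            # period (．) in the patterns is restored on the assumption that it
            # was flattened to "." in conversion.
            elif bool(
                    re.match(
                        r"(\d{1,2}[\.,、．]\s?(\d{1,2}[\.,、．]*)*\s?|I{1,3}V{0,1}X{0,1}[\.,、．]|V{0,1}X{0,1}I{1,3}[\.,、．]|[・ • ●])+\s",
                        text[i])) or re.match(r"\d{1,2}.\w", text[i]) or (
                            bool(re.match("[A-Z]", text[i][0]))
                            and abs(len_(text[i]) - len_(text[i + 1])) > n
                            and len_(text[i]) < n):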
                if t != "": sentences.append(t)
                t = ""
                sentences.append(text[i])
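            # Continuation: the line does not end with a period and appears to
            # run on into the next line, or a bracket is still open. The
            # full-width period (．) is restored on the same assumption as above.
            elif (
                    text[i][-1] not in ("。", ".", "．") and
                (abs(len_(text[i]) - len_(text[i + 1])) < n or
                 (len_(t + text[i]) > len_(text[i + 1]) and bool(
                     re.search(r"[。\.．]\s\d|..[。\.．]|.[。\.．]", text[i + 1][-3:])
                     or bool(re.match("[A-Z]", text[i + 1][:1]))))
                 or bool(re.match(r"\s?[a-z,\)]", text[i + 1]))
                 or bra_cnt > ket_cnt or bra_cnt_jp > ket_cnt_jp)):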
                t += text[i]
            else:
                sentences.append(t + text[i])
                t = ""
        else:
            sentences.append(t + text[i])
    return sentences

If the result is not good, try adjusting the value of n (a larger value joins more lines together; a smaller value splits them more). If the bracket counts get out of step for some reason and the text clumps together unexpectedly, set bracketDetect to False.
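When choosing n, it may help to remember that the comparison uses the widths computed by len_, not plain character counts. The sketch below restates len_ in compact form so that it runs on its own; the sample strings are hypothetical.

```python
import unicodedata


def len_(text):
    # Same rule as len_ in the listing above: full-width (F), wide (W) and
    # ambiguous (A) characters count as 2, everything else as 1.
    return sum(2 if unicodedata.east_asian_width(ch) in "FWA" else 1
               for ch in text)


print(len_("Python"))     # 6
print(len_("日本語"))      # 6 (three wide characters)
print(len_("PDFの整形"))   # 9 (3 narrow + 3 wide)
```

So with the default n=30, a Japanese line reaches the threshold at roughly 15 characters, while an English line needs about 30.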

Example of use

The sample is taken from the Python 3.8.5 documentation PDF (US-Letter paper size), tutorial.pdf, page number 5 (p.11).

The copied original text:

CHAPTER
TWO
USING THE PYTHON INTERPRETER
2.1 Invoking the Interpreter
The Python interpreter is usually installed as /usr/local/bin/python3.8 on those machines where it is available;
putting /usr/local/bin in your Unix shell’s search path makes it possible to start it by typing the command:
python3.8
to the shell.1 Since the choice of the directory where the interpreter lives is an installation option, other places are possible;
check with your local Python guru or system administrator. (E.g., /usr/local/python is a popular alternative
location.)
On Windows machines where you have installed Python from the Microsoft Store, the python3.8 command will be
available. If you have the py.exe launcher installed, you can use the py command. See setting-envvars for other ways to
launch Python.
Typing an end-of-file character (Control-D on Unix, Control-Z on Windows) at the primary prompt causes the
interpreter to exit with a zero exit status. If that doesn’t work, you can exit the interpreter by typing the following command:
quit().
The interpreter’s line-editing features include interactive editing, history substitution and code completion on systems that
support the GNU Readline library. Perhaps the quickest check to see whether command line editing is supported is typing
Control-P to the first Python prompt you get. If it beeps, you have command line editing; see Appendix Interactive
Input Editing and History Substitution for an introduction to the keys. If nothing appears to happen, or if ^P is echoed,
command line editing isn’t available; you’ll only be able to use backspace to remove characters from the current line.
The interpreter operates somewhat like the Unix shell: when called with standard input connected to a tty device, it reads
and executes commands interactively; when called with a file name argument or with a file as standard input, it reads and
executes a script from that file.
A second way of starting the interpreter is python -c command [arg] ..., which executes the statement(s) in
command, analogous to the shell’s -c option. Since Python statements often contain spaces or other characters that are
special to the shell, it is usually advised to quote command in its entirety with single quotes.
Some Python modules are also useful as scripts. These can be invoked using python -m module [arg] ...,
which executes the source file for module as if you had spelled out its full name on the command line.
When a script file is used, it is sometimes useful to be able to run the script and enter interactive mode afterwards. This
can be done by passing -i before the script.
All command line options are described in using-on-general.
1 On Unix, the Python 3.x interpreter is by default not installed with the executable named python, so that it does not conflict with a simultaneously
installed Python 2.x executable.

There is a line feed code at the end of each line. With the line feed codes shown explicitly for clarity:

CHAPTER\r\nTWO\r\nUSING THE PYTHON INTERPRETER\r\n2.1 Invoking the Interpreter\r\nThe Python interpreter is usually installed as /usr/local/bin/python3.8 on those machines where it is available;\r\nputting /usr/local/bin in your Unix shell’s search path makes it possible to start it by typing the command:\r\npython3.8\r\nto the shell.1 Since the choice of the directory where the interpreter lives is an installation option, other places are possible;\r\ncheck with your local Python guru or system administrator. (E.g., /usr/local/python is a popular alternative\r\nlocation.)\r\nOn Windows machines where you have installed Python from the Microsoft Store, the python3.8 command will be\r\navailable. If you have the py.exe launcher installed, you can use the py command. See setting-envvars for other ways to\r\nlaunch Python.\r\nTyping an end-of-file character (Control-D on Unix, Control-Z on Windows) at the primary prompt causes the\r\ninterpreter to exit with a zero exit status. If that doesn’t work, you can exit the interpreter by typing the following command:\r\nquit().\r\nThe interpreter’s line-editing features include interactive editing, history substitution and code completion on systems that\r\nsupport the GNU Readline library. Perhaps the quickest check to see whether command line editing is supported is typing\r\nControl-P to the first Python prompt you get. If it beeps, you have command line editing; see Appendix Interactive\r\nInput Editing and History Substitution for an introduction to the keys. 
If nothing appears to happen, or if ^P is echoed,\r\ncommand line editing isn’t available; you’ll only be able to use backspace to remove characters from the current line.\r\nThe interpreter operates somewhat like the Unix shell: when called with standard input connected to a tty device, it reads\r\nand executes commands interactively; when called with a file name argument or with a file as standard input, it reads and\r\nexecutes a script from that file.\r\nA second way of starting the interpreter is python -c command [arg] ..., which executes the statement(s) in\r\ncommand, analogous to the shell’s -c option. Since Python statements often contain spaces or other characters that are\r\nspecial to the shell, it is usually advised to quote command in its entirety with single quotes.\r\nSome Python modules are also useful as scripts. These can be invoked using python -m module [arg] ...,\r\nwhich executes the source file for module as if you had spelled out its full name on the command line.\r\nWhen a script file is used, it is sometimes useful to be able to run the script and enter interactive mode afterwards. This\r\ncan be done by passing -i before the script.\r\nAll command line options are described in using-on-general.\r\n1 On Unix, the Python 3.x interpreter is by default not installed with the executable named python, so that it does not conflict with a simultaneously\r\ninstalled Python 2.x executable.

It looks like this.

Now let's pass it to the function I wrote. Assuming the text above has been copied to the clipboard:

from pyperclip import paste  # function that returns the text on the clipboard

print("\n".join(textParser(paste())))

Output


CHAPTER
TWO
USING THE PYTHON INTERPRETER
2.1 Invoking the Interpreter
The Python interpreter is usually installed as /usr/local/bin/python3.8 on those machines where it is available;putting /usr/local/bin in your Unix shell’s search path makes it possible to start it by typing the command:python3.8to the shell.1 Since the choice of the directory where the interpreter lives is an installation option, other places are possible;check with your local Python guru or system administrator. (E.g., /usr/local/python is a popular alternativelocation.)On Windows machines where you have installed Python from the Microsoft Store, the python3.8 command will beavailable. If you have the py.exe launcher installed, you can use the py command. See setting-envvars for other ways tolaunch Python.
Typing an end-of-file character (Control-D on Unix, Control-Z on Windows) at the primary prompt causes theinterpreter to exit with a zero exit status. If that doesn’t work, you can exit the interpreter by typing the following command:quit().
The interpreter’s line-editing features include interactive editing, history substitution and code completion on systems thatsupport the GNU Readline library. Perhaps the quickest check to see whether command line editing is supported is typingControl-P to the first Python prompt you get. If it beeps, you have command line editing; see Appendix InteractiveInput Editing and History Substitution for an introduction to the keys. If nothing appears to happen, or if ^P is echoed,command line editing isn’t available; you’ll only be able to use backspace to remove characters from the current line.
The interpreter operates somewhat like the Unix shell: when called with standard input connected to a tty device, it readsand executes commands interactively; when called with a file name argument or with a file as standard input, it reads andexecutes a script from that file.
A second way of starting the interpreter is python -c command [arg] ..., which executes the statement(s) incommand, analogous to the shell’s -c option. Since Python statements often contain spaces or other characters that arespecial to the shell, it is usually advised to quote command in its entirety with single quotes.
Some Python modules are also useful as scripts. These can be invoked using python -m module [arg] ...,which executes the source file for module as if you had spelled out its full name on the command line.
When a script file is used, it is sometimes useful to be able to run the script and enter interactive mode afterwards. Thiscan be done by passing -i before the script.
installed Python 2.x executable.

Pretty good!

That said, a PDF this clean is naturally easy to split neatly. The function can also be used on more complicated PDFs that switch between one and two columns, so give it a try.

Summary

It is by no means perfect for every PDF, but it works reasonably well. Japanese PDFs are supported too. Text from Japanese PDFs is often distorted by OCR, in which case removing the spaces with .replace(" ", "") should clean it up (although this cannot be used when the text also contains English). As mentioned at the start, it is handy in many situations, such as when you want to translate a PDF, so please make use of it.
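A minimal sketch of the space-removal tip and its limitation follows. The sample strings are hypothetical, and it assumes the stray characters are ordinary ASCII spaces; full-width spaces would need to be removed separately.

```python
# Hypothetical OCR output with a space inserted between every character.
ocr_text = "こ れ は 日 本 語 の 文 章 で す 。"
cleaned = ocr_text.replace(" ", "")
print(cleaned)  # これは日本語の文章です。

# The same call destroys word boundaries in English text,
# which is why it cannot be used when English is mixed in.
english = "Format text copied from PDF"
squashed = english.replace(" ", "")
print(squashed)  # FormattextcopiedfromPDF
```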
