Introduction

Natural language processing has advanced remarkably in recent years and is being applied in a growing range of fields. I often work on projects that use NLP and AI, and the most tedious (but important) part of that work is pre-processing.

The main pre-processing steps needed for most tasks include:

Cleaning
Removes noise from the text, such as HTML tags and symbols
Normalization
Unifies full-width / half-width characters, uppercase / lowercase letters, etc.
**Sentence segmentation**
Detects sentence boundaries and splits the text there
Tokenization
Splits a sentence into a sequence of words
Stop word removal
Removes words that are unnecessary for the task you want to solve
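To make the steps above concrete, here is a minimal sketch that chains them using only the Python standard library. Every function name and the stop-word list are hypothetical, chosen for illustration. The regex-based tokenizer only works for space-delimited text; Japanese would need a morphological analyzer at that step.

```python
import html
import re
import unicodedata
from typing import List

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "is"}


def clean(text: str) -> str:
    """Strip HTML tags with a crude regex (not a real HTML parser) and unescape entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def normalize(text: str) -> str:
    """Unify full-width / half-width forms (NFKC) and letter case."""
    return unicodedata.normalize("NFKC", text).lower()


def split_sentences(text: str) -> List[str]:
    """Naively split at whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence: str) -> List[str]:
    """Split a sentence into runs of word characters."""
    return re.findall(r"\w+", sentence)


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(raw: str) -> List[List[str]]:
    """Run cleaning, normalization, segmentation, tokenization and stop-word removal."""
    text = normalize(clean(raw))
    return [remove_stop_words(tokenize(s)) for s in split_sentences(text)]


print(preprocess("<p>ＮＬＰ is fun! Preprocessing is a chore.</p>"))
# [['nlp', 'fun'], ['preprocessing', 'chore']]
```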

I mainly use Python, but I could not find a suitable library for **Japanese sentence segmentation**, so I ended up writing similar code every time. I figured there must be about 100 people in the world with the same problem, so at the beginning of 2019 I decided to write my own library and publish it as OSS. Time passed, though: I could not secure enough time or motivation, and the work was delayed. I was finally able to get started by making an Advent Calendar article my deadline.

Specific sentence break issues

The following simple rules are probably the most commonly used for splitting sentences.

Split at line breaks
Split at a symbol (。 ! ? etc.)
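The two rules above can be sketched in a few lines. This is only an illustration with a hypothetical function name; the set of sentence-ending marks is an assumption.

```python
import re
from typing import List


def naive_split(text: str) -> List[str]:
    """Split at line breaks, then after each run of sentence-ending marks."""
    sentences = []
    for line in text.splitlines():
        # Each match is some text followed by the mark(s) that end it.
        for piece in re.findall(r"[^。!?]+[。!?]*", line):
            if piece.strip():
                sentences.append(piece.strip())
    return sentences


# "It is sunny today. I will go for a walk! / Will you come too?"
print(naive_split("今日は晴れです。散歩に行きます!\nあなたも来ますか?"))
# ['今日は晴れです。', '散歩に行きます!', 'あなたも来ますか?']
```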

However, there are many actual documents that cannot be separated well by the above simple rules.

Punctuation marks and exclamation marks inside "" and ()

For example, if you simply split text such as `I answered, "Yes. That's right."` at sentence-ending punctuation, it is split as follows:

I answered, "Yes.
That's right.
"

That may be acceptable in some situations, but you may want to treat `I answered, "Yes. That's right."` as a single sentence.
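One possible way to keep quoted text together is to track bracket depth and split only at depth zero. The sketch below assumes that approach; the function name and the sets of brackets and ending marks are hypothetical, and this is not necessarily how any of the libraries discussed here work.

```python
from typing import List

OPENERS = "「『(（"   # assumed opening brackets
CLOSERS = "」』)）"   # assumed closing brackets
ENDERS = "。!?！？"   # assumed sentence-ending marks


def split_outside_brackets(text: str) -> List[str]:
    """Split after sentence-ending marks, but only at bracket depth 0."""
    sentences: List[str] = []
    buf: List[str] = []
    depth = 0
    ended = False  # True once a top-level ending mark has been seen
    for ch in text:
        # Flush when a run of ending marks (e.g. "!?") is over.
        if ended and ch not in ENDERS:
            sentences.append("".join(buf).strip())
            buf, ended = [], False
        buf.append(ch)
        if ch in OPENERS:
            depth += 1
        elif ch in CLOSERS:
            depth = max(depth - 1, 0)  # tolerate unbalanced closers
        elif ch in ENDERS and depth == 0:
            ended = True
    tail = "".join(buf).strip()
    if tail:
        sentences.append(tail)
    return sentences


# "I answered, 'Yes. That's right.' Let's move on."
print(split_outside_brackets("「はい。そうです。」と答えた。次へ進みます。"))
# ['「はい。そうです。」と答えた。', '次へ進みます。']
```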

Line breaks in the middle of the sentence

For example, a sentence may be broken across lines so that it fits on one screen, as shown below (this is especially common in corporate documents).

In natural language processing, ~ omitted ~
It is commonly used.

If this is split at line breaks it becomes two sentences, but you may want to treat `In natural language processing, ~ omitted ~ it is commonly used.` as one sentence.

In the example above, you can get by if you delete the line breaks first and then split at punctuation marks, but things become much more troublesome when **the text contains sentences that have no closing punctuation**. (~~Please add punctuation...~~)
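The sketch below illustrates both halves of that statement, under the assumption that lines are joined with no separator (reasonable for Japanese). The function name and sample strings are hypothetical.

```python
import re
from typing import List


def join_then_split(text: str) -> List[str]:
    """Delete line breaks, then split after sentence-ending marks."""
    joined = "".join(line.strip() for line in text.splitlines())
    return re.findall(r"[^。!?]+[。!?]*", joined)


# A sentence wrapped mid-way is restored to one sentence.
wrapped = "自然言語処理では、前処理が\n一般的に行われます。"
print(join_then_split(wrapped))
# ['自然言語処理では、前処理が一般的に行われます。']

# A first line with no closing punctuation is wrongly merged into the next sentence.
no_punct = "お疲れ様です\n資料を送ります。"
print(join_then_split(no_punct))
# ['お疲れ様です資料を送ります。']
```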

Quote block for emails, etc.

>>I was planning to go to the barber shop tomorrow, but "in a hurry
>>Change the schedule. Therefore, at the meeting
>>Let me change the schedule. ".

I've acknowledged.

This is a case where line breaks, and unneeded symbols at the start of each line, fall in the middle of a sentence. I suspect it is the most common case in corporate documents (a subjective impression). The easiest approach is to remove the symbols and line breaks first and then segment the text. Occasionally, however, you want to join the lines and remove the unneeded symbols while keeping the information that the text is a quote block.
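As a rough sketch of that last requirement, the code below joins consecutive quoted lines, drops the `>` markers, and keeps a flag recording whether each block was a quote. The function name and return shape are hypothetical and are not the API of any library mentioned here.

```python
import re
from typing import List, Tuple

# Optional leading spaces, one or more '>' markers, then the line body.
QUOTE_RE = re.compile(r"^\s*(>+)\s*(.*)$")


def join_quote_blocks(text: str) -> List[Tuple[bool, str]]:
    """Return (is_quote, text) pairs, joining consecutive quoted lines."""
    blocks: List[Tuple[bool, str]] = []
    in_quote = False  # True while the previous line was a quoted line
    for line in text.splitlines():
        m = QUOTE_RE.match(line)
        if m:
            body = m.group(2).strip()
            if in_quote:
                # Continuation of the same quote block: append without the marker.
                blocks[-1] = (True, blocks[-1][1] + body)
            else:
                blocks.append((True, body))
            in_quote = True
        else:
            in_quote = False
            if line.strip():
                blocks.append((False, line.strip()))
    return blocks


# ">> I planned to go to the barber tomorrow, but / >> I will change the plan. // Understood."
mail = ">>明日は床屋に行く予定でしたが、\n>>予定を変更します。\n\n承知しました。"
print(join_quote_blocks(mail))
# [(True, '明日は床屋に行く予定でしたが、予定を変更します。'), (False, '承知しました。')]
```

The joined text of each block can then be passed to whatever sentence splitter you use, with the flag carried alongside.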

Related techniques

GiNZA

GiNZA is a library that can be used to segment Japanese sentences in Python. Sentence segmentation with GiNZA looks like this:

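import spacy

# Load the GiNZA model for Japanese
nlp = spacy.load('ja_ginza')
# Double-quoted literal with escapes, so the apostrophe and inner quotes are valid Python
doc = nlp("I was told, \"I can't answer your thoughts. I want you to hit others.\" Stunned\n I had no choice but to stand there, but I still want to believe!")
# doc.sents yields the detected sentences one by one
for sent in doc.sents:
    print(sent)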

I said, "I can't answer your thoughts.
I want you to hit the other.
"They said!
I was stunned and had no choice but to stand there
Still I want to believe!

The advantage of GiNZA is that, because it performs proper dependency parsing, it detects sentence boundaries with high accuracy even when a line break falls mid-sentence or punctuation is omitted. It is heavyweight, but I think it is a good option if you also use GiNZA's other features.

sentence-splitter

There is also sentence-splitter, a tool built with Node.js.

echo -e "I was told, \"I can't answer your thoughts. I want you to hit others.\" Stunned\n I had no choice but to stand there, but I still want to believe!" | sentence-splitter

Sentence 0:I was told, "I can't answer your thoughts. I want you to hit others."
Sentence 1:Stunned
I had no choice but to stand there, but I still want to believe!

This tool also performs fairly advanced analysis using the parser used inside textlint, so it splits sentences accurately even when a line break falls mid-sentence. I also like how it handles "" and similar quotation marks, and its processing speed is quite fast. (If it weren't built on Node.js, I would have adopted it.)

Pragmatic Segmenter

There is also Pragmatic Segmenter, a Ruby library. It is a rule-based sentence segmentation library whose major advantage is that it supports **multiple languages**. It is also attractive because it does no complicated analysis and is fast.

Its rules for splitting Japanese sentences are close to my taste, so the goal for my tool is "to segment Japanese sentences as well as or better than Pragmatic Segmenter".

A Live Demo is available for this tool, and the results I tried there are shown below.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<wrapper>
<s>I was told, "I can't answer your thoughts. I want you to hit others."</s>
<s>Stunned</s>
<s>I had no choice but to stand there, but I still want to believe!</s>
</wrapper>

By the way, a Python port of this Pragmatic Segmenter is being developed as pySBD. Unfortunately, it seems that the rules for Japanese have not been ported yet.

What I made

The library I made is published here: https://github.com/wwwcojp/ja_sentence_segmenter

I developed the library with the following goals.

Handle sentence segmentation in the cases listed above
Flexible enough to be customized to some extent
Minimal dependencies
Moderate processing speed is acceptable, but memory usage should stay low
Type hints for static checking
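To show how several of these goals can fit together, here is a minimal sketch of a pipeline built from generator stages with type hints: stages are swappable via `functools.partial`, and sentences are produced lazily so that intermediate lists are never built. All names here (`make_lazy_pipeline`, `split_lines`, `split_after`) are hypothetical and are not the library's actual implementation.

```python
from functools import partial
from typing import Callable, Iterable, Iterator

# A stage consumes a stream of strings and lazily yields a new stream.
Stage = Callable[[Iterable[str]], Iterator[str]]


def make_lazy_pipeline(*stages: Stage) -> Callable[[str], Iterator[str]]:
    """Chain stages left to right (hypothetical helper for illustration)."""
    def run(text: str) -> Iterator[str]:
        stream: Iterable[str] = [text]
        for stage in stages:
            stream = stage(stream)
        return iter(stream)
    return run


def split_lines(texts: Iterable[str]) -> Iterator[str]:
    """Yield each non-empty line, stripped of surrounding whitespace."""
    for text in texts:
        for line in text.splitlines():
            if line.strip():
                yield line.strip()


def split_after(texts: Iterable[str], marks: str = "。") -> Iterator[str]:
    """Yield a piece each time one of the given marks is reached."""
    for text in texts:
        buf = ""
        for ch in text:
            buf += ch
            if ch in marks:
                yield buf
                buf = ""
        if buf:
            yield buf


# Customize a stage with partial, then compose the stages.
segmenter = make_lazy_pipeline(split_lines, partial(split_after, marks="。!?"))
print(list(segmenter("晴れです。散歩します!\n行きますか?")))
# ['晴れです。', '散歩します!', '行きますか?']
```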

How to use

Installation

It is published on PyPI (https://pypi.org/project/ja-sentence-segmenter/), so you can install it easily with pip. It supports Python 3.6 and above and currently has no dependencies.

$ pip install ja-sentence-segmenter

Run

Punctuation marks and exclamation marks inside "" and (), plus line breaks in the middle of a sentence

import functools

from ja_sentence_segmenter.common.pipeline import make_pipeline
from ja_sentence_segmenter.concatenate.simple_concatenator import concatenate_matching
from ja_sentence_segmenter.normalize.neologd_normalizer import normalize
from ja_sentence_segmenter.split.simple_splitter import split_newline, split_punctuation

# Split at the sentence-ending marks 。 ! ?
split_punc2 = functools.partial(split_punctuation, punctuations=r"。!?")
# Join a line that ends in 「て」 with the line that follows it
concat_tail_te = functools.partial(concatenate_matching, former_matching_rule=r"^(?P<result>.+)(て)$", remove_former_matched=False)
# Stages run in the order given
segmenter = make_pipeline(normalize, split_newline, concat_tail_te, split_punc2)

text1 = """
I was told, "I can't answer your thoughts. I want you to hit others." Stunned
I had no choice but to stand there, but I still want to believe!
"""
print(list(segmenter(text1)))

['I was told, "I can't answer your thoughts. I want you to hit others."!', 'I was stunned and had no choice but to stand there, but I still want to believe!']

Email quote block

import functools

from ja_sentence_segmenter.common.pipeline import make_pipeline
from ja_sentence_segmenter.concatenate.simple_concatenator import concatenate_matching
from ja_sentence_segmenter.normalize.neologd_normalizer import normalize
from ja_sentence_segmenter.split.simple_splitter import split_newline, split_punctuation

split_punc2 = functools.partial(split_punctuation, punctuations=r"。!?")
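# Join consecutive lines that both start with ">" quote markers:
# keep the marker on the former line, remove it from the latter.
concat_mail_quote = functools.partial(concatenate_matching,
  former_matching_rule=r"^(\s*[>]+\s*)(?P<result>.+)$",
  latter_matching_rule=r"^(\s*[>]+\s*)(?P<result>.+)$",
  remove_former_matched=False,
  remove_latter_matched=True)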
segmenter = make_pipeline(normalize, split_newline, concat_mail_quote, split_punc2)

text2 = """
>>I was planning to go to the barber shop tomorrow, but "in a hurry
>>Change the schedule. Therefore, at the meeting
>>Let me change the schedule. ".

I've acknowledged.
"""

print(list(segmenter(text2)))

['>>I was planning to go to the barber shop tomorrow, but he said, "I will change my schedule in a hurry. Please let me change the schedule of the meeting."', 'I've acknowledged.']

Future tasks

I'm nearly out of energy, so I'll finish by listing future tasks.

Comparison with the other tools
I want to make a table comparing features and performance
Documentation
To be honest, I think the library is too general-purpose to use intuitively, so I would like to prepare a collection of recipes organized by purpose.
Additional features
More advanced sentence segmentation using a dependency parsing library, etc.
Building out the library to cover other pre-processing as well

Impressions

~~Why did I do something so unglamorous for an Advent Calendar article...~~

[PYTHON] I made a library to separate Japanese sentences nicely