[PYTHON] I made a library to separate Japanese sentences nicely

Introduction

In recent years, natural language processing technology has advanced remarkably and is being applied in a wide range of fields. I often do work that uses natural language processing and AI, and the most tedious (but important) part of that work is pre-processing.

The main pre-processing steps you will run for most tasks include:

I mainly use Python, but I could not find a suitable library for **Japanese sentence splitting**, so I ended up writing similar code every time. I figured there must be about 100 people in the world with the same problem, so I decided to write my own library and publish it as OSS. That was at the beginning of 2019. I could not secure enough time or motivation and the work kept slipping, but I finally got started by giving myself a deadline: writing an article for the Advent Calendar.

Specific sentence break issues

I think the following are commonly used as simple sentence delimiters.

However, many real-world documents cannot be split well with such simple rules.

Punctuation and exclamation marks inside "" and ()

For example, if you simply split text such as: I answered, "Yes, that's right." at every punctuation mark, the text is cut at the period inside the quotation marks, in the middle of the quoted speech.

That is fine in some situations, but you may want to treat the whole of: I answered, "Yes, that's right." as one sentence.
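The difference between the two behaviors can be sketched in a few lines. This is only an illustration of the problem, not the code of any library mentioned in this article; the sample sentence and the function names `naive_split` and `bracket_aware_split` are made up for the example.

```python
import re

# Hypothetical sample text: a quotation that contains two periods.
text = "私は「はい。そうです。」と答えた。"


def naive_split(s):
    """Split after every sentence-ending mark, ignoring brackets."""
    return re.findall(r"[^。!?！？]+[。!?！？]*", s)


def bracket_aware_split(s, opening="「(（", closing="」)）", enders="。!?！？"):
    """Split after a sentence-ending mark only when outside every bracket pair."""
    sentences, buf, depth = [], [], 0
    for ch in s:
        buf.append(ch)
        if ch in opening:
            depth += 1
        elif ch in closing:
            depth = max(depth - 1, 0)
        elif ch in enders and depth == 0:
            sentences.append("".join(buf))
            buf = []
    if buf:  # keep trailing text that has no sentence-ending mark
        sentences.append("".join(buf))
    return sentences


print(naive_split(text))
# ['私は「はい。', 'そうです。', '」と答えた。']
print(bracket_aware_split(text))
# ['私は「はい。そうです。」と答えた。']
```

Tracking bracket depth is one simple way to get the second behavior. It assumes brackets are balanced, which real documents do not always guarantee.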

Line breaks in the middle of the sentence

For example, a line break may be inserted in the middle of a sentence, as shown below, because the sentence does not fit on one screen (this is especially common in internal company documents).

In natural language processing, ~ omitted ~
It is commonly used.

If you split this at the line break, it becomes two sentences, but you may want to extract "In natural language processing, ~ omitted ~ it is commonly used." as a single sentence.

In the example above you can get by with deleting the line breaks first and then splitting at punctuation marks, but things get much more troublesome when **the text also contains sentences that have no closing punctuation**. (~~Please just add the punctuation mark...~~)
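A small sketch of the "delete line breaks, then split at punctuation" approach shows where it stops working. The sample text and the function name `join_then_split` are assumptions made for this illustration.

```python
import re

# Hypothetical sample: a hard-wrapped sentence, then a line ("以上") with no closing period.
wrapped = "自然言語処理では、前処理が\n一般的に使われます。\n以上\nよろしくお願いします。"


def join_then_split(s):
    """Delete every line break, then split after each 。"""
    joined = s.replace("\n", "")
    return re.findall(r"[^。]+。?", joined)


print(join_then_split(wrapped))
# ['自然言語処理では、前処理が一般的に使われます。', '以上よろしくお願いします。']
```

The wrapped sentence is restored correctly, but the unpunctuated line is silently glued to the sentence that follows it.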

Quote block for emails, etc.

>>I was planning to go to the barber shop tomorrow, but "in a hurry
>>Change the schedule. Therefore, at the meeting
>>Let me change the schedule. ".

I've acknowledged.

This is the case where line breaks, plus unneeded symbols at the start of each line, appear in the middle of a sentence. There is a theory that this is the most common case in corporate documents (subjective). The easiest approach is to remove the symbols and line breaks first and then process the text. However, on rare occasions you want to join the lines while removing the unneeded symbols yet keeping the information that they form a quoted block.
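One possible way to join quoted lines while remembering that they were quoted is sketched below. It is an assumption-laden illustration, not how any particular library does it: the `QUOTE` pattern, the function `merge_quote_lines`, and the sample mail are all invented for the example.

```python
import re
from itertools import groupby

# Leading quote markers such as ">" or ">>", with optional surrounding spaces.
QUOTE = re.compile(r"^\s*(>+)\s*(.*)$")


def merge_quote_lines(text):
    """Yield (is_quote, merged_text) for each run of quoted or unquoted lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for is_quote, group in groupby(lines, key=lambda ln: bool(QUOTE.match(ln))):
        if is_quote:
            # Drop the markers and keep only the quoted text.
            parts = [QUOTE.match(ln).group(2) for ln in group]
        else:
            parts = [ln.strip() for ln in group]
        yield is_quote, "".join(parts)


# Hypothetical sample mail: a two-line quote followed by a reply.
mail = ">>明日は床屋に行く予定でしたが、「急いで\n>>予定を変更します。」\n\n承知しました。"
print(list(merge_quote_lines(mail)))
# [(True, '明日は床屋に行く予定でしたが、「急いで予定を変更します。」'), (False, '承知しました。')]
```

The boolean flag carries the "this was a quote block" information forward, so a later step can split each merged block into sentences without losing it.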

Related techniques

GiNZA

GiNZA is a library that can be used to split Japanese text into sentences in Python. Sentence splitting with GiNZA can be done as follows.

import spacy

nlp = spacy.load("ja_ginza")
doc = nlp("I was told, \"I can't answer your thoughts. I want you to hit others.\" Stunned\n I had no choice but to stand there, but I still want to believe!")
# doc.sents yields one span per detected sentence
for sent in doc.sents:
    print(sent)

I said, "I can't answer your thoughts.
I want you to hit the other.
"They said!
I was stunned and had no choice but to stand there
Still I want to believe!

The advantage of GiNZA is that, because it performs proper dependency parsing, it can detect sentence boundaries with high accuracy even when a line break falls in the middle of a sentence or punctuation is omitted. It is a heavyweight, but I think it is a good option if you also use GiNZA's other features.

sentence-splitter

There is also sentence-splitter, although it is a tool built on Node.js.

echo -e "I was told, \"I can't answer your thoughts. I want you to hit others.\" Stunned\n I had no choice but to stand there, but I still want to believe!" | sentence-splitter

Sentence 0:I was told, "I can't answer your thoughts. I want you to hit others."
Sentence 1:Stunned
I had no choice but to stand there, but I still want to believe!

This tool also does fairly advanced analysis, using the parser that textlint uses internally, so it splits with high accuracy even when a line break falls in the middle of a sentence. I also like how it handles "" and similar marks, and its processing is quite fast, which makes it attractive. (If it were not Node.js, I would have adopted it.)

Pragmatic Segmenter

There is also Pragmatic Segmenter, although it is a Ruby library. It is a rule-based sentence splitting library, and its major advantage is that it supports **multiple languages**. It is also attractive because it does no complicated analysis and is therefore fast.

Its rules for splitting Japanese sentences are close to my taste, so the goal for my own tool is "to split Japanese sentences as well as or better than Pragmatic Segmenter".

A Live Demo is available for this tool, and the results I tried there are shown below.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<wrapper>
<s>I was told, "I can't answer your thoughts. I want you to hit others."</s>
<s>Stunned</s>
<s>I had no choice but to stand there, but I still want to believe!</s>
</wrapper>

By the way, a Python port of this Pragmatic Segmenter is being developed as pySBD. Unfortunately, it seems that the rules for Japanese have not been ported yet.

What I made

So, the library I made is published here: https://github.com/wwwcojp/ja_sentence_segmenter

In creating the library, I developed it with the following goals.

How to use

Installation

It is published on PyPI (https://pypi.org/project/ja-sentence-segmenter/), so you can easily install it with pip. It supports Python 3.6 and above and has no dependencies so far.

$ pip install ja-sentence-segmenter

Run

Punctuation and exclamation marks inside "" and (), plus line breaks in the middle of a sentence

import functools

from ja_sentence_segmenter.common.pipeline import make_pipeline
from ja_sentence_segmenter.concatenate.simple_concatenator import concatenate_matching
from ja_sentence_segmenter.normalize.neologd_normalizer import normalize
from ja_sentence_segmenter.split.simple_splitter import split_newline, split_punctuation

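# Restrict the sentence-ending marks used for splitting.
split_punc2 = functools.partial(split_punctuation, punctuations=r"。!?")
# Rule for lines that end in て, i.e. a sentence that continues on the next line.
concat_tail_te = functools.partial(concatenate_matching, former_matching_rule=r"^(?P<result>.+)(て)$", remove_former_matched=False)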
segmenter = make_pipeline(normalize, split_newline, concat_tail_te, split_punc2)

text1 = """
I was told, "I can't answer your thoughts. I want you to hit others." Stunned
I had no choice but to stand there, but I still want to believe!
"""
print(list(segmenter(text1)))

['I was told, "I can't answer your thoughts. I want you to hit others."!', 'I was stunned and had no choice but to stand there, but I still want to believe!']
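To make the structure of the example above easier to follow, here is a rough sketch of how a pipeline of this kind can be composed from generator stages. It is a hypothetical stand-in written for this explanation, not the implementation of `make_pipeline` or `concatenate_matching` in ja_sentence_segmenter. The names `compose`, `split_lines`, and `join_if_ends_with_te` are invented, and the joining rule (a line ending in て is joined to the next line) is only what the name `concat_tail_te` suggests.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

Stage = Callable[[Iterable[str]], Iterator[str]]


def compose(*stages: Stage) -> Stage:
    """Chain stages left to right (hypothetical, not the library's make_pipeline)."""
    def run(texts):
        return reduce(lambda acc, stage: stage(acc), stages, iter(texts))
    return run


def split_lines(texts):
    """Yield every non-empty line of every input text."""
    for t in texts:
        for line in t.splitlines():
            if line.strip():
                yield line.strip()


def join_if_ends_with_te(lines):
    """Join a line that ends in て with the line that follows it."""
    buf = ""
    for line in lines:
        buf += line
        if not buf.endswith("て"):
            yield buf
            buf = ""
    if buf:  # flush a trailing line that ended in て
        yield buf


pipeline = compose(split_lines, join_if_ends_with_te)
# Hypothetical sample input
print(list(pipeline(["呆然として\nその場に立っていた。\n次の文。"])))
# ['呆然としてその場に立っていた。', '次の文。']
```

Because every stage consumes and produces an iterable of strings, stages can be reordered or swapped freely, which is the same idea behind configuring each step with `functools.partial` before building the pipeline.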

Email quote block

import functools

from ja_sentence_segmenter.common.pipeline import make_pipeline
from ja_sentence_segmenter.concatenate.simple_concatenator import concatenate_matching
from ja_sentence_segmenter.normalize.neologd_normalizer import normalize
from ja_sentence_segmenter.split.simple_splitter import split_newline, split_punctuation

split_punc2 = functools.partial(split_punctuation, punctuations=r"。!?")
concat_mail_quote = functools.partial(concatenate_matching,
  former_matching_rule=r"^(\s*[>]+\s*)(?P<result>.+)$",
  latter_matching_rule=r"^(\s*[>]+\s*)(?P<result>.+)$",
  remove_former_matched=False,
  remove_latter_matched=True)
segmenter = make_pipeline(normalize, split_newline, concat_mail_quote, split_punc2)

text2 = """
>>I was planning to go to the barber shop tomorrow, but "in a hurry
>>Change the schedule. Therefore, at the meeting
>>Let me change the schedule. ".

I've acknowledged.
"""

print(list(segmenter(text2)))

['>>I was planning to go to the barber shop tomorrow, but he said, "I will change my schedule in a hurry. Please let me change the schedule of the meeting."', 'I've acknowledged.']

Future tasks

I'm almost exhausted, so I'll finish by stating future issues.

Impressions

~~Why did I pick such an unglamorous topic for an Advent Calendar article...~~
