This is a record of knock 50, "Sentence break", from "Chapter 6: Processing English Text" of 100 Language Processing Knocks 2015 (.tohoku.ac.jp/nlp100/#ch6). Compared with the difficult knock 49, this one is very easy and feels like a short break. It splits the text into sentences using a regular expression.

Reference link

Link	Remarks
050. Sentence break.ipynb	GitHub link to the answer program
100 amateur language processing knocks:50	Source from which many parts of the code were copied

Environment

Type	Version	Contents
OS	Ubuntu 18.04.01 LTS	Running as a virtual machine
pyenv	1.2.16	I use pyenv because I sometimes use multiple Python environments
Python	3.8.1	Python 3.8.1 on pyenv. Packages are managed using venv

Chapter 6: Processing English Text

Content of study

This chapter gives an overview of basic natural language processing techniques through English text processing with Stanford Core NLP.

Stanford Core NLP, stemming, part-of-speech tagging, named entity recognition, coreference resolution, dependency parsing, phrase structure analysis, S-expressions

Knock content

For the English text (nlp.txt), execute the following processing.

50. Sentence break

Treat the pattern (. or ; or : or ? or !) → whitespace character → uppercase letter as a sentence delimiter, and output the input document with one sentence per line.

Answer

Answer program [050. Sentence break.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/050.%E6%96%87%E5%8C%BA%E5%88%87%E3%82%8A.ipynb)

import re

with open('./nlp.txt') as file_in, \
     open('./050.result.txt', 'w') as file_out:
    for line in file_in:
        if line != '\n':
            line = re.sub(r'''
                         (?<=[.;:?!])  # Positive lookbehind: . or ; or : or ? or !
                         \s            # Whitespace character (the part replaced by a newline)
                         (?=[A-Z])     # Positive lookahead: an uppercase letter
                       ''', '\n', line, flags=re.VERBOSE)
            print(line.rstrip(), file=file_out)
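
To try the same substitution without nlp.txt, here is a minimal sketch that applies the pattern to an in-memory string. The function name `split_sentences` and the sample text are hypothetical, introduced only for illustration; they are not part of the answer program.

```python
import re

# Same idea as the answer program: punctuation, one whitespace character,
# then an uppercase letter. Only the whitespace is matched.
SENTENCE_BREAK = re.compile(r'(?<=[.;:?!])\s(?=[A-Z])')


def split_sentences(text):
    """Replace each sentence break with a newline and return the pieces.

    Assumes `text` contains no newlines of its own; any that exist
    would also act as split points.
    """
    return SENTENCE_BREAK.sub('\n', text).split('\n')


sample = ('NLP is hard. Is it fun? Yes; Python helps: '
          'Regular expressions work. but lowercase does not split.')

for sentence in split_sentences(sample):
    print(sentence)
```

The last piece stays in one line because the period is followed by a lowercase letter, which the knock's rule does not treat as a sentence delimiter.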

Answer commentary

Positive lookahead / lookbehind

This time the regular expression uses positive lookahead and positive lookbehind assertions. The text they check is not included in the match (here, the part that gets replaced); it only serves as a search condition. For more information, please refer to ["Basics and Tips for Python Regular Expressions Learned from Zero"](https://qiita.com/FukuharaYohei/items/459f27f0d7bbba551af7#%E5%85%88%E8%AA%AD%E3%81%BF%E5%BE%8C%E8%AA%AD%E3%81%BF%E3%82%A2%E3%82%B5%E3%83%BC%E3%82%B7%E3%83%A7%E3%83%B3).
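
As a small illustration of why the assertions matter, the sketch below compares a pattern without lookaround against one with it. The sample string and the variable names `consumed` and `preserved` are made up for this example.

```python
import re

text = 'It works. Really!'

# Without lookaround, the punctuation and the uppercase letter are part of
# the match, so the replacement removes them along with the whitespace.
consumed = re.sub(r'[.;:?!]\s[A-Z]', '\n', text)

# With lookbehind and lookahead, only the whitespace is matched and replaced.
preserved = re.sub(r'(?<=[.;:?!])\s(?=[A-Z])', '\n', text)

print(repr(consumed))   # 'It works\neally!'
print(repr(preserved))  # 'It works.\nReally!'
```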

Output result (execution result)

When the program is executed, the following result is output (only the first 10 lines are shown).

050.result.txt (only the first 10 lines)


Natural language processing
From Wikipedia, the free encyclopedia
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
As such, NLP is related to the area of humani-computer interaction.
Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.
History
The history of NLP generally starts in the 1950s, although work can be found from earlier periods.
In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English.
The authors claimed that within three or five years, machine translation would be a solved problem.
