[PYTHON] 100 Language Processing Knock-51: Word Clipping

This is a record of knock 51, "Cutting out words", from "Chapter 6: Processing English Text" of the 100 Language Processing Knocks 2015 (.tohoku.ac.jp/nlp100/#ch6). Technically it is almost the same as the previous knock: a simple task that takes fewer than 10 lines of code.

Reference link

| Link | Remarks |
|---|---|
| 051.Cut out words.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks:51 | Source from which many parts of the code were copied |

Environment

| Type | Version | Contents |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Using Python 3.8.1 on pyenv; packages are managed with venv |

Chapter 6: Processing English Text

Content of study

An overview of basic natural language processing techniques through English text processing with Stanford Core NLP.

Stanford Core NLP, stemming, part-of-speech tagging, named entity recognition, coreference resolution, dependency parsing, phrase structure analysis, S-expressions

Knock content

For the English text (nlp.txt), execute the following processing.

51. Cutting out words

Treating whitespace as word delimiters, take the output of knock 50 as input and output one word per line. Output a blank line at the end of each sentence.
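Read literally, the task only asks for splitting on whitespace. The following is a minimal sketch of that literal reading, assuming the sentences are already available as a list of strings. The function name words_per_line is hypothetical and does not come from the original notebook. Unlike the answer program in this article, this sketch leaves punctuation attached to the words.

```python
def words_per_line(sentences):
    """Return one word per line, with a blank line after each sentence.

    Hypothetical helper for illustration: splits on whitespace only,
    so punctuation such as a trailing period stays attached to the word.
    """
    out = []
    for sentence in sentences:
        words = sentence.split()  # split on any run of whitespace
        if not words:
            continue              # ignore empty sentences
        out.extend(words)
        out.append('')            # empty entry becomes the blank line
    if not out:
        return ''
    return '\n'.join(out) + '\n'


if __name__ == '__main__':
    print(words_per_line(['Natural language processing.', 'From Wikipedia']), end='')
```

Whether punctuation should be removed is a design choice the task statement leaves open. The answer program strips it with a regular expression.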

Answer

Answer Program [051. Word Clipping.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/051.%E5%8D%98%E8%AA%9E%E3%81%AE%E5%88%87%E3%82%8A%E5%87%BA%E3%81%97.ipynb)

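import re

# Read the sentence-per-line output of knock 50 and write one word per line.
with open('./050.result.txt') as file_in, \
     open('./051.result.txt', 'w') as file_out:
    for line in file_in:
        if line != '\n':  # skip empty input lines
            line = re.sub(r'''
                         [.;:?!,]*  # zero or more of . ; : ? ! ,
                         \s         # one whitespace character
                       ''', '\n', line, flags=re.VERBOSE)
            # line still ends with '\n'; print adds one more,
            # which produces the blank line after each sentence
            print(line, file=file_out)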

Answer commentary

Regular expressions

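As in the previous knock, the processing uses a regular expression. This time, each whitespace character is replaced with a line break. The pattern is simpler than before because it needs no positive lookahead or lookbehind assertions. Any punctuation immediately before the whitespace is replaced along with it. Each line still ends with a newline after the substitution and print adds another, which produces the blank line after each sentence.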
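To see the substitution in isolation, here is a small sketch that applies the same idea to a single string. The sample sentence and the names PATTERN and split_words are made up for illustration. The character class is written without the | separators, since inside brackets a | is matched literally.

```python
import re

# Zero or more punctuation marks followed by exactly one whitespace character.
PATTERN = re.compile(r'[.;:?!,]*\s')


def split_words(sentence):
    """Replace each punctuation-plus-whitespace run with a newline."""
    return PATTERN.sub('\n', sentence)


if __name__ == '__main__':
    # Hypothetical sample line, ending with a newline as it would when read from a file.
    sample = 'computer science, artificial intelligence.\n'
    print(repr(split_words(sample)))
```

The comma before the space and the period before the trailing newline are both consumed by the match, so only the words and newlines remain.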

Output result (execution result)

Running the program writes the following to 051.result.txt (excerpt from the first 20 lines).


Natural
language
processing

From
Wikipedia
the
free
encyclopedia

Natural
language
processing
(NLP)
is
a
field
of
computer
science
