[PYTHON] 100 Language Processing Knock-52: Stemming

Language processing 100 knocks 2015 "Chapter 6: Processing English texts" It is a record of 52nd "Stemming" of .tohoku.ac.jp/nlp100/#ch6). ** Stemming that you will often use in language processing **. It is actually used in the machine learning knock of 71st subsequent knock. It's technically very easy because it just calls a function.

Reference link

Link Remarks
052.Stemming.ipynb Answer program GitHub link
100 amateur language processing knocks:52 Copy and paste source of many source parts

environment

type version Contents
OS Ubuntu18.04.01 LTS It is running virtually
pyenv 1.2.16 I use pyenv because I sometimes use multiple Python environments
Python 3.8.1 python3 on pyenv.8.I'm using 1
Packages are managed using venv

In the above environment, I am using the following additional Python packages. Just install with regular pip. This time, I didn't use the stemming package specified by knocking. It hasn't been updated since 2010, and now nltk seems more common.

type version
nltk 3.4.5

Chapter 6: Processing English Text

content of study

An overview of various basic technologies of natural language processing through English text processing using Stanford Core NLP.

Stanford Core NLP, Stemming, Part-of-speech tagging, Named entity recognition, Co-reference analysis, Parsing analysis, Phrase structure analysis, S-expressions

Knock content

For the English text (nlp.txt), execute the following processing.

52. Stemming

Take the output of> 51 as input, apply Porter's stemming algorithm, and output the word and stem in tab-delimited format. In Python, use the stemming module as an implementation of Porter's stemming algorithm.

Problem supplement (about "stemming")

"Stemming" is the stem, which refers to the unchanged front part of a word (eg, Natural's stemming is natur). Stemming will be used later in 71st. There are several types of "stemming", and this time I use Porter's algorithm (this seems to be famous). If you want to know more, please check it out by google.

Answer

Answer Program [052. Stemming.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82 % AD% E3% 82% B9% E3% 83% 88% E3% 81% AE% E5% 87% A6% E7% 90% 86 / 052.% E3% 82% B9% E3% 83% 86% E3% 83% 9F% E3% 83% B3% E3% 82% B0.ipynb)

import re

from nltk.stem.porter import PorterStemmer as PS

ps = PS()

with open('./051.result.txt') as file_in, \
     open('./052.result.txt', 'w') as file_out:
    for token in file_in:
        if token != '\n':
            print(token.rstrip(), '\t', ps.stem(token.rstrip()), file=file_out)

Answer commentary

The program is too short to explain. You can stem it just by doing ps.stem (), and it's very easy to call it.

Output result (execution result)

When the program is executed, the following result is output (excerpt from the first 30 lines).

text:052.result.txt(Excerpt from the first 30 lines)


Natural 	 natur
language 	 languag
processing 	 process
From 	 from
Wikipedia 	 wikipedia
the 	 the
free 	 free
encyclopedia 	 encyclopedia
Natural 	 natur
language 	 languag
processing 	 process
(NLP) 	 (nlp)
is 	 is
a 	 a
field 	 field
of 	 of
computer 	 comput
science 	 scienc
artificial 	 artifici
intelligence 	 intellig
and 	 and
linguistics 	 linguist
concerned 	 concern
with 	 with
the 	 the
interactions 	 interact
between 	 between
computers 	 comput
and 	 and
human 	 human

Recommended Posts

100 Language Processing Knock-52: Stemming
100 Language Processing Knock (2020): 28
100 Language Processing Knock (2020): 38
100 language processing knock 00 ~ 02
100 language processing knock 2020 [00 ~ 39 answer]
100 language processing knock 2020 [00-79 answer]
100 language processing knock 2020 [00 ~ 69 answer]
100 Language Processing Knock 2020 Chapter 1
100 Amateur Language Processing Knock: 17
100 language processing knock 2020 [00 ~ 49 answer]
100 Language Processing Knock Chapter 1
100 Amateur Language Processing Knock: 07
100 Language Processing Knock 2020 Chapter 3
100 Language Processing Knock 2020 Chapter 2
100 Amateur Language Processing Knock: 09
100 Amateur Language Processing Knock: 47
100 Language Processing Knock-53: Tokenization
100 Amateur Language Processing Knock: 97
100 language processing knock 2020 [00 ~ 59 answer]
100 Amateur Language Processing Knock: 67
100 Language Processing with Python Knock 2015
100 Language Processing Knock-58: Tuple Extraction
100 Language Processing Knock-57: Dependency Analysis
100 language processing knock-50: sentence break
100 Language Processing Knock Chapter 1 (Python)
100 Language Processing Knock Chapter 2 (Python)
100 Language Processing Knock-25: Template Extraction
100 Language Processing Knock-87: Word Similarity
I tried 100 language processing knock 2020
100 language processing knock-56: co-reference analysis
Solving 100 Language Processing Knock 2020 (01. "Patatokukashi")
100 Amateur Language Processing Knock: Summary
100 Language Processing Knock with Python (Chapter 1)
100 Language Processing Knock Chapter 1 in Python
100 Language Processing Knock 2020 Chapter 4: Morphological Analysis
100 Language Processing Knock 2020 Chapter 9: RNN, CNN
100 language processing knock-76 (using scikit-learn): labeling
100 language processing knock-55: named entity extraction
I tried 100 language processing knock 2020: Chapter 3
100 Language Processing Knock-82 (Context Word): Context Extraction
100 Language Processing Knock with Python (Chapter 3)
100 Language Processing Knock: Chapter 1 Preparatory Movement
100 Language Processing Knock 2020 Chapter 6: Machine Learning
100 Language Processing Knock Chapter 4: Morphological Analysis
Language processing 100 knock-86: Word vector display
100 Language Processing Knock 2020 Chapter 10: Machine Translation (90-98)
100 Language Processing Knock 2020 Chapter 5: Dependency Analysis
100 Language Processing Knock-28: MediaWiki Markup Removal
100 Language Processing Knock 2020 Chapter 7: Word Vector
100 Language Processing Knock 2020 Chapter 8: Neural Net
100 Language Processing Knock-59: Analysis of S-expressions
Python beginner tried 100 language processing knock 2015 (05 ~ 09)
100 Language Processing Knock-31 (using pandas): Verb
100 language processing knock 2020 "for Google Colaboratory"
I tried 100 language processing knock 2020: Chapter 1
100 Language Processing Knock 2020 Chapter 1: Preparatory Movement
100 language processing knock-73 (using scikit-learn): learning
100 Language Processing Knock Chapter 1 by Python
100 Language Processing Knock 2020 Chapter 3: Regular Expressions
100 Language Processing Knock-24: Extract File Reference
100 Language Processing Knock 2015 Chapter 4 Morphological Analysis (30-39)