[PYTHON] 100 Language Processing Knock-71 (using Stanford NLP): Stopword

This is the record of the 71st task of Language Processing 100 Knock 2015. This time I use the nltk package and the stanfordnlp package to exclude stop words: a simple stop word dictionary is obtained from the nltk package, and symbols are additionally judged by part of speech. Until now I hadn't posted my answers to the blog because they were basically the same as "Amateur language processing 100 knocks", but I took "Chapter 8: Machine Learning" seriously and changed things to some extent, so I will post them. I mainly use Stanford NLP.

Reference link

| Link | Remarks |
|:--|:--|
| 071_1. Stop word (Preparation).ipynb | GitHub link to the answer program (preparation part) |
| 071_2. Stop word (Run).ipynb | GitHub link to the answer program (run part) |
| 100 amateur language processing knocks: 71 | The blog I am always indebted to when working on the 100 knocks |
| Getting Started with Stanford NLP in Python | An easy-to-understand explanation of the difference from Stanford CoreNLP |

Environment

| Type | Version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running virtually |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason not to use the 3.7 or 3.8 series. Packages are managed with venv |

In the above environment, I use the following additional Python packages. They can be installed with regular pip.

| Type | Version |
|:--|:--|
| nltk | 3.4.5 |
| stanfordnlp | 0.2.0 |

Task

Chapter 8: Machine Learning

In this chapter, the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) of Movie Review Data published by Bo Pang and Lillian Lee is used to work on the task of classifying sentences as positive or negative (polarity analysis).

71. Stop word

Create an appropriate list of English stop words (stop list). Furthermore, implement a function that returns true if the word (character string) given as an argument is included in the stop list, and false otherwise. In addition, write a test for that function.

** "Appropriately" **?

Answer

Answer premise

I wondered what to do about the **"appropriately"** in the assignment. In the end, I decided to judge whether a word is a stop word using both the stop words defined in the nltk package and the part-of-speech information from the morphological analysis results.

Answer Program (Preparation) [071_1. Stopword (Preparation).ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/071_1.%E3%82%B9%E3%83%88%E3%83%83%E3%83%97%E3%83%AF%E3%83%BC%E3%83%89(%E6%BA%96%E5%82%99).ipynb)

First, the preparation. It is separate from running the answer and only needs to be run once after installing the packages: download the stop word list for the nltk package. This is done beforehand, separately from pip install.

import nltk

#Download Stopword
nltk.download('stopwords')

#Stop word confirmation
print(nltk.corpus.stopwords.words('english'))

Also download the English model for the stanfordnlp package. Note that it is about 250 MB. This too is done beforehand, separately from pip install.

import stanfordnlp

stanfordnlp.download('en')

stanfordnlp.Pipeline()

Answer Program (Execution) [071_2. Stopword (Execution).ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/071_2.%E3%82%B9%E3%83%88%E3%83%83%E3%83%97%E3%83%AF%E3%83%BC%E3%83%89(%E5%AE%9F%E8%A1%8C).ipynb)

import re

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer as PS
import stanfordnlp

# Defined as a set for fast membership tests
STOP_WORDS = set(stopwords.words('english'))

ps = PS()

#Seems to be compliant with Universal POS tags
# https://universaldependencies.org/u/pos/
EXC_POS = {'PUNCT',   #Punctuation
           'X',       #Other
           'SYM',     #symbol
           'PART',    #Particle('s etc.)
           'CCONJ',   #conjunction(and etc.)
           'AUX',     #Auxiliary verb(would etc.)
           'PRON',    #Pronoun
           'SCONJ',   #Subordinate conjunction(whether etc.)
           'ADP',     #Preposition(in etc.)
           'NUM'}     #number

# Specifying all the default processors was slow, so narrow down to the minimum
# https://stanfordnlp.github.io/stanfordnlp/processors.html
nlp = stanfordnlp.Pipeline(processors='tokenize,pos,lemma')

reg_sym = re.compile(r'^[!-/:-@[-`{-~]|[!-/:-@[-`{-~]$')
reg_dit = re.compile('[0-9]')

#Remove leading and trailing symbols
def remove_symbols(lemma):
    return reg_sym.sub('', lemma)

# Judge whether a word is a stop word
def is_stopword(word):
    lemma = remove_symbols(word.lemma)
    return True if lemma in STOP_WORDS \
                  or lemma == '' \
                  or word.upos in EXC_POS \
                  or len(lemma) == 1 \
                  or reg_dit.search(lemma)\
                else False

#Judge 3 sentences as a trial
with open('./sentiment.txt') as file:
    for i, line in enumerate(file):
        
        # The first 3 characters only indicate negative/positive, so exclude them from nlp processing (to keep it as fast as possible)
        doc = nlp(line[3:])
        print(i, line)
        for sentence in doc.sentences:
            for word in sentence.words:
                print(word.text, word.upos, remove_symbols(word.lemma), ps.stem(remove_symbols(word.lemma)), is_stopword(word))
        
        if i == 2:
            break

Answer commentary

This time, in addition to simple stop word exclusion, morphological analysis is used so that particles and other parts of speech can also be excluded. First, the stop words are loaded as a set.

# Defined as a set for fast membership tests
STOP_WORDS = set(stopwords.words('english'))

These are the contents of the stop word list.

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
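
For example, the first condition in is_stopword is simply a membership test against this set:

# Quick membership checks against the nltk stop word set (illustrative only).
print('the' in STOP_WORDS)    # True
print('movie' in STOP_WORDS)  # False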

In addition, the following parts of speech are defined as ones that are not used. Looking at the results afterwards, this list gradually grew. What is nice about this approach is that, for example, "like" in "I **like** this movie" is not treated as a stop word because it is a verb, while "like" in "he is **like** my hero" is excluded as an ADP (preposition); see the small check after the code below. The tag set appears to follow the Universal POS tags.

#Seems to be compliant with Universal POS tags
# https://universaldependencies.org/u/pos/
EXC_POS = {'PUNCT',   #Punctuation
           'X',       #Other
           'SYM',     #symbol
           'PART',    #Particle('s etc.)
           'CCONJ',   #conjunction(and etc.)
           'AUX',     #Auxiliary verb(would etc.)
           'PRON',    #Pronoun
           'SCONJ',   #Subordinate conjunction(whether etc.)
           'ADP',     #Preposition(in etc.)
           'NUM'}     #number
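
To check the "like" distinction concretely, a small sketch like the following could be run; it assumes the nlp pipeline and the is_stopword function from the full program above.

# See how "like" is tagged in two different contexts (illustrative sketch).
for text in ('I like this movie .', 'He is like my hero .'):
    doc = nlp(text)
    for word in doc.sentences[0].words:
        if word.text == 'like':
            print(text, '->', word.upos, is_stopword(word))
# "like" should come out as VERB (kept) in the first sentence and ADP (excluded) in the second.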

Compile the regular expressions used later. The first matches a half-width symbol at the beginning or at the end of a string; the second matches digits.

reg_sym = re.compile(r'^[!-/:-@[-`{-~]|[!-/:-@[-`{-~]$')
reg_dit = re.compile('[0-9]')

A function that removes half-width symbols from the beginning and end of a word. For example, for a string like -a, the leading symbol is removed.

#Remove leading and trailing symbols
def remove_symbols(lemma):
    return reg_sym.sub('', lemma)
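
A few illustrative calls, with the outputs I would expect from the regex above:

# One leading and one trailing half-width symbol are stripped.
print(remove_symbols('-a'))        # a
print(remove_symbols('"movie"'))   # movie
print(remove_symbols('.'))         # (empty string: the whole token was a symbol)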

The key function is defined below. lemma holds the lemma, i.e. the word converted to its dictionary form as in lemmatisation (e.g. better -> good). A word is judged to be a stop word in the following cases:

  1. True if the lemma is included in the stop word list
  2. True if the lemma is empty (leading and trailing symbols have already been stripped, so this means the whole token consisted of symbols)
  3. True if the part of speech is one of those defined above as unlikely to carry sentiment
  4. True if the lemma is only one character long
  5. True if the lemma contains a digit (tokens like 12th fall under this)

# Judge whether a word is a stop word
def is_stopword(word):
    lemma = remove_symbols(word.lemma)
    return True if lemma in STOP_WORDS \
                  or lemma == '' \
                  or word.upos in EXC_POS \
                  or len(lemma) == 1 \
                  or reg_dit.search(lemma)\
                else False
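
The task also asks for a test of this function. The notebooks don't include one, but a minimal sketch with plain assert statements could look like the following. The StubWord class is my own stand-in providing only the .lemma and .upos attributes that the function reads; it is not part of stanfordnlp.

# Minimal test sketch for is_stopword (uses a hypothetical stand-in object).
class StubWord:
    def __init__(self, lemma, upos):
        self.lemma = lemma
        self.upos = upos

def test_is_stopword():
    assert is_stopword(StubWord('the', 'DET')) is True      # in STOP_WORDS
    assert is_stopword(StubWord('movie', 'NOUN')) is False   # ordinary content word
    assert is_stopword(StubWord('and', 'CCONJ')) is True     # excluded part of speech
    assert is_stopword(StubWord('.', 'PUNCT')) is True       # symbols only -> empty lemma
    assert is_stopword(StubWord('12th', 'ADJ')) is True      # contains a digit

test_is_stopword()
print('all tests passed')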

After that, the file is read and each word is judged. Since stanfordnlp is slow, the first three characters of each line, which only mark negative/positive, are excluded from nlp processing to keep it as fast as possible. As a trial, only the first three sentences are processed. Finally, the stemmed form produced by ps.stem is also printed. This maps, for example, the three words adhere, adherence and adherent to the common stem adher. In the subsequent machine learning tasks I think this form works better, so I use it.

with open('./sentiment.txt') as file:
    for i, line in enumerate(file):
        
        # The first 3 characters only indicate negative/positive, so exclude them from nlp processing (to keep it as fast as possible)
        doc = nlp(line[3:])
        print(i, line)
        for sentence in doc.sentences:
            for word in sentence.words:
                print(word.text, word.upos, remove_symbols(word.lemma), ps.stem(remove_symbols(word.lemma)), is_stopword(word))
        
        if i == 2:
            break

The execution result looks like this.

0 +1 a chick flick for guys .

a DET a a True
chick NOUN chick chick False
flick NOUN flick flick False
for ADP for for True
guys NOUN guy guy False
. PUNCT   True
1 +1 an impressive if flawed effort that indicates real talent .

an DET a a True
impressive ADJ impressive impress False
if SCONJ if if True
flawed VERB flaw flaw False
effort NOUN effort effort False
that PRON that that True
indicates VERB indicate indic False
real ADJ real real False
talent NOUN talent talent False
. PUNCT   True
2 +1 displaying about equal amounts of naiveté , passion and talent , beneath clouds establishes sen as a filmmaker of considerable potential .

displaying VERB displaying display False
about ADP about about True
equal ADJ equal equal False
amounts NOUN amount amount False
of ADP of of True
naiveté NOUN naiveté naiveté False
, PUNCT   True
passion NOUN passion passion False
and CCONJ and and True
talent NOUN talent talent False
, PUNCT   True
beneath ADP beneath beneath True
clouds NOUN cloud cloud False
establishes VERB establish establish False
sen NOUN sen sen False
as ADP as as True
a DET a a True
filmmaker NOUN filmmaker filmmak False
of ADP of of True
considerable ADJ considerable consider False
potential NOUN potential potenti False
. PUNCT   True
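
Incidentally, the stemming behavior mentioned above can be checked directly with a quick sketch using the PorterStemmer instance ps defined at the top:

# All three related words should reduce to the common stem "adher".
for w in ('adhere', 'adherence', 'adherent'):
    print(w, '->', ps.stem(w))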
