100 language processing knock-72 (using Stanford NLP): feature extraction

This is the 72nd record of Language Processing 100 Knock 2015. Apparently the Japanese word for "feature" used here has a special reading as a language-processing term (see the Wikipedia article on "feature structure" (素性構造)); to anyone doing machine learning it simply means the familiar "feature". This time, the text file is read and the lemmas (dictionary headwords) other than the stop words covered in the last knock (stop words) are extracted as features.

| Link | Remarks |
|:--|:--|
| 072_1. Feature extraction (extraction).ipynb | Answer program (extraction), GitHub link |
| 072_2. Feature extraction (analysis).ipynb | Answer program (analysis), GitHub link |
| 100 amateur language processing knocks: 72 | I always refer to this when working through the 100 language processing knocks |
| Getting Started with Stanford NLP in Python | A clear explanation of the difference from Stanford CoreNLP |

environment

| type | version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running virtually |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | I use python 3.6.9 on pyenv; there is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv |

In the above environment, I am using the following additional Python packages. Just install with regular pip.

| type | version |
|:--|:--|
| nltk | 3.4.5 |
| stanfordnlp | 0.2.0 |
| pandas | 0.25.3 |
| matplotlib | 3.1.1 |
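
Besides the pip installs, the NLTK stop word corpus and the stanfordnlp English models used below have to be downloaded once (the earlier knocks in this chapter already set this up). If starting from a clean environment, a one-time setup along these lines should work (a sketch, assuming the default download locations are acceptable):

import nltk
import stanfordnlp

#One-time downloads: the NLTK stop word corpus and the stanfordnlp English models
nltk.download('stopwords')
stanfordnlp.download('en')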

Task

Chapter 8: Machine Learning

In this chapter, the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) of Movie Review Data published by Bo Pang and Lillian Lee is used to work on the task of classifying sentences as positive or negative (polarity analysis).

72. Feature extraction

Design your own features that may be useful for polarity analysis, and extract them from the training data. As a minimal baseline, use the review text with stop words removed and each word stemmed.

Answer

Answer premise

It says, "The minimum baseline of each word is stemmed," but it uses a lemma instead of stemming. This time, not only the extraction but also what kind of words there are and the frequency distribution is visualized.

Answer program (extraction) [072_1. Feature extraction (extraction).ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/072_1.%E7%B4%A0%E6%80%A7%E6%8A%BD%E5%87%BA%28%E6%8A%BD%E5%87%BA%29.ipynb)

First, the extraction program, which is the main subject of this task.

import warnings
import re
from collections import Counter
import csv

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer as PS
import stanfordnlp

#Defined as a set for speed
STOP_WORDS = set(stopwords.words('english'))

ps = PS()

#Seems to be compliant with Universal POS tags
# https://universaldependencies.org/u/pos/
EXC_POS = {'PUNCT',   #Punctuation
           'X',       #Other
           'SYM',     #symbol
           'PART',    #Particle('s etc.)
           'CCONJ',   #conjunction(and etc.)
           'AUX',     #Auxiliary verb(would etc.)
           'PRON',    #Pronoun
           'SCONJ',   #Subordinate conjunction(whether etc.)
           'ADP',     #Preposition(in etc.)
           'NUM'}     #number


#It was slow to specify all the default processors, so narrow down to the minimum
# https://stanfordnlp.github.io/stanfordnlp/processors.html
nlp = stanfordnlp.Pipeline(processors='tokenize,pos,lemma')

reg_sym = re.compile(r'^[!-/:-@[-`{-~]|[!-/:-@[-`{-~]$')  #ASCII symbol at the start or end of a token
reg_dit = re.compile('[0-9]')                             #Any digit

#Remove leading and trailing symbols
def remove_symbols(lemma):
    return reg_sym.sub('', lemma)

#Judge whether a word should be excluded (stop word, symbol, excluded POS, single character, or digit)
def is_stopword(word):
    lemma = remove_symbols(word.lemma)
    return True if lemma in STOP_WORDS \
                  or lemma == '' \
                  or word.upos in EXC_POS \
                  or len(lemma) == 1 \
                  or reg_dit.search(lemma)\
                else False

#Hide warning
warnings.simplefilter('ignore', UserWarning)

lemma = []

with open('./sentiment.txt') as file:
    for i, line in enumerate(file):
        print("\r{0}".format(i), end="")
        
        #The first 3 characters only indicate negative/positive, so exclude them from the nlp processing (to keep it as fast as possible)
        doc = nlp(line[3:])
        for sentence in doc.sentences:
            lemma.extend([ps.stem(remove_symbols(word.lemma)) for word in sentence.words if is_stopword(word) is False])

freq_lemma = Counter(lemma)

with open('./lemma_all.txt', 'w') as f_out:
    writer = csv.writer(f_out, delimiter='\t')
    writer.writerow(['Char', 'Freq'])
    for key, value in freq_lemma.items():
        writer.writerow([key] + [value])

Answer explanation (extraction)

The language processing part of Stanford NLP is slow: **it takes about an hour**. Since I did not want to re-run it while trying things out, the extraction result is written out to a [CSV file](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/lemma_all.txt). Writing it out to a file made it possible to separate the analysis of the extraction result into its own program. Since this mostly follows the [last knock's stop word program](https://qiita.com/FukuharaYohei/items/60719ddaa47474a9d670#%E5%9B%9E%E7%AD%94%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0%E5%AE%9F%E8%A1%8C%E7%B7%A8-071_2%E3%82%B9%E3%83%88%E3%83%83%E3%83%97%E3%83%AF%E3%83%BC%E3%83%89%E5%AE%9F%E8%A1%8Cipynb), there is not much to explain. If anything, the warning messages from the following part were annoying, so they are suppressed.

#Hide warning
warnings.simplefilter('ignore', UserWarning)

Answer program (analysis) [072_2. Feature extraction (analysis).ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/072_2.%E7%B4%A0%E6%80%A7%E6%8A%BD%E5%87%BA%28%E5%88%86%E6%9E%90%29.ipynb)

As a bonus, the extracted features are briefly analyzed.

import pandas as pd
import matplotlib.pyplot as plt

df_feature = pd.read_table('./lemma_all.txt')

sorted = df_feature.sort_values('Freq', ascending=False)

#Top 10 features frequency output
print(sorted.head(10))

#Feature basic statistic output
print(sorted.describe())

#Feature number output in descending order of frequency
uniq_freq = df_feature['Freq'].value_counts()
print(uniq_freq)

#Bar chart: frequencies that have more than 30 feature types
uniq_freq[uniq_freq > 30].sort_index().plot.bar(figsize=(12, 10))

#Bar chart: frequencies that have more than 30 but fewer than 1000 feature types
uniq_freq[(uniq_freq > 30) & (uniq_freq < 1000)].sort_index().plot.bar(figsize=(12, 10))
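
One note that is not in the original notebook: the charts render inline in Jupyter, but if these cells are run as a plain Python script the figure has to be shown explicitly:

plt.show()  #Only needed when running outside Jupyter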

Answer explanation (analysis)

pandas is used to process the CSV (tab-separated) file. The top 10 extracted features by frequency are as follows (the leftmost column is the DataFrame index and can be ignored). Since this is Movie Review data, words such as film and movie are frequent.

        Char  Freq
102     film  1801
77      movi  1583
96      make   838
187    stori   540
258     time   504
43   charact   492
79      good   432
231   comedi   414
458     even   392
21      much   388

Looking at the basic statistics, it looks like this. Approximately 12,000 features have been extracted, with an average frequency of 8.9 times.

               Freq
count  12105.000000
mean       8.860140
std       34.019655
min        1.000000
25%        1.000000
50%        2.000000
75%        6.000000
max     1801.000000
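
As a rough sanity check (a sketch, not in the original notebook), the total number of extracted feature tokens is the sum of all frequencies, which from the statistics above is about count × mean ≈ 12,105 × 8.86 ≈ 107,000:

#Total number of extracted tokens = sum of all feature frequencies
print(df_feature['Freq'].sum())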

Counting how many features occur at each frequency (for the roughly 12,000 features) gives the following, in descending order; more than half of the features appear no more than twice.

1     4884
2     1832
3     1053
4      707
5      478
6      349
7      316
8      259
9      182
10     176
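
The "more than half" claim can be checked directly with the same DataFrame (a quick sketch); from the counts above, (4884 + 1832) / 12105 ≈ 55% of the features appear at most twice:

#Share of features that appear no more than twice (about 55%)
print((df_feature['Freq'] <= 2).mean())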

Narrowing down to the frequencies that have more than 30 feature types, a bar chart is displayed with frequency on the X-axis and the number of feature types on the Y-axis.

Since there are so many features that appear 3 times or fewer, the bar chart was hard to read; it is therefore narrowed down to the frequencies that have more than 30 but fewer than 1000 feature types.
