Japanese Natural Language Processing Using Python3 (4) Sentiment Analysis by Logistic Regression

Evaluating word relevance with TF-IDF

Last time, I parsed sentences and converted the words into feature vectors. However, even if a word occurs many times in a document, it is not very important for judging the document's category if it also occurs frequently in documents of every category. For example, when classifying movie reviews into "positive" and "negative", a word like "wow" can be used in both "wow, that was great" and "wow, that was terrible", so it tells us little about whether the review is negative or positive. TF-IDF formalizes this feeling: when categorizing documents, it raises the weight of a word that is important and lowers the weight of one that is not. TF stands for term frequency and IDF for inverse document frequency, defined as follows. Letting $ n_d $ be the total number of documents and $ df(t, d) $ the number of documents containing the term $ t $:

$$ idf(t, d) = \log \frac{n_d}{1 + df(t, d)}, \quad \text{tf-idf}(t, d) = tf(t, d) \times idf(t, d) $$
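
To make the formula concrete, here is a tiny hand computation following the definition above (a minimal sketch; note that scikit-learn internally uses a slightly different smoothed formula, so its numbers will not match exactly). The word "guitar" from the example corpus below appears in 2 of the 3 documents:

import numpy as np

n_d = 3    # total number of documents
df_t = 2   # number of documents containing the term "guitar"
tf_t = 1   # raw count of "guitar" in the document at hand

idf = np.log(n_d / (1 + df_t))   # idf(t, d) = log(n_d / (1 + df(t, d)))
tfidf = tf_t * idf               # tf-idf(t, d) = tf(t, d) x idf(t, d)
print(idf, tfidf)                # 0.0 0.0 -> a term found in most documents gets no weight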

By the way, the TfidfTransformer class in Python's scikit-learn implements this relatively easily: it takes the word-count matrix produced by the CountVectorizer used last time and converts it into TF-IDF features.

tf_idf.py


import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Turn the raw sentences into a bag-of-words count matrix
count = CountVectorizer()
docs = np.array(["He likes to play the guitar",
                 "She likes to play the piano",
                 "He likes to play the guitar, and she likes to play the piano"])
bag = count.fit_transform(docs)

# Reweight the counts with smoothed, L2-normalized TF-IDF
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
np.set_printoptions(precision=2)
print(tfidf.fit_transform(bag).toarray())

Output result:
[[ 0.    0.48  0.48  0.37  0.    0.37  0.    0.37  0.37]
 [ 0.    0.    0.    0.37  0.48  0.37  0.48  0.37  0.37]
 [ 0.34  0.26  0.26  0.4   0.26  0.4   0.26  0.4   0.4 ]]

Points to keep in mind when actually analyzing text data

Cleansing text data

In a short example like the one above, the input contains no extra symbols and can be passed to CountVectorizer as is. However, real text data often includes HTML markup, separator lines, and so on, so such extra data needs to be removed before starting analysis (cleansing the text data). This can be done with Python's regular expressions, as sketched below. (Reference: Regular expression operations in Python)
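
As one example, here is a minimal cleansing function (a sketch; the regexes and the function name preprocessor are just illustrative):

import re

def preprocessor(text):
    """Strip HTML tags and non-word symbols from a review (illustrative sketch)."""
    text = re.sub(r'<[^>]*>', '', text)          # remove HTML markup such as <br />
    text = re.sub(r'[\W]+', ' ', text.lower())   # collapse symbols and separators into spaces
    return text.strip()

print(preprocessor('This movie was <br /> GREAT!!! :-)'))
# -> 'this movie was great'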

Removal of Stopwords

Words that appear frequently in a given language regardless of the category of the sentence are not very useful for classifying sentences, so it is better to remove them before actually performing machine learning (stopword removal). In English you can get stopwords from Python's NLTK library, but for Japanese there is no official library; a common approach is to read the SlothLib page (Filter/StopWord/word/Japanese.txt), parse the source, and extract the words. First, open the URL with code like the following, and then parse the source with Beautiful Soup.

ja_stopwords.py


import urllib.request
import bs4

def get_stop_words():
    # Fetch the SlothLib Japanese stopword list (words that appear frequently
    # regardless of a sentence's attributes) so we can exclude them later.
    url = 'http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt'
    # Download the page with urllib and parse the source with BeautifulSoup.
    soup = bs4.BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
    ss = str(soup)   # the file is plain text with one stopword per line
    return ss

print(get_stop_words())

Output result:
あそこ
あたり
あちら
あっち
あと
あな
あなた
あれ
いくつ
いつ
いま
...
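
Once the list is in hand, filtering a tokenized sentence is a one-liner. A small sketch (tokens stands in for the hypothetical output of a morphological analyzer such as MeCab or Janome):

stop_words = set(get_stop_words().split())   # one stopword per line in the SlothLib file

tokens = ['あそこ', 'の', 'ラーメン', 'は', 'うまい']
filtered = [t for t in tokens if t not in stop_words]
print(filtered)   # 'あそこ' is in the SlothLib list and drops out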

Training of logistic regression model to classify documents

Now that we can extract features this way, let's actually run logistic regression on preprocessed sentences and classify by machine learning whether each document is positive or negative. I could not find a handy annotated source in Japanese (there seem to be many corpora collected from Twitter, but it is a hassle to deal with AWS just for that), so I will try it on English movie reviews annotated as negative or positive. For this program I referred to the natural language processing chapter of [Python Machine Learning](https://www.amazon.co.jp/dp/4844380605).

reviews.py


from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from nltk.corpus import stopwords          # requires nltk.download('stopwords') once
from nltk.stem.porter import PorterStemmer

# df is a pandas DataFrame of 50,000 labeled movie reviews with a 'review'
# column (text) and a 'sentiment' column (0/1), loaded beforehand.
# Use the first half for training and the second half for testing.
X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

# English stopword list and two candidate tokenizers
# (plain whitespace split vs. Porter stemming)
stop = stopwords.words('english')
porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)

# Two grids: standard TF-IDF, and raw term frequencies (idf and normalization off)
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

# (newer scikit-learn versions may need solver='liblinear' for the 'l1' penalty)
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)
print('Accuracy: %.3f' % gs_lr_tfidf.best_score_)

First, df is split into training data and test data. It may be easier to understand if you picture df as a table, that is, a DataFrame from the pandas library.

Within it, the ['review'] column holds the text of each review (this is X), and the ['sentiment'] column holds the label for that review (0 or 1, indicating negative or positive; this is y). (I will omit how df actually read the original data and how the data was cleansed, but one possible way is sketched below.) After that, I use a sklearn class called GridSearchCV to tune the optimal hyperparameters for logistic regression: I created a GridSearchCV instance called gs_lr_tfidf and trained it with gs_lr_tfidf.fit() using X_train and y_train.
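
For reference, a minimal sketch of how df could be loaded, assuming the labeled reviews were already collected into a CSV file (the file name movie_data.csv is illustrative):

import pandas as pd

# Illustrative only: assumes the reviews were previously shuffled and saved
# as one review text plus one 0/1 sentiment label per row.
df = pd.read_csv('movie_data.csv', encoding='utf-8')
print(df.columns)   # expect the columns 'review' and 'sentiment'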

(Reference: Tuning hyperparameters with sklearn)

However, actually running this grid search takes a tremendous amount of time... So when the data is large, the common approach seems to be what is called out-of-core learning, sketched below.
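
As a taste of that approach, here is a minimal out-of-core sketch (not the book's exact code): scikit-learn's HashingVectorizer is stateless, so it needs no fitting, and SGDClassifier with a logistic loss can be trained incrementally with partial_fit. The stream_docs generator and the file name are assumptions for illustration.

import csv
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless hashing vectorizer: no vocabulary to fit, so it works on streams
vect = HashingVectorizer(n_features=2**21, decode_error='ignore')
# Logistic regression trained by stochastic gradient descent
# (older scikit-learn versions spell the loss 'log' instead of 'log_loss')
clf = SGDClassifier(loss='log_loss', random_state=1)

def stream_docs(path):
    """Hypothetical generator: yield one (text, label) pair per CSV row."""
    with open(path, encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader)              # skip the header row
        for text, label in reader:
            yield text, int(label)

# Feed the classifier in mini-batches so the corpus never sits in memory at once
batch_texts, batch_labels = [], []
for text, label in stream_docs('movie_data.csv'):
    batch_texts.append(text)
    batch_labels.append(label)
    if len(batch_texts) == 1000:
        clf.partial_fit(vect.transform(batch_texts), batch_labels, classes=[0, 1])
        batch_texts, batch_labels = [], []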
