The formula of "Bayes' theorem" for probability is as follows:

$$ P(A_1 \mid B) = \frac{P(B \mid A_1)\,P(A_1)}{P(B \mid A_1)\,P(A_1) + P(B \mid A_2)\,P(A_2)} $$
Now, let's think it through with a concrete example: if you get a positive test result $B$, what is the probability that you actually have disease $A$?
Apply the above formula of "Bayes' theorem" with $A_1$: affected, $A_2$: not affected, and $B$: a positive test result.
Even if you test positive, the probability that you actually have the disease is only 3%; about 97 out of 100 people who test positive are not affected.
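The article's actual figures are shown in an image that is omitted here, so the following is a minimal sketch of the calculation with assumed illustrative numbers: prevalence 0.1%, true-positive rate 90%, false-positive rate 3%.

```python
# Illustrative numbers only (assumed, not the article's own figures):
# prevalence 0.1%, true-positive rate 90%, false-positive rate 3%.
p_a1 = 0.001      # P(A1): prior probability of being affected
p_a2 = 1 - p_a1   # P(A2): prior probability of not being affected
p_b_a1 = 0.90     # P(B|A1): probability of a positive test when affected
p_b_a2 = 0.03     # P(B|A2): probability of a positive test when not affected

# Bayes' theorem
p_a1_b = p_b_a1 * p_a1 / (p_b_a1 * p_a1 + p_b_a2 * p_a2)
print(p_a1_b)  # about 0.029, i.e. roughly 3%
```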
Bayesian statistics was long treated as taboo because **subjective probabilities** are used as the "prior probabilities" on which the calculation is premised; on the other hand, usages like the one above have always been accepted. They are, so to speak, the "correct" use of **"Bayes' theorem"**.
In this example, both the morbidity rate used as the "prior probability" and the positive test rate among affected people are **objective probabilities based on frequency**, calculated from data. In short, as long as you use **objective probabilities as the prior probabilities**, "Bayes' theorem" is uncontroversial.
**Subjective probability** is a number between 0 and 1, based on personal belief, that expresses how likely an event is to occur. For example, if someone throwing a die believes they have a 1-in-2 chance of rolling a 1, that is that person's **subjective probability**. If instead you throw the die repeatedly and each face comes up with a frequency of 1/6, then the probability of each face is objectively estimated to be 1/6, an objective probability based on frequency.
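A quick sketch of that frequency-based estimation, simulating many die rolls:

```python
import random

# Roll a die many times and estimate each face's probability from the
# observed frequency: an objective probability based on frequency.
rolls = [random.randint(1, 6) for _ in range(100000)]
for face in range(1, 7):
    print(face, rolls.count(face) / len(rolls))  # each approaches 1/6, about 0.167
```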
Let's look at a concrete example.
Similarly, examining the database for the probability that the phrase "only now deals" appears in junk mail and in regular mail gives the following.
Email X is 94% likely to be junk, but there is also a 6% chance that it is regular mail, so it would still be dangerous to delete it as junk.
Therefore, **this posterior probability is used as the prior probability**, and "Bayes' theorem" is applied anew to the phrase "free invitation".

We **update the prior probability** further, add the data, and apply "Bayes' theorem" to the phrase "must make money".

Similarly, applying "Bayes' theorem" to the phrase "special monitor", the posterior probability approaches 100%, as follows.
Applying "Bayes' theorem" one after another in this way is called ** Bayes update **. With only four Bayesian updates, there is a approximately 0.0001% chance that an email X is not spam. ** The Bayesian way of thinking is to find posterior probabilities based on features, but strictly speaking, each feature must be independent of each other **. For example, "special monitors" and "free invitations" are thought to influence each other, but it is known that closing your eyes is very effective in practical use.
In short, it is a simplified methodology that **assumes all the words in a sentence are independent**, which is why it is called a "**naive** (= simple) **Bayesian filter**".
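Here is a minimal sketch of this chain of updates. The per-word probabilities below are hypothetical stand-ins (the article's actual tables are in omitted images), chosen so that the first update lands near the 94% mentioned above.

```python
# Hypothetical likelihoods: (P(word | spam), P(word | regular)) per phrase.
# These are assumed values, not the article's actual table.
likelihoods = {
    'only now deals':  (0.50, 0.03),
    'free invitation': (0.40, 0.02),
    'must make money': (0.30, 0.01),
    'special monitor': (0.25, 0.01),
}

p_spam = 0.50  # assumed prior probability that any given mail is spam

for word, (p_w_spam, p_w_reg) in likelihoods.items():
    # Bayes' theorem; the posterior becomes the prior for the next word
    numerator = p_w_spam * p_spam
    p_spam = numerator / (numerator + p_w_reg * (1 - p_spam))
    print(f'after "{word}": P(spam) = {p_spam:.6f}')
```

With these assumed numbers the posterior climbs from about 94% after the first phrase to well above 99.99% after the fourth, mirroring the progression described above.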
**The posterior probabilities calculated by "Bayes' theorem" vary greatly depending on the prior probabilities**. This is exactly why "Bayes' theorem" was shunned for so long: with subjective, arbitrary prior probabilities, the posterior probability can come out as almost any value.
Here, the idea emerged that if you collect a lot of objective data, the influence of the prior probability becomes small and the posterior probability stabilizes. The claim is that **the quantity and quality of the data can effectively eliminate the influence of the prior probabilities**.
Incidentally, the proportion of junk mail among all mail can be checked from the database, but when various constraints make that impossible, the prior probability has to be decided arbitrarily from subjective impressions. So, back to the junk mail problem.
The table below summarizes how the posterior probability changes as Bayesian updates are repeated for each of seven prior probabilities, from "1 per 100 emails (0.01)" to "half of all emails (0.50)".
Whether the prior probability is 0.01 or 0.50, the posterior probability exceeds 99% by the fourth update. The data speak for themselves: we can see that **accumulating objective data stabilizes the posterior probability**.
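A quick sketch of this convergence, reusing the same hypothetical likelihoods as before. The seven priors follow the article's stated range from 0.01 to 0.50; the intermediate values are assumed.

```python
# Hypothetical (P(word | spam), P(word | regular)) pairs, same as the earlier sketch
likelihoods = [(0.50, 0.03), (0.40, 0.02), (0.30, 0.01), (0.25, 0.01)]

# Seven priors spanning 0.01 ("1 per 100 emails") to 0.50 ("half of all emails");
# the values in between are assumed for illustration.
for prior in (0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50):
    p = prior
    for p_w_spam, p_w_reg in likelihoods:
        p = p_w_spam * p / (p_w_spam * p + p_w_reg * (1 - p))
    print(f'prior {prior:.2f} -> posterior after 4 updates: {p:.4f}')
```

Even starting from 0.01, the posterior ends up above 0.99, just as the table describes.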
Upload the `rt-polaritydata.tar.gz` you downloaded to your local PC to Colaboratory.

```python
from google.colab import files

files.upload()
```
A `.tar.gz` file is multiple files archived into one with the tar command and then compressed with the gzip command, so unpack it with `!tar -zxvf`.

```
!tar -zxvf rt-polaritydata.tar.gz
```
The file groups ending in `.neg` and `.pos` are the data classified as negative and positive, respectively.

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
```
`word_tokenize` tokenizes words, as its name suggests, while the `punkt` downloaded just before it is the pre-trained tokenizer model that `word_tokenize` relies on.

```python
def format_sentense(sentense):
    return {word: True for word in word_tokenize(sentense)}
```
The outer `{}` of the return value is a dictionary comprehension: for each token produced by `word_tokenize()`, it stores the word as a key with the Boolean value `True`.

```python
# Preprocessing of positive data
pos_data = []
with open('rt-polaritydata/rt-polarity.pos', encoding='latin-1') as f:
    for line in f:
        pos_data.append([format_sentense(line), 'pos'])

# Preprocessing of negative data
neg_data = []
with open('rt-polaritydata/rt-polarity.neg', encoding='latin-1') as f:
    for line in f:
        neg_data.append([format_sentense(line), 'neg'])
```
As an example, here is the first comment that is classified as positive.
When this is preprocessed, it will be as follows.
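The original displays these as output; a minimal way to reproduce them in the notebook, assuming the cells above have been run:

```python
# Re-read the first positive comment and show it before and after preprocessing
with open('rt-polaritydata/rt-polarity.pos', encoding='latin-1') as f:
    first_comment = f.readline()

print(first_comment)                   # the raw review text
print(format_sentense(first_comment))  # the {word: True, ...} feature dictionary
```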
Use the first 4,000 items of each polarity as training data and the remainder as evaluation data.

```python
# Acquisition of training data
training_data = pos_data[:4000] + neg_data[:4000]

# Acquisition of evaluation data
testing_data = pos_data[4000:] + neg_data[4000:]
```
Use the `NaiveBayesClassifier` class to generate a model based on the training data.

```python
from nltk.classify import NaiveBayesClassifier

# Model generation
model = NaiveBayesClassifier.train(training_data)
```
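As an optional aside not in the original article, NLTK's trained classifier can also report which word features weigh most heavily in the pos/neg decision:

```python
# Show the 10 most informative word features of the trained model
model.show_most_informative_features(10)
```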
es1 = "This is a hilarious movie and I would watch it again and again."
es2 = "This is a boring movie and once you see it, you'll have enough."
#Output judgment result
print( es1, '--->', model.classify(format_sentense(es1)) )
print( es2, '--->', model.classify(format_sentense(es2)) )
`accuracy()` calculates the accuracy of the model on the test data specified as its second argument.

```python
from nltk.classify.util import accuracy

print('Correct answer probability: ', accuracy(model, testing_data))
```