The formula of "Bayes' theorem" for probability is as follows:

$$ P(A_1 \mid B) = \frac{P(B \mid A_1)\,P(A_1)}{P(B \mid A_1)\,P(A_1) + P(B \mid A_2)\,P(A_2)} $$
Now, let's think it through with a concrete example: if you get a positive test result $B$, what is the probability that you actually have disease $A$?
Apply the above formula of "Bayes' theorem" with $A_1$: affected, $A_2$: not affected, and $B$: a positive test result.
Even if you test positive, the probability that you actually have the disease is only 3%; about 97 out of 100 people who test positive are not affected.
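The article's actual figures are shown in an image that is omitted here, so the following is a minimal sketch of the calculation with assumed illustrative numbers: prevalence 0.1%, true-positive rate 90%, false-positive rate 3%.

```python
# Illustrative numbers only (assumed, not the article's own figures):
# prevalence 0.1%, true-positive rate 90%, false-positive rate 3%.
p_a1 = 0.001      # P(A1): prior probability of being affected
p_a2 = 1 - p_a1   # P(A2): prior probability of not being affected
p_b_a1 = 0.90     # P(B|A1): probability of a positive test when affected
p_b_a2 = 0.03     # P(B|A2): probability of a positive test when not affected

# Bayes' theorem
p_a1_b = p_b_a1 * p_a1 / (p_b_a1 * p_a1 + p_b_a2 * p_a2)
print(p_a1_b)  # about 0.029, i.e. roughly 3%
```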
Bayesian statistics was long treated as taboo because **subjective probabilities** are used as the "prior probabilities" on which the calculation is premised; on the other hand, usages like the one above have always been accepted. They are, so to speak, the "correct" use of **"Bayes' theorem"**.
In this example, both the morbidity rate used as the "prior probability" and the positive test rate among affected people are **objective probabilities based on frequency**, calculated from data. In short, as long as you use **objective probabilities as the prior probabilities**, "Bayes' theorem" is uncontroversial.
**Subjective probability** is a number between 0 and 1, based on personal belief, that expresses how likely an event is to occur. For example, if someone throwing a die believes they have a 1-in-2 chance of rolling a 1, that is that person's **subjective probability**. If instead you throw the die repeatedly and each face comes up with a frequency of 1/6, then the probability of each face is objectively estimated to be 1/6, an objective probability based on frequency.
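A quick sketch of that frequency-based estimation, simulating many die rolls:

```python
import random

# Roll a die many times and estimate each face's probability from the
# observed frequency: an objective probability based on frequency.
rolls = [random.randint(1, 6) for _ in range(100000)]
for face in range(1, 7):
    print(face, rolls.count(face) / len(rolls))  # each approaches 1/6, about 0.167
```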
Let's look at a concrete example.
Similarly, examining the database for the probability that the phrase "only now deals" appears in junk mail and in regular mail gives the following.
Email X is 94% likely to be junk, but there is also a 6% chance that it is regular mail, so it would still be dangerous to delete it as junk.
Therefore, **this posterior probability is used as the prior probability**, and "Bayes' theorem" is applied anew to the phrase "free invitation".

We **update the prior probability** further, add the data, and apply "Bayes' theorem" to the phrase "must make money".

Similarly, applying "Bayes' theorem" to the phrase "special monitor", the posterior probability approaches 100%, as follows.
Applying "Bayes' theorem" one after another in this way is called ** Bayes update **. With only four Bayesian updates, there is a approximately 0.0001% chance that an email X is not spam. ** The Bayesian way of thinking is to find posterior probabilities based on features, but strictly speaking, each feature must be independent of each other **. For example, "special monitors" and "free invitations" are thought to influence each other, but it is known that closing your eyes is very effective in practical use.
In short, it is a simplified methodology that **assumes all the words in a sentence are independent**, which is why it is called a "**naive** (= simple) **Bayesian filter**".
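Here is a minimal sketch of this chain of updates. The per-word probabilities below are hypothetical stand-ins (the article's actual tables are in omitted images), chosen so that the first update lands near the 94% mentioned above.

```python
# Hypothetical likelihoods: (P(word | spam), P(word | regular)) per phrase.
# These are assumed values, not the article's actual table.
likelihoods = {
    'only now deals':  (0.50, 0.03),
    'free invitation': (0.40, 0.02),
    'must make money': (0.30, 0.01),
    'special monitor': (0.25, 0.01),
}

p_spam = 0.50  # assumed prior probability that any given mail is spam

for word, (p_w_spam, p_w_reg) in likelihoods.items():
    # Bayes' theorem; the posterior becomes the prior for the next word
    numerator = p_w_spam * p_spam
    p_spam = numerator / (numerator + p_w_reg * (1 - p_spam))
    print(f'after "{word}": P(spam) = {p_spam:.6f}')
```

With these assumed numbers the posterior climbs from about 94% after the first phrase to well above 99.99% after the fourth, mirroring the progression described above.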
**The posterior probabilities calculated by "Bayes' theorem" vary greatly depending on the prior probabilities**. This is exactly why "Bayes' theorem" was shunned for so long: with subjective, arbitrary prior probabilities, the posterior probability can come out as almost any value.
Here, the idea emerged that if you collect a lot of objective data, the influence of the prior probability becomes small and the posterior probability stabilizes. The claim is that **the quantity and quality of the data can effectively eliminate the influence of the prior probabilities**.
Incidentally, the proportion of junk mail among all mail can be checked from the database, but when various constraints make that impossible, the prior probability has to be decided arbitrarily from subjective impressions. So, back to the junk mail problem.
The table below summarizes how the posterior probability changes as Bayesian updates are repeated for each of seven prior probabilities, from "1 per 100 emails (0.01)" to "half of all emails (0.50)".
Whether the prior probability is 0.01 or 0.50, the posterior probability exceeds 99% by the fourth update. The data speak for themselves: we can see that **accumulating objective data stabilizes the posterior probability**.
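A quick sketch of this convergence, reusing the same hypothetical likelihoods as before. The seven priors follow the article's stated range from 0.01 to 0.50; the intermediate values are assumed.

```python
# Hypothetical (P(word | spam), P(word | regular)) pairs, same as the earlier sketch
likelihoods = [(0.50, 0.03), (0.40, 0.02), (0.30, 0.01), (0.25, 0.01)]

# Seven priors spanning 0.01 ("1 per 100 emails") to 0.50 ("half of all emails");
# the values in between are assumed for illustration.
for prior in (0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50):
    p = prior
    for p_w_spam, p_w_reg in likelihoods:
        p = p_w_spam * p / (p_w_spam * p + p_w_reg * (1 - p))
    print(f'prior {prior:.2f} -> posterior after 4 updates: {p:.4f}')
```

Even starting from 0.01, the posterior ends up above 0.99, just as the table describes.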
Upload the `rt-polaritydata.tar.gz` you downloaded to your local PC to Colaboratory.

```python
from google.colab import files

files.upload()
```
A `.tar.gz` file is multiple files archived into one with the tar command and then compressed with the gzip command, so unpack it with `!tar -zxvf`.

```
!tar -zxvf rt-polaritydata.tar.gz
```
The file groups ending in `.neg` and `.pos` are the data classified as negative and positive, respectively.

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
```
`word_tokenize` tokenizes words, as its name suggests, while the `punkt` downloaded just before it is the pre-trained tokenizer model that `word_tokenize` relies on.

```python
def format_sentense(sentense):
    return {word: True for word in word_tokenize(sentense)}
```
The outer `{}` of the return value is a dictionary comprehension: for each token produced by `word_tokenize()`, it stores the word as a key with the Boolean value `True`.

```python
# Preprocessing of positive data
pos_data = []
with open('rt-polaritydata/rt-polarity.pos', encoding='latin-1') as f:
    for line in f:
        pos_data.append([format_sentense(line), 'pos'])

# Preprocessing of negative data
neg_data = []
with open('rt-polaritydata/rt-polarity.neg', encoding='latin-1') as f:
    for line in f:
        neg_data.append([format_sentense(line), 'neg'])
```
As an example, here is the first comment that is classified as positive.
When this is preprocessed, it will be as follows.
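The original displays these as output; a minimal way to reproduce them in the notebook, assuming the cells above have been run:

```python
# Re-read the first positive comment and show it before and after preprocessing
with open('rt-polaritydata/rt-polarity.pos', encoding='latin-1') as f:
    first_comment = f.readline()

print(first_comment)                   # the raw review text
print(format_sentense(first_comment))  # the {word: True, ...} feature dictionary
```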
Use the first 4,000 items of each polarity as training data and the remainder as evaluation data.

```python
# Acquisition of training data
training_data = pos_data[:4000] + neg_data[:4000]

# Acquisition of evaluation data
testing_data = pos_data[4000:] + neg_data[4000:]
```
Use the `NaiveBayesClassifier` class to generate a model based on the training data.

```python
from nltk.classify import NaiveBayesClassifier

# Model generation
model = NaiveBayesClassifier.train(training_data)
```
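As an optional aside not in the original article, NLTK's trained classifier can also report which word features weigh most heavily in the pos/neg decision:

```python
# Show the 10 most informative word features of the trained model
model.show_most_informative_features(10)
```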
es1 = "This is a hilarious movie and I would watch it again and again."
es2 = "This is a boring movie and once you see it, you'll have enough."
#Output judgment result
print( es1, '--->', model.classify(format_sentense(es1)) )
print( es2, '--->', model.classify(format_sentense(es2)) )
`accuracy()` calculates the accuracy of the model on the test data specified as its second argument.

```python
from nltk.classify.util import accuracy

print('Correct answer probability: ', accuracy(model, testing_data))
```