To deepen my understanding of machine learning, I implemented a naive Bayes classifier, a classic way to automatically build a classification model from data. Image analysis with services such as the Cloud Vision API is popular these days, but the bar seemed high for a beginner, so I started with natural language processing.
This time, I used the Twitter API to collect quotes from bot accounts and implemented a **Hanyu classifier** that distinguishes tweets by Habu-san (the shogi player Yoshiharu Habu) from tweets by Hanyu-kun (the figure skater Yuzuru Hanyu).
The API client is implemented in Ruby, and the classifier is implemented in Python. MeCab is used for morphological analysis.
Properly speaking, both share the same written surname (羽生), but for convenience please allow me to refer to the shogi player as Habu-san and to the figure skater as Hanyu-kun.
The API client is implemented as follows using the Twitter Ruby Gem. The Twitter API imposes rate limits on the number of requests per time window, so if you want to collect a lot of tweets you need to pause for roughly 15 minutes between batches. Take a coffee break from time to time.
```ruby
require 'twitter'

client = Twitter::REST::Client.new do |config|
  config.consumer_key        = 'XXX'
  config.consumer_secret     = 'YYY'
  config.access_token        = 'hoge'
  config.access_token_secret = 'fuga'
end

tweets = []
client.user_timeline('TwitterUserID', count: 150).each_with_index do |tl, i|
  tweet = client.status(tl.id).text
  # Eliminate duplicates before printing
  unless tweets.include?(tweet)
    tweets << tweet
    puts tweet
  end
end
```
Now, let's process the data we just collected. As described later, the idea is to quantify and classify each tweet by **whether it contains words that Habu-san or Hanyu-kun is likely to say**. So the collected tweets are first decomposed into parts of speech, and the 50 most frequent words are picked up for each of the two accounts, giving 100 words in total as the variables used for classification. (In practice there were duplicates between the two lists, so we ended up with 91 words.)
Only **nouns, verbs, and adjectives** were counted this time, with conjugated verbs and adjectives normalized to their base form.
I also defined, as shown below, lists of words to exclude from the counts: formal nouns that have little to do with each person's characteristic vocabulary, and fragments that probably come from how verbalized nouns and adjectives are segmented.
```python
ng_noun = ["thing", "of", "もof", "It", "When", "、", ",", "。", "¡", "(", ")", "."]
ng_verb = ["To do", "Is", "Become", "is there"]
ng_adjective = ["Yo"]
```
The *collections* package is handy for generating lists of (word, count) tuples, and I used natto as the binding between Python and MeCab.
```python
import collections
from natto import MeCab

mecab = MeCab()

def mostFrequentWords(file, num):
    words = collections.Counter()
    with open(file) as f:
        for line in f:
            # noun: surface="skate", feature="noun,General,*,*,*,*,skate,skate,skate"
            # verb: surface="Slip", feature="verb,Independence,*,*,One step,Imperfective form,Slipる,Slave,Slave"
            for node in mecab.parse(line, as_nodes=True):
                features = node.feature.split(",")
                if features[0] == "noun" and node.surface not in ng_noun:
                    words[node.surface] += 1
                elif features[0] == "verb" and features[6] not in ng_verb:
                    words[features[6]] += 1
                elif features[0] == "adjective" and features[6] not in ng_adjective:
                    words[features[6]] += 1
    return words.most_common(num)

words = {}
words["hanyu"] = mostFrequentWords("hanyu_train.txt", 50)
words["habu"] = mostFrequentWords("habu_train.txt", 50)

tpl = words["hanyu"] + words["habu"]
vocabulary = set([])
for word in tpl:
    vocabulary.add(word[0])
```
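As a quick sanity check, the intermediate structures look roughly like this (the counts shown here are made up for illustration; only the total of 91 comes from the actual run):

```python
# Illustrative output only; the real counts depend on the collected tweets
print(words["hanyu"][:3])  # e.g. [('skate', 42), ('4 rotations', 31), ('Great', 27)]
print(len(vocabulary))     # 91 unique words after merging the two top-50 lists
```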
Here is a brief explanation of the mathematical background.
First, the naive Bayes classifier is a probability-based classifier. What we want to know is, given a document (here, each tweet) *d*, which class *c* (Habu-san or Hanyu-kun) it most probably belongs to. This is the conditional probability *P(c|d)* of a class given a tweet. Since it is hard to obtain this posterior probability directly, it is computed via **Bayes' theorem**.
P(c|d) = \frac{P(c)P(d|c)}{P(d)}
We evaluate the right-hand side for each class, that is, for Habu-san and Hanyu-kun, and see which one the tweet more likely belongs to. Since the denominator *P(d)* is constant across classes once the classifier is built, **only the numerator needs to be calculated**.
**P(c)** — This time, 100 tweets were collected for each of Habu-san and Hanyu-kun; 70 per class are used as training data for building the classifier and the remaining 30 as test data for verifying its accuracy.
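Concretely, since both classes have the same number of training tweets, the prior (using the estimator derived further below) is simply:

p_{habu} = p_{hanyu} = \frac{70}{70+70} = 0.5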
**P(d|c)** — If you think about what this conditional probability *P(d|c)* means, you would have to estimate the probability of each tweet occurring for every possible combination of words Habu-san or Hanyu-kun might say, which is hopeless. Instead, it is expressed with a simplified model suited to document classification: for **the set of words *V* that Habu-san and Hanyu-kun are likely to say, we only consider whether each word is or is not contained in the tweet being classified**.
The distribution of a random variable that takes one of two values, such as says / does not say, is the **Bernoulli distribution**.
{P_{w,c}}^{\delta_{w,d}}(1-{P_{w,c}})^{1-\delta_{w,d}}
The exponent δ_{w,d} is called a delta function: it is 1 when the word *w* appears in document *d* and 0 otherwise. It is a clever way to select the right factor.
The model that assumes such a Bernoulli distribution for each word *w* in the set *V*, and represents *P(d|c)* as their product, is the **multivariate Bernoulli model**.
\prod_{w \in V}{P_{w,c}}^{\delta_{w,d}}(1-{P_{w,c}})^{1-\delta_{w,d}}
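As a minimal sketch of what this product means in code (the word probabilities and the tweet below are made-up values, not the trained ones):

```python
# Toy multivariate Bernoulli computation of P(d|c) for one hypothetical class c
p_word_toy = {"skate": 0.6, "Game": 0.1, "God's move": 0.05}  # assumed P_{w,c} for each w in V
tweet = ["skate", "Great"]                                    # bag of words of the tweet d

p_d_given_c = 1.0
for w, p in p_word_toy.items():
    # delta_{w,d} = 1 if the word appears in the tweet, 0 otherwise
    p_d_given_c *= p if w in tweet else (1 - p)

print(p_d_given_c)  # 0.6 * (1 - 0.1) * (1 - 0.05) ≈ 0.513
```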
Incidentally, two characteristics of the multivariate Bernoulli model can be read off from the above:

- The number of times a word occurs within a document is not taken into account.
- The fact that a word does *not* occur in a document also contributes to the probability.
In summary, it can be expressed as follows. We compute the right-hand side for each of Habu-san and Hanyu-kun, and the higher value **tells us which hypothesis is more likely to have generated the data**. This "product of the probabilities that observation *d* occurs under hypothesis *c*" is called the **likelihood**, and the approach of choosing the most plausible *c* by maximizing it is called the **maximum likelihood method**.
P(D) = {P(c)P(d|c)} = p_c\prod_{w \in V}({P_{w,c}}^{\delta_{w,d}}(1-{P_{w,c}})^{1-\delta_{w,d}})
Since deriving the formula is not the purpose here, I skip the intermediate steps, but taking the logarithm of the likelihood over all the training data and rearranging gives:
\log P(D) = \sum_c N_c \log p_c + \sum_c \sum_{w \in V} N_{w,c} \log p_{w,c} + \sum_c \sum_{w \in V} (N_c - N_{w,c}) \log(1 - p_{w,c})
You may wonder where the delta function went, but as noted above it is 1 when the word *w* appears in document *d* and 0 otherwise, so summing it over the training documents of a class simply yields N_{w,c}, the number of times word *w* co-occurs with class *c*. It may be hard to read, but the point is that the distribution is determined by two kinds of parameters, **p_{w,c} and p_c**.
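In other words (writing *D_c* for the set of training documents labeled with class *c*, a notation introduced here only for clarity):

\sum_{d \in D_c} \delta_{w,d} = N_{w,c}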
What we want, then, are the parameters that maximize the above log P(D) given the training data. Here we assume that **every tweet in the world is written by either Habu-san or Hanyu-kun**, so the class probabilities must satisfy the constraint below, namely that they sum to 1.
\sum_c p_c = 1
(This is also not the main subject, so I will skip the details.) This is an equality-constrained convex optimization problem: defining the Lagrangian by the method of Lagrange multipliers and setting the partial derivative with respect to each parameter to zero yields the maximizing values below.
p_{w,c} = \frac {N_{w,c}} {N_c} , p_c = \frac {N_c} {\sum_c N_c}
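For example, if the word "skate" appeared in 40 of Hanyu-kun's 70 training tweets (a hypothetical count), the maximum likelihood estimates would be:

p_{skate,hanyu} = \frac{40}{70} \approx 0.57 , \qquad p_{hanyu} = \frac{70}{70+70} = 0.5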
Now that we know how to find the parameters, it's time to implement it.
The following training data was generated from the morphologically analyzed tweets.
cls = ["habu", "hanyu"]
#It is an image because I can not show it for convenience. As mentioned above, the tweet is generated by morphological analysis.
vocabulary = ["skate", "Plushenko", "Game", "God's move"]
#Similarly
documents["habu"] = [["Title holder ","70", "Man", "Half", "Hanyu"],[...]]
documents["hanyu"] = [["Great","4 rotations", "Successful", "Winner"],[...]]
From this data and the formulas derived above, we compute the probability p_{w,c} that each word occurs in each class.
```python
def train(cls, vocabulary, documents):
    # Store the learned parameters at module level so classify() below can use them
    global p_cls, p_word
    # Number of training documents per class
    n_cls = {}
    total = 0.0
    for c in cls:
        n_cls[c] = len(documents[c])
        total += n_cls[c]
    # Prior probability of each class
    p_cls = {}
    for c in cls:
        p_cls[c] = n_cls[c] / total
    # Number of documents in each class that contain each word
    n_word = {}
    for c in cls:
        n_word[c] = collections.defaultdict(int)
        for d in documents[c]:
            for word in vocabulary:
                if word in d:
                    n_word[c][word] += 1
    # Probability of each word occurring in each class (+1/+2 smoothing, explained below)
    p_word = {}
    for c in cls:
        p_word[c] = {}
        for word in vocabulary:
            p_word[c][word] = \
                (n_word[c][word] + 1.0) / (n_cls[c] + 2)
```
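The original snippet does not show a call to train; with the illustrative data above it would look something like this (the printed values follow from the placeholder documents, not from real tweets):

```python
# Hypothetical usage with the illustrative cls / vocabulary / documents above
train(cls, vocabulary, documents)
print(p_cls)                      # {'habu': 0.5, 'hanyu': 0.5} -- two placeholder documents per class
print(p_word["hanyu"]["skate"])   # (0 + 1) / (2 + 2) = 0.25 -- "skate" appears in no placeholder hanyu document
```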
A small digression: in the part that computes the word occurrence probability for each class, 1 is added to the numerator and 2 to the denominator. Since the likelihood is a product of probabilities, if a word in the vocabulary *V* happened never to appear in a class's training tweets, the whole product would collapse to 0; the extra counts prevent this. (Because the product becomes a very small value, the implementation takes logarithms and works with sums instead, and since log 0 is undefined, the program would otherwise die with a math domain error.)
Formally, this amounts to assuming a prior distribution over the word occurrence probabilities (a Dirichlet distribution) under which values of exactly 0 are hard to obtain. Because it softens the extreme values that maximum likelihood estimation tends to produce, this is called **smoothing**.
This approach of maximizing the probability after the data is given, taking the prior distribution into account rather than relying only on the raw counts, is called **MAP estimation**.
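As a minimal numeric illustration using the +1/+2 adjustment from the code above: suppose a vocabulary word never appears in a class's 70 training tweets. The maximum likelihood estimate would be 0, while the smoothed estimate stays strictly positive:

p_{w,c} = \frac{0}{70} = 0 \quad \rightarrow \quad p_{w,c} = \frac{0 + 1}{70 + 2} \approx 0.014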
Now that we have finally built a classifier, let's run it.
Using the trained parameters, here is a function that classifies whether a given tweet was written by Habu-san or Hanyu-kun.
```python
import math

def classify(data):
    # Compute log P(D) for each class
    pp = {}
    for c in cls:
        pp[c] = math.log(p_cls[c])
        for word in vocabulary:
            if word in data:
                pp[c] += math.log(p_word[c][word])
            else:
                pp[c] += math.log(1 - p_word[c][word])
    # Return the class with the largest log P(D), together with its value
    maxcls = max(pp, key=pp.get)
    return (maxcls, pp[maxcls])
```
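A quick hypothetical call, assuming train has already been run (the bag of words here is made up):

```python
# Hypothetical usage: classify a made-up bag of words
label, logp = classify(["skate", "4 rotations", "Successful"])
print(label, logp)  # the class with the larger log-probability wins
```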
Of the collected tweets, let's feed the 30 tweets per person set aside for accuracy verification into the classifier.
```python
def test(data, label):
    i = 0.0
    for tweet in data:
        if classify(tweet)[0] == label:
            i += 1
    return i / len(data)

# bags_of_words returns a two-dimensional array: each tweet decomposed into parts of speech
test(bags_of_words("hanyu_test.txt"), "hanyu")
test(bags_of_words("habu_test.txt"), "habu")
```
| Class | ① Test tweets | ② Correctly classified | Accuracy (②/①) |
|---|---|---|---|
| Habu-san | 30 | 28 | 93.33% |
| Hanyu-kun | 30 | 28 | 93.33% |
The classifier discriminates with fairly high accuracy, but this seems to be because the tweets of Habu-san and Hanyu-kun each contain a lot of distinctive vocabulary. Classification is more interesting when the classes share the same vocabulary but differ in word frequencies; in that respect, this may not have been a very demanding test set.
Next, I would like to try analyzing images of Habu-san and Hanyu-kun.
References:
- Introduction to Machine Learning for Language Processing
- Difference between Yuzuru Hanyu and Yoshiharu Habu