To deepen my understanding of machine learning, I implemented a naive Bayes classifier, a classic way to automatically build a classification model from data. Image analysis with services such as the Cloud Vision API is popular these days, but the bar seemed high for a beginner, so I started with natural language processing.
This time, I used the Twitter API to collect quotes from bot accounts and implemented a **Hanyu classifier** that distinguishes tweets by Habu-san (the shogi player Yoshiharu Habu) from tweets by Hanyu-kun (the figure skater Yuzuru Hanyu).
The API client is implemented in Ruby, and the classifier is implemented in Python. MeCab is used for morphological analysis.
Properly speaking, both share the same written surname (羽生), but for convenience please allow me to refer to the shogi player as Habu-san and to the figure skater as Hanyu-kun.
The API client is implemented as follows using the Twitter Ruby Gem. The Twitter API imposes rate limits on the number of requests per time window, so if you want to collect a lot of tweets you need to pause for roughly 15 minutes between batches. Take a coffee break from time to time.
```ruby
require 'twitter'

client = Twitter::REST::Client.new do |config|
  config.consumer_key        = 'XXX'
  config.consumer_secret     = 'YYY'
  config.access_token        = 'hoge'
  config.access_token_secret = 'fuga'
end

tweets = []
client.user_timeline('TwitterUserID', count: 150).each_with_index do |tl, i|
  tweet = client.status(tl.id).text
  # Eliminate duplicates before printing
  unless tweets.include?(tweet)
    tweets << tweet
    puts tweet
  end
end
```
Now, let's process the data we just collected. As described later, the idea is to quantify and classify each tweet by **whether it contains words that Habu-san or Hanyu-kun is likely to say**. So the collected tweets are first decomposed into parts of speech, and the 50 most frequent words are picked up for each of the two accounts, giving 100 words in total as the variables used for classification. (In practice there were duplicates between the two lists, so we ended up with 91 words.)
Only **nouns, verbs, and adjectives** were counted this time, with conjugated verbs and adjectives normalized to their base form.
I also defined, as shown below, lists of words to exclude from the counts: formal nouns that have little to do with each person's characteristic vocabulary, and fragments that probably come from how verbalized nouns and adjectives are segmented.
```python
ng_noun = ["thing", "of", "もof", "It", "When", "、", ",", "。", "¡", "(", ")", "."]
ng_verb = ["To do", "Is", "Become", "is there"]
ng_adjective = ["Yo"]
```
The *collections* package is handy for generating lists of (word, count) tuples, and I used natto as the binding between Python and MeCab.
```python
import collections
from natto import MeCab

mecab = MeCab()

def mostFrequentWords(file, num):
    words = collections.Counter()
    with open(file) as f:
        for line in f:
            # noun: surface="skate", feature="noun,General,*,*,*,*,skate,skate,skate"
            # verb: surface="Slip", feature="verb,Independence,*,*,One step,Imperfective form,Slipる,Slave,Slave"
            for node in mecab.parse(line, as_nodes=True):
                features = node.feature.split(",")
                if features[0] == "noun" and node.surface not in ng_noun:
                    words[node.surface] += 1
                elif features[0] == "verb" and features[6] not in ng_verb:
                    words[features[6]] += 1
                elif features[0] == "adjective" and features[6] not in ng_adjective:
                    words[features[6]] += 1
    return words.most_common(num)

words = {}
words["hanyu"] = mostFrequentWords("hanyu_train.txt", 50)
words["habu"] = mostFrequentWords("habu_train.txt", 50)

tpl = words["hanyu"] + words["habu"]
vocabulary = set([])
for word in tpl:
    vocabulary.add(word[0])
```
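As a quick sanity check, the intermediate structures look roughly like this (the counts shown here are made up for illustration; only the total of 91 comes from the actual run):

```python
# Illustrative output only; the real counts depend on the collected tweets
print(words["hanyu"][:3])  # e.g. [('skate', 42), ('4 rotations', 31), ('Great', 27)]
print(len(vocabulary))     # 91 unique words after merging the two top-50 lists
```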
Here is a brief explanation of the mathematical background.
First, the naive Bayes classifier is a probability-based classifier. What we want to know is, given a document (here, each tweet) *d*, which class *c* (Habu-san or Hanyu-kun) it most probably belongs to. This is the conditional probability *P(c|d)* of a class given a tweet. Since it is hard to obtain this posterior probability directly, it is computed via **Bayes' theorem**.
P(c|d) = \frac{P(c)P(d|c)}{P(d)}
We evaluate the right-hand side for each class, that is, for Habu-san and Hanyu-kun, and see which one the tweet more likely belongs to. Since the denominator *P(d)* is constant across classes once the classifier is built, **only the numerator needs to be calculated**.
**P(c)** — This time, 100 tweets were collected for each of Habu-san and Hanyu-kun; 70 per class are used as training data for building the classifier and the remaining 30 as test data for verifying its accuracy.
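Concretely, since both classes have the same number of training tweets, the prior (using the estimator derived further below) is simply:

p_{habu} = p_{hanyu} = \frac{70}{70+70} = 0.5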
**P(d|c)** — If you think about what this conditional probability *P(d|c)* means, you would have to estimate the probability of each tweet occurring for every possible combination of words Habu-san or Hanyu-kun might say, which is hopeless. Instead, it is expressed with a simplified model suited to document classification: for **the set of words *V* that Habu-san and Hanyu-kun are likely to say, we only consider whether each word is or is not contained in the tweet being classified**.
The distribution of a random variable that takes one of two values, such as says / does not say, is the **Bernoulli distribution**.
{P_{w,c}}^{\delta_{w,d}}(1-{P_{w,c}})^{1-\delta_{w,d}}
The exponent δ_{w,d} is called a delta function: it is 1 when the word *w* appears in document *d* and 0 otherwise. It is a clever way to select the right factor.
The model that assumes such a Bernoulli distribution for each word *w* in the set *V*, and represents *P(d|c)* as their product, is the **multivariate Bernoulli model**.
\prod_{w \in V}{P_{w,c}}^{\delta_{w,d}}(1-{P_{w,c}})^{1-\delta_{w,d}}
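As a minimal sketch of what this product means in code (the word probabilities and the tweet below are made-up values, not the trained ones):

```python
# Toy multivariate Bernoulli computation of P(d|c) for one hypothetical class c
p_word_toy = {"skate": 0.6, "Game": 0.1, "God's move": 0.05}  # assumed P_{w,c} for each w in V
tweet = ["skate", "Great"]                                    # bag of words of the tweet d

p_d_given_c = 1.0
for w, p in p_word_toy.items():
    # delta_{w,d} = 1 if the word appears in the tweet, 0 otherwise
    p_d_given_c *= p if w in tweet else (1 - p)

print(p_d_given_c)  # 0.6 * (1 - 0.1) * (1 - 0.05) ≈ 0.513
```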
Incidentally, two characteristics of the multivariate Bernoulli model can be read off from the above:

- The number of times a word occurs within a document is not taken into account.
- The fact that a word does *not* occur in a document also contributes to the probability.
In summary, it can be expressed as follows. We compute the right-hand side for each of Habu-san and Hanyu-kun, and the higher value **tells us which hypothesis is more likely to have generated the data**. This "product of the probabilities that observation *d* occurs under hypothesis *c*" is called the **likelihood**, and the approach of choosing the most plausible *c* by maximizing it is called the **maximum likelihood method**.
P(D) = {P(c)P(d|c)} = p_c\prod_{w \in V}({P_{w,c}}^{\delta_{w,d}}(1-{P_{w,c}})^{1-\delta_{w,d}})
Since deriving the formula is not the purpose here, I skip the intermediate steps, but taking the logarithm of the likelihood over all the training data and rearranging gives:
\log P(D) = \sum_c N_c \log p_c + \sum_c \sum_{w \in V} N_{w,c} \log p_{w,c} + \sum_c \sum_{w \in V} (N_c - N_{w,c}) \log(1 - p_{w,c})
You may wonder where the delta function went, but as noted above it is 1 when the word *w* appears in document *d* and 0 otherwise, so summing it over the training documents of a class simply yields N_{w,c}, the number of times word *w* co-occurs with class *c*. It may be hard to read, but the point is that the distribution is determined by two kinds of parameters, **p_{w,c} and p_c**.
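In other words (writing *D_c* for the set of training documents labeled with class *c*, a notation introduced here only for clarity):

\sum_{d \in D_c} \delta_{w,d} = N_{w,c}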
What we want, then, are the parameters that maximize the above log P(D) given the training data. Here we assume that **every tweet in the world is written by either Habu-san or Hanyu-kun**, so the class probabilities must satisfy the constraint below, namely that they sum to 1.
\sum_c p_c = 1
(This is also not the main subject, so I will skip the details.) This is an equality-constrained convex optimization problem: defining the Lagrangian by the method of Lagrange multipliers and setting the partial derivative with respect to each parameter to zero yields the maximizing values below.
p_{w,c} = \frac {N_{w,c}} {N_c} , p_c = \frac {N_c} {\sum_c N_c}
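For example, if the word "skate" appeared in 40 of Hanyu-kun's 70 training tweets (a hypothetical count), the maximum likelihood estimates would be:

p_{skate,hanyu} = \frac{40}{70} \approx 0.57 , \qquad p_{hanyu} = \frac{70}{70+70} = 0.5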
Now that we know how to find the parameters, it's time to implement it.
The following training data was generated from the morphologically analyzed tweets.
cls = ["habu", "hanyu"]
#It is an image because I can not show it for convenience. As mentioned above, the tweet is generated by morphological analysis.
vocabulary = ["skate", "Plushenko", "Game", "God's move"]
#Similarly
documents["habu"] = [["Title holder ","70", "Man", "Half", "Hanyu"],[...]]
documents["hanyu"] = [["Great","4 rotations", "Successful", "Winner"],[...]]
From this data and the formulas derived above, we compute the probability p_{w,c} that each word occurs in each class.
```python
def train(cls, vocabulary, documents):
    # Store the learned parameters at module level so classify() below can use them
    global p_cls, p_word
    # Number of training documents per class
    n_cls = {}
    total = 0.0
    for c in cls:
        n_cls[c] = len(documents[c])
        total += n_cls[c]
    # Prior probability of each class
    p_cls = {}
    for c in cls:
        p_cls[c] = n_cls[c] / total
    # Number of documents in each class that contain each word
    n_word = {}
    for c in cls:
        n_word[c] = collections.defaultdict(int)
        for d in documents[c]:
            for word in vocabulary:
                if word in d:
                    n_word[c][word] += 1
    # Probability of each word occurring in each class (+1/+2 smoothing, explained below)
    p_word = {}
    for c in cls:
        p_word[c] = {}
        for word in vocabulary:
            p_word[c][word] = \
                (n_word[c][word] + 1.0) / (n_cls[c] + 2)
```
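The original snippet does not show a call to train; with the illustrative data above it would look something like this (the printed values follow from the placeholder documents, not from real tweets):

```python
# Hypothetical usage with the illustrative cls / vocabulary / documents above
train(cls, vocabulary, documents)
print(p_cls)                      # {'habu': 0.5, 'hanyu': 0.5} -- two placeholder documents per class
print(p_word["hanyu"]["skate"])   # (0 + 1) / (2 + 2) = 0.25 -- "skate" appears in no placeholder hanyu document
```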
A small digression: in the part that computes the word occurrence probability for each class, 1 is added to the numerator and 2 to the denominator. Since the likelihood is a product of probabilities, if a word in the vocabulary *V* happened never to appear in a class's training tweets, the whole product would collapse to 0; the extra counts prevent this. (Because the product becomes a very small value, the implementation takes logarithms and works with sums instead, and since log 0 is undefined, the program would otherwise die with a math domain error.)
Formally, this amounts to assuming a prior distribution over the word occurrence probabilities (a Dirichlet distribution) under which values of exactly 0 are hard to obtain. Because it softens the extreme values that maximum likelihood estimation tends to produce, this is called **smoothing**.
This approach of maximizing the probability after the data is given, taking the prior distribution into account rather than relying only on the raw counts, is called **MAP estimation**.
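As a minimal numeric illustration using the +1/+2 adjustment from the code above: suppose a vocabulary word never appears in a class's 70 training tweets. The maximum likelihood estimate would be 0, while the smoothed estimate stays strictly positive:

p_{w,c} = \frac{0}{70} = 0 \quad \rightarrow \quad p_{w,c} = \frac{0 + 1}{70 + 2} \approx 0.014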
Now that we have finally built a classifier, let's run it.
Using the trained parameters, here is a function that classifies whether a given tweet was written by Habu-san or Hanyu-kun.
```python
import math

def classify(data):
    # Compute log P(D) for each class
    pp = {}
    for c in cls:
        pp[c] = math.log(p_cls[c])
        for word in vocabulary:
            if word in data:
                pp[c] += math.log(p_word[c][word])
            else:
                pp[c] += math.log(1 - p_word[c][word])
    # Return the class with the largest log P(D), together with its value
    maxcls = max(pp, key=pp.get)
    return (maxcls, pp[maxcls])
```
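A quick hypothetical call, assuming train has already been run (the bag of words here is made up):

```python
# Hypothetical usage: classify a made-up bag of words
label, logp = classify(["skate", "4 rotations", "Successful"])
print(label, logp)  # the class with the larger log-probability wins
```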
Of the collected tweets, let's feed the 30 tweets per person set aside for accuracy verification into the classifier.
```python
def test(data, label):
    i = 0.0
    for tweet in data:
        if classify(tweet)[0] == label:
            i += 1
    return i / len(data)

# bags_of_words returns a two-dimensional array: each tweet decomposed into parts of speech
test(bags_of_words("hanyu_test.txt"), "hanyu")
test(bags_of_words("habu_test.txt"), "habu")
```
| Class | ① Test tweets | ② Correctly classified | Accuracy (②/①) |
|---|---|---|---|
| Habu-san | 30 | 28 | 93.33% |
| Hanyu-kun | 30 | 28 | 93.33% |
The classifier discriminates with fairly high accuracy, but this seems to be because the tweets of Habu-san and Hanyu-kun each contain a lot of distinctive vocabulary. Classification is more interesting when the classes share the same vocabulary but differ in word frequencies; in that respect, this may not have been a very demanding test set.
Next, I would like to try analyzing images of Habu-san and Hanyu-kun.
References:
- Introduction to Machine Learning for Language Processing
- Difference between Yuzuru Hanyu and Yoshiharu Habu