[PYTHON] I tried to classify Habu-san and Hanyu-kun with natural language processing × naive Bayes classifier

Introduction

To deepen my understanding of machine learning, I implemented a naive Bayes classifier, a classic method for automatically building a classification model from data. Image analysis services such as the Cloud Vision API are popular these days, but the bar seemed high for a beginner, so as a first step I went with natural language processing.

This time, I used the Twitter API to collect quotes from bot accounts and implemented a **Habu/Hanyu classifier** that tells Habu-san and Hanyu-kun apart.

The API client is implemented in Ruby, and the classifier is implemented in Python. MeCab is used for morphological analysis.

Also, although both names should properly be read the same way (羽生), for convenience please allow me to refer to the figure skater Yuzuru Hanyu as "Hanyu-kun" and to the shogi player Yoshiharu Habu as "Habu-san".

Collect Tweet data

The API client is implemented as follows using the Twitter Ruby Gem. The Twitter API imposes Rate Limits that cap the number of requests per time window, so if you want to fetch a lot of tweets you need to wait roughly 15 minutes between batches. Please take a coffee break from time to time.

API authentication settings


require 'twitter'

client = Twitter::REST::Client.new do |config|
  config.consumer_key = 'XXX'
  config.consumer_secret = 'YYY'
  config.access_token = 'hoge'
  config.access_token_secret = 'fuga'
end

Get Tweets

tweets = []

client.user_timeline('TwitterUserID', { count: 150 }).each do |tl|
  tw = client.status(tl.id)
  tweet = tw.text

  # Eliminate duplicates
  unless tweets.include?(tweet)
    tweets << tweet
    puts tweet
  end
end

Morphological analysis

Now, let's process the data we just collected. As described later, classification is based on **whether a tweet contains words that Habu-san or Hanyu-kun is likely to say**, so I first decomposed the collected tweets into parts of speech and then picked up the top 50 most frequent words for each of the two accounts, giving 100 words in total as the variables used for classification. (There were actually some duplicates, so the final count was 91 words.)

Part of speech decomposition of tweets with MeCab

First of all, only **nouns, verbs, and adjectives** were counted this time. (Conjugated verbs and adjectives were normalized to their base form.)

Excluded word list

As shown below, I set up lists of words that are excluded from the counts: formal nouns that have nothing to do with either person's distinctive vocabulary, and fragments that are probably produced when nominalized verbs and adjectives get split during tokenization.

ng_noun = ["thing", "of", "もの", "It", "When", "、", ",", "。", "「", "(", ")", "."]
ng_verb = ["To do", "Is", "Become", "is there"]
ng_adjective = ["Yo"]

Counting frequent words

The *collections* module is handy for generating a counted list of (word, count) tuples. I also used natto as the binding between Python and MeCab.


import collections

from natto import MeCab

mecab = MeCab()

def mostFrequentWords(file, num):
  words = collections.Counter()

  f = open(file)
  line = f.readline()
  while line:
    #noun: surface="skate", feature="noun,General,*,*,*,*,skate,skate,skate"
    #verb: surface is the conjugated form, feature[6] holds the base form (e.g. "Slip")
    for node in mecab.parse(line, as_nodes=True):
      features = node.feature.split(",")

      if features[0] == "noun" and node.surface not in ng_noun:
        words[node.surface] += 1
      elif features[0] == "verb" and features[6] not in ng_verb:
        words[features[6]] += 1
      elif features[0] == "adjective" and features[6] not in ng_adjective:
        words[features[6]] += 1

    line = f.readline()

  f.close()
  return words.most_common(num)


words["hanyu"] = mostFrequentWords("hanyu_train.txt", 50)
words["habu"] = mostFrequentWords("habu_train.txt", 50)

tpl = words["hanyu"] + words["habu"]
vocabulary = set([])
for word in tpl:
  vocabulary.add(word[0])
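
As a quick sanity check (my own small addition, not part of the original script), the size of the deduplicated vocabulary can be inspected; the post reports 91 words:

#Check the number of unique words used as classification variables
print(len(vocabulary))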

Naive Bayes classifier

Here is a brief explanation of the mathematical background.

Bayes' theorem

First, the naive Bayes classifier is a probability-based classifier. What we want to know is, given a certain document (here, each tweet) *d*, which class *c* (Habu-san or Hanyu-kun) it belongs to with the highest probability. This can be expressed as the conditional probability *P(c|d)* of a class given a tweet. Since this posterior probability is difficult to obtain directly, it is computed using **Bayes' theorem**.

P(c|d) = \frac{P(c)P(d|c)}{P(d)}

We compute the right-hand side for each class, that is, for Habu-san and for Hanyu-kun, and see which class the tweet most likely belongs to. Since the denominator *P(d)* is constant regardless of the class once the classifier has been built, **only the numerator needs to be calculated**.
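
In other words (writing it out in my own notation), the predicted class is simply the one that maximizes that numerator:

\hat{c} = \arg\max_{c} P(c|d) = \arg\max_{c} P(c)P(d|c)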

*P(c)* is the prior probability of each class. This time, I collected 100 tweets each for Habu-san and Hanyu-kun; 70 of each were used as training data to build the classifier, and the remaining 30 as test data to verify its accuracy.
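
With this even 70-tweet/70-tweet split, the prior estimated from the training data works out the same for both classes:

p_{habu} = p_{hanyu} = \frac{70}{70 + 70} = 0.5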

*P(d|c)*: if you think about what this conditional probability means, you would have to find the probability that each tweet occurs for every possible combination of words Habu-san or Hanyu-kun might say, which is impossible. It is therefore expressed with a simplified model suited to document classification: for the set of words *V* that Habu-san and Hanyu-kun are likely to say, **we only consider whether each word is or is not contained in the tweet being classified**.

Multivariate Bernoulli model

The distribution of a random variable that takes two values, such as says / does not say, is the **Bernoulli distribution**.

{P_{w,c}}^{\delta_{w,d}}(1-{P_{w,c}})^{1-\delta_{w,d}}

The exponent δ_{w,d} is an indicator (delta) function that is 1 when the word *w* appears in the document *d* and 0 otherwise: if the word is said, the factor becomes P_{w,c}, and if not, it becomes 1 - P_{w,c}. A neatly designed trick.

Here we consider such a Bernoulli distribution for each word *w* belonging to the set *V*; taking their product gives the **multivariate Bernoulli model** representation of *P(d|c)*.

\prod_{w \in V}{P_{w,c}}^{\delta_{w,d}}(1-{P_{w,c}})^{1-\delta_{w,d}}

Incidentally, two characteristics of the multivariate Bernoulli model can be read off from the above:

- The number of times a word occurs within a document is not taken into account
- The fact that a word "does not occur" in a document is emphasized

Maximum likelihood method

To summarize, the model can be written as follows, so we compute the right-hand side for each of Habu-san and Hanyu-kun; the higher value **determines which hypothesis is more likely to have generated the data**. This "probability that the observation *d* occurs under hypothesis *c*" is called the **likelihood**, and the approach of finding the most plausible *c*, i.e. the one with the maximum likelihood, is called the **maximum likelihood method**.

P(c)P(d|c) = p_c\prod_{w \in V}{P_{w,c}}^{\delta_{w,d}}(1-{P_{w,c}})^{1-\delta_{w,d}}

Writing out the derivation is not the purpose here, so the intermediate steps are collapsed, but taking the product of this over all training documents and then the logarithm of the resulting data likelihood P(D) gives:

\log P(D) = \sum_c N_c \log p_c + \sum_c \sum_{w \in V} N_{w,c} \log p_{w,c} + \sum_c \sum_{w \in V} (N_c - N_{w,c}) \log(1 - p_{w,c})

It becomes the above. You might wonder where the delta function went: as mentioned above, it is 1 when the word *w* appears in a document and 0 otherwise, so summing it over the training documents of class *c* simply gives *N_{w,c}*, the number of co-occurrences of the word *w* and the class *c*. It may be hard to read, but the point is that the distribution is determined by two kinds of parameters, **p_{w,c} and p_c**.

At classification time, we just need to find the *c* that maximizes this (log) joint probability for the given tweet; the parameters themselves are estimated by maximizing log P(D) over the training data. Here it is assumed that **every tweet in the world is written by either Habu-san or Hanyu-kun**, so the constraint expressed by the following formula must hold: the probabilities of belonging to the classes sum to 1.

\sum_c p_c = 1

(This is also not the main subject, so the derivation is collapsed.) This is an equality-constrained convex optimization problem. Defining the Lagrangian by the method of Lagrange multipliers and setting the partial derivative with respect to each parameter to zero, the maximum is attained at the following values.


p_{w,c} = \frac {N_{w,c}} {N_c} , p_c = \frac {N_c} {\sum_c N_c}
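
For example, with made-up numbers purely for illustration: if 35 of Hanyu-kun's 70 training tweets contained the word "skate", the maximum likelihood estimate would be

p_{skate,hanyu} = \frac{N_{skate,hanyu}}{N_{hanyu}} = \frac{35}{70} = 0.5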

Classifier implementation

Now that we know how to find the parameters, it's time to implement it.

Data generation

By the way, the following training data was generated from the tweets that were subjected to morphological analysis.


cls = ["habu", "hanyu"]

#Shown as an illustration only, since the actual data cannot be posted here.
#As described above, it is generated from the morphologically analyzed tweets.
vocabulary = ["skate", "Plushenko", "Game", "God's move"]

#Likewise illustrative
documents = {}
documents["habu"] = [["Title holder", "70", "Man", "Half", "Hanyu"], [...]]
documents["hanyu"] = [["Great", "4 rotations", "Successful", "Winner"], [...]]

Calculating the word occurrence probability *p_{w,c}* for each class

From the data above and the formulas derived earlier, we calculate the probability *p_{w,c}* that each word occurs in each class.

def train(cls, vocabulary, documents):

  #Number of training documents in each class
  n_cls = {}
  total = 0.0
  for c in cls:
    n_cls[c] = len(documents[c])
    total += n_cls[c]

  #Prior probability of each class
  p_cls = {}
  for c in cls:
    p_cls[c] = n_cls[c] / total

  #Number of documents in each class that contain each word
  n_word = {}
  for c in cls:
    n_word[c] = {word: 0 for word in vocabulary}
    for d in documents[c]:
      for word in vocabulary:
        if word in d:
          n_word[c][word] += 1

  #Probability of word occurrence for each class (smoothed, see below)
  p_word = {}
  for c in cls:
    p_word[c] = {}
    for word in vocabulary:
      p_word[c][word] = \
        (n_word[c][word] + 1.0) / (n_cls[c] + 2)

  return p_cls, p_word

p_cls, p_word = train(cls, vocabulary, documents)
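
As a quick way to peek at the learned parameters (my own check, reusing the illustrative vocabulary above):

#Class priors: equal when both classes have the same number of training tweets
print(p_cls["habu"], p_cls["hanyu"])

#Smoothed probability that a training tweet of Hanyu-kun contains "skate"
print(p_word["hanyu"]["skate"])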

Smoothing

This is a slight digression. In the part that computes the probability of a word occurring for each class, 1 is added to the numerator and 2 to the denominator. Because the likelihood is a product of probabilities, if a word in the vocabulary *V* happened never to appear in the tweets of a class, the whole product would become 0; the added terms prevent this. (Since the product becomes a very small value, the implementation takes logarithms and works with sums instead, and since log 0 is undefined, the program would otherwise die with a math domain error.)
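
Concretely, the smoothed estimate computed in train() above is:

p_{w,c} = \frac{N_{w,c} + 1}{N_c + 2}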

Formally, this corresponds to assuming a prior over the word occurrence probabilities, a Dirichlet distribution, under which values of exactly 0 are hard to obtain. This is called **smoothing**, because it softens the extreme values that maximum likelihood estimation tends to produce.

This approach of maximizing the probability after the data is given while taking the prior distribution into account, rather than relying on the raw data counts alone, is called **MAP estimation**.

Execution result

Now that we have finally built a classifier, let's run it.

Classification function

A function that uses the constructed classifier to decide which of the two, Habu-san or Hanyu-kun, wrote a given tweet.


import math

def classify(data):
  #Compute log P(D) for each class
  pp = {}
  for c in cls:
    pp[c] = math.log(p_cls[c])
    for word in vocabulary:
      if word in data:
        pp[c] += math.log(p_word[c][word])
      else:
        pp[c] += math.log(1 - p_word[c][word])

  #Find which of the computed log P(D) values is the largest
  maxcls = None
  maxpp = None
  for c in cls:
    if maxpp is None or maxpp < pp[c]:
      maxpp = pp[c]
      maxcls = c

  return (maxcls, maxpp)
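
As a usage sketch (with a hypothetical input made of the illustrative words above, not a real tweet from the data), the classifier takes the word list of a single tweet:

#Hypothetical example: classify the word list of one tweet
label, logp = classify(["skate", "4 rotations", "Successful"])
print(label, logp)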

Model accuracy verification

Of the collected tweets, let's feed the classifier the ones set aside for accuracy verification: 30 tweets for each of the two accounts.

def test(data, label):
  i = 0.0
  for tweet in data:
    if classify(tweet)[0] == label:
      i += 1
  return (i / len(data))

# bags_of_words returns a two-dimensional array: the part-of-speech decomposition of each tweet
test(bags_of_words("hanyu_test.txt"), "hanyu")
test(bags_of_words("habu_test.txt"), "habu")
| class | ① Number of test tweets | ② Number of correct answers | Accuracy (②/①) |
|---|---|---|---|
| Habu-san | 30 | 28 | 93.33% |
| Hanyu-kun | 30 | 28 | 93.33% |

It classifies with fairly high accuracy, but this seems to be because the tweets of Habu-san and Hanyu-kun each contained a lot of vocabulary unique to them. Classification is really meaningful for data whose distributions share the same vocabulary but differ in frequency, so in that respect the test data may not have been a very good benchmark.

From now on

Next, I would like to try image analysis of Habu-san and Hanyu-kun.

References

- Introduction to Machine Learning for Language Processing
- Difference between Yuzuru Hanyu and Yoshiharu Habu
