Overview

To study machine learning algorithms When I read "Collective Intelligence Programming" (ISBN-13: 978-4873113647) I wasn't sure what the formulas in the program were doing, so Make a note of what you have looked up.

~~ It was heavy math, so if I read it lightly, I might not have to follow it ... ~~

I studied statistics again after a long time, so There may be many mistakes. I would appreciate it if you could point out.

Chapter 6 Document Filtering

6.6.2 Integrate probabilities

――This section is a summary of email spam judgment --Assuming that each keyword (token) has a spam-like probability. --From the probability of each keyword, use the Fisher method to calculate whether it is spam when viewed as a whole document

Algorithm ⇒ The author has published on github. I will leave a note of the points that were difficult to understand personally ((1), (2) and (3) below).

  def fisherprob(self,item,cat):
    p=1
    features=self.getfeatures(item)
    for f in features:
      p*=(self.weightedprob(f,cat,self.cprob))
    
    # (1)・ ・ ・ What is fscore calculated?
    fscore=-2*math.log(p)
    
    return self.invchi2(fscore,len(features)*2)

  # (2)・ ・ ・ What is invchi2 calculating?
  def invchi2(self,chi, df):
    # (3)・ ・ ・ Can the inverse function be calculated?
    m = chi / 2.0
    sum = term = math.exp(-m)
    for i in range(1, df//2):
        term *= m / i
        sum += term
    return min(sum, 1.0)

-(1) What is fscore calculated? -According to fisher's method, multiply by k independent p-values (like probabilities) and log If you take and multiply by -2, you can calculate the p-value of the χ-square distribution with the degree of freedom 2k. ―― ~~ I haven't followed the proof yet ~~ --For p-value, please refer to Statistical method χ-square test. -[Click here for details on the chi-square distribution](https://ja.wikipedia.org/wiki/%E3%82%AB%E3%82%A4%E4%BA%8C%E4%B9%97%E5% 88% 86% E5% B8% 83)

-(2) What is ʻinvchi2calculating? --Inverse function of the cumulative distribution function of the chi-square distribution --The chi-square test often finds thep-value and makes a judgment based on the case where it is less than 0.05 (the chi-square value is sufficiently high), but it has a high score in the theme of spam classification. = It seems that the inverse function is used to return to the χ-square value (value that follows the χ-square distribution) because we want to make it spam. --Arguments -- chi・・・p-value ( 0 ≤ chi ≤ 1) --df・・・ χ-square distribution degrees of freedom --Output --Chi-square value (value according to the chi-square distribution). However, it does not exceed1. -(3) Can the inverse function be calculated? --As you can see from the cumulative density function of the chi-square distribution, the inverse function cannot be calculated. --To calculate the inverse function, [Inverse Function Theorem](https://ja.wikipedia.org/wiki/%E9%80%86%E5%87%BD%E6%95%B0%E5%AE% It seems that 9A% E7% 90% 86) is used to bring it into a differential equation, and a power series (like an nth-order equation) is used to obtain an approximate solution. ――It seems that the method of calculating the inverse function of the cumulative distribution function with high accuracy is being researched every day, so it seems that it was implemented by an algorithm that adopted an accurate calculation method. ――You can search various methods by searching the paper with keywords such as Quantile Chi-Square. --ʻInverse Chi-Square is not recommended as it will hit something similar and non-similar, the inverse chi-square distribution.

Related information memo

--Is the Fisher method called the Robinson-Fisher method in Japanese papers? --The Fisher method is obtained from the Robinson method.

[PYTHON] Understanding memo of collective intelligence programming

Overview

Chapter 6 Document Filtering

6.6.2 Integrate probabilities

Related information memo