To study machine learning algorithms When I read "Collective Intelligence Programming" (ISBN-13: 978-4873113647) I wasn't sure what the formulas in the program were doing, so Make a note of what you have looked up.
~~ It was heavy math, so if I read it lightly, I might not have to follow it ... ~~
I studied statistics again after a long time, so There may be many mistakes. I would appreciate it if you could point out.
――This section is a summary of email spam judgment --Assuming that each keyword (token) has a spam-like probability. --From the probability of each keyword, use the Fisher method to calculate whether it is spam when viewed as a whole document
Algorithm ⇒ The author has published on github.
I will leave a note of the points that were difficult to understand personally ((1)
, (2)
and (3)
below).
def fisherprob(self,item,cat):
p=1
features=self.getfeatures(item)
for f in features:
p*=(self.weightedprob(f,cat,self.cprob))
# (1)・ ・ ・ What is fscore calculated?
fscore=-2*math.log(p)
return self.invchi2(fscore,len(features)*2)
# (2)・ ・ ・ What is invchi2 calculating?
def invchi2(self,chi, df):
# (3)・ ・ ・ Can the inverse function be calculated?
m = chi / 2.0
sum = term = math.exp(-m)
for i in range(1, df//2):
term *= m / i
sum += term
return min(sum, 1.0)
-(1) What is fscore
calculated?
-According to fisher's method, multiply by k
independent p-value
s (like probabilities) and log If you take
and multiply by -2
, you can calculate the p-value
of the χ-square distribution with the degree of freedom 2k
.
―― ~~ I haven't followed the proof yet ~~
--For p-value
, please refer to Statistical method χ-square test.
-[Click here for details on the chi-square distribution](https://ja.wikipedia.org/wiki/%E3%82%AB%E3%82%A4%E4%BA%8C%E4%B9%97%E5% 88% 86% E5% B8% 83)
-(2) What is ʻinvchi2calculating? --Inverse function of the cumulative distribution function of the chi-square distribution --The chi-square test often finds the
p-value and makes a judgment based on the case where it is less than
0.05 (the chi-square value is sufficiently high), but it has a high score in the theme of spam classification. = It seems that the inverse function is used to return to the χ-square value (value that follows the χ-square distribution) because we want to make it spam. --Arguments --
chi・ ・ ・
p-value (
0 ≤ chi ≤ 1) --
df・ ・ ・ χ-square distribution degrees of freedom --Output --Chi-square value (value according to the chi-square distribution). However, it does not exceed
1. -(3) Can the inverse function be calculated? --As you can see from the cumulative density function of the chi-square distribution, the inverse function cannot be calculated. --To calculate the inverse function, [Inverse Function Theorem](https://ja.wikipedia.org/wiki/%E9%80%86%E5%87%BD%E6%95%B0%E5%AE% It seems that 9A% E7% 90% 86) is used to bring it into a differential equation, and a power series (like an nth-order equation) is used to obtain an approximate solution. ――It seems that the method of calculating the inverse function of the cumulative distribution function with high accuracy is being researched every day, so it seems that it was implemented by an algorithm that adopted an accurate calculation method. ――You can search various methods by searching the paper with keywords such as
Quantile Chi-Square. --ʻInverse Chi-Square
is not recommended as it will hit something similar and non-similar, the inverse chi-square distribution.
--Is the Fisher method called the Robinson-Fisher method in Japanese papers? --The Fisher method is obtained from the Robinson method.
Recommended Posts