Einführung

Das Schwierigste bei der Überprüfung der Funktionsweise von Robinsons in PHP geschriebenem Bayesian Spam Filter war die Berechnung des Chi-Quadrats. Nachdem er verschiedene Dinge recherchiert hatte, kam ein Python-Programm heraus, das Robinson selbst geschrieben hatte. Ich mochte das nicht, also fügte ich einige Verarbeitungsschritte hinzu, um die Berechnung von PHP zu überprüfen, und als ich bemerkte, machte ich eine Klasse von Bayes'schen Filtern.

Ich konnte die Implementierung von Robinsons Bayesian Spam Filter nicht so oft finden, selbst wenn ich ihn gegoogelt habe, also habe ich beschlossen, ihn vorerst zu veröffentlichen. Ich bin neu in Python, es tut mir leid, wenn ich viele Fehler mache. Wir heissen dich willkommen. Ich hoffe es hilft jemandem. Übrigens habe ich nicht die Absicht, das Urheberrecht zu beanspruchen. Bitte mögen Sie es durch Kochen oder Backen.

Klicken Sie hier, um eine japanische Erklärung des Basian-Filters zu erhalten. http://akademeia.info/index.php?%A5%D9%A5%A4%A5%B8%A5%A2%A5%F3%A5%D5%A5%A3%A5%EB%A5%BF

Programm

""" Robinson's Spam filter program
This program is inspired by the following article
http://www.linuxjournal.com/article/6467?page=0,0
"""
import math

class RobinsonsBayes(object):
    """RobinsonsBayes
        This class only support calculation assuming you already have training set.
    """
    x = float(0.5) #possibility that first appeard word would be spam
    s = float(1)   #intensity of x
    def __init__(self,spam_doc_num,ham_doc_num):
        self.spam_doc_num = spam_doc_num
        self.ham_doc_num = ham_doc_num
        self.total_doc_num = spam_doc_num+ham_doc_num
        self.possibility_list = []

    def CalcProbabilityToBeSpam(self,num_in_spam_docs,num_in_ham_docs):
        degree_of_spam = float(num_in_spam_docs)/self.spam_doc_num;
        degree_of_ham  = float(num_in_ham_docs)/self.ham_doc_num;

        #p(w)
        probability = degree_of_spam/(degree_of_spam+degree_of_ham);

        #f(w)
        robinson_probability = ((self.x*self.s) + (self.total_doc_num*probability))/(self.s+self.total_doc_num)
        return robinson_probability

    def AddWord(self,num_in_spam_docs,num_in_ham_docs):
        probability = self.CalcProbabilityToBeSpam(num_in_spam_docs,num_in_ham_docs)
        self.possibility_list.append(probability)
        return probability

    #retrieved from
    #http://www.linuxjournal.com/files/linuxjournal.com/linuxjournal/articles/064/6467/6467s2.html
    def chi2P(self,chi, df):
        """Return prob(chisq >= chi, with df degrees of freedom).
        df must be even.
        """
        assert df & 1 == 0

        # XXX If chi is very large, exp(-m) will underflow to 0.
        m = chi / 2.0
        sum = term = math.exp(-m)
        for i in range(1, df//2):
            term *= m / i
            sum += term
        # With small chi and large df, accumulated
        # roundoff error, plus error in
        # the platform exp(), can cause this to spill
        # a few ULP above 1.0. For
        # example, chi2P(100, 300) on my box
        # has sum == 1.0 + 2.0**-52 at this
        # point.  Returning a value even a teensy
        # bit over 1.0 is no good.
        return min(sum, 1.0)

    def CalcNess(self,f,n):
        Ness = self.chi2P(-2*math.log(f),2*n)
        return Ness

    def CalcIndicator(self):
        fwpi_h=fwpi_s=1
        for fwi in self.possibility_list:
            fwpi_h *= fwi
            fwpi_s *= (1-fwi)

        H = self.CalcNess(fwpi_h,3)
        S = self.CalcNess(fwpi_s,3)

        #Notice that the bigger H(Hamminess) indicates that the document is more likely to be SPAM.
        I = (1+H-S)/2
        return I

if __name__ == '__main__':
    """
        This is a exapmple of checking if "I have a pen" is a spam.

        Following program assuming like:
            - We have 10 spam documents and 10 ham documents in our hand.
            - Number of "I" in spam documents is 1 and that of ham documents is 5
            - Number of "have" in spam documents is 2 and that of ham documents is 6
            - Number of "a" in spam documents is 1 and that of ham documents is 2
            - Number of "pen" in spam documents is 5 and that of ham documents is 1

        By the way, "I have a pen" is an sentence the most of Japanese learn in the first English class.
        Enjoy!
    """

    #init class by giving the number of document
    RobinsonsBayes = RobinsonsBayes(10,10)

    #Add train data of words one by one
    print "I   : "+str(RobinsonsBayes.AddWord(1,5))
    print "have: "+str(RobinsonsBayes.AddWord(2,6))
    print "a   : "+str(RobinsonsBayes.AddWord(1,2))
    print "pen : "+str(RobinsonsBayes.AddWord(5,1))

    #calculate Indicater
    print "I (probability to be spam)"
    print RobinsonsBayes.CalcIndicator()
    print ""

Ich habe versucht, Robinsons Bayesian Spam Filter mit Python zu implementieren

Einführung

Programm