I implemented Robinson's Bayesian Spam Filter in python

Introduction

The most troublesome thing when checking the operation of Robinson's Bayesian Spam Filter written in PHP was the calculation of chi-square. After researching various things, a python program written by Robinson himself came out. I didn't like this, so I added some processing to check the calculation of php, and when I noticed, I made a class of Bayesian Filter.

I couldn't find the implementation of Robinson's Bayesian Spam Filter so much even if I googled it, so I decided to publish it for the time being. I'm new to python, so I'm sorry if I make a lot of mistakes. We welcome you. I hope it helps someone. By the way, I don't intend to claim the copyright, so please like it by boiling or baking.

Click here for a Japanese explanation of the Bayesian filter. http://akademeia.info/index.php?%A5%D9%A5%A4%A5%B8%A5%A2%A5%F3%A5%D5%A5%A3%A5%EB%A5%BF

program

""" Robinson's Spam filter program
This program is inspired by the following article
http://www.linuxjournal.com/article/6467?page=0,0
"""
import math

class RobinsonsBayes(object):
    """RobinsonsBayes
        This class only support calculation assuming you already have training set.
    """
    x = float(0.5) #possibility that first appeard word would be spam
    s = float(1)   #intensity of x
    def __init__(self,spam_doc_num,ham_doc_num):
        self.spam_doc_num = spam_doc_num
        self.ham_doc_num = ham_doc_num
        self.total_doc_num = spam_doc_num+ham_doc_num
        self.possibility_list = []

    def CalcProbabilityToBeSpam(self,num_in_spam_docs,num_in_ham_docs):
        degree_of_spam = float(num_in_spam_docs)/self.spam_doc_num;
        degree_of_ham  = float(num_in_ham_docs)/self.ham_doc_num;

        #p(w)
        probability = degree_of_spam/(degree_of_spam+degree_of_ham);

        #f(w)
        robinson_probability = ((self.x*self.s) + (self.total_doc_num*probability))/(self.s+self.total_doc_num)
        return robinson_probability

    def AddWord(self,num_in_spam_docs,num_in_ham_docs):
        probability = self.CalcProbabilityToBeSpam(num_in_spam_docs,num_in_ham_docs)
        self.possibility_list.append(probability)
        return probability

    #retrieved from
    #http://www.linuxjournal.com/files/linuxjournal.com/linuxjournal/articles/064/6467/6467s2.html
    def chi2P(self,chi, df):
        """Return prob(chisq >= chi, with df degrees of freedom).
        df must be even.
        """
        assert df & 1 == 0

        # XXX If chi is very large, exp(-m) will underflow to 0.
        m = chi / 2.0
        sum = term = math.exp(-m)
        for i in range(1, df//2):
            term *= m / i
            sum += term
        # With small chi and large df, accumulated
        # roundoff error, plus error in
        # the platform exp(), can cause this to spill
        # a few ULP above 1.0. For
        # example, chi2P(100, 300) on my box
        # has sum == 1.0 + 2.0**-52 at this
        # point.  Returning a value even a teensy
        # bit over 1.0 is no good.
        return min(sum, 1.0)

    def CalcNess(self,f,n):
        Ness = self.chi2P(-2*math.log(f),2*n)
        return Ness

    def CalcIndicator(self):
        fwpi_h=fwpi_s=1
        for fwi in self.possibility_list:
            fwpi_h *= fwi
            fwpi_s *= (1-fwi)

        H = self.CalcNess(fwpi_h,3)
        S = self.CalcNess(fwpi_s,3)

        #Notice that the bigger H(Hamminess) indicates that the document is more likely to be SPAM.
        I = (1+H-S)/2
        return I

if __name__ == '__main__':
    """
        This is a exapmple of checking if "I have a pen" is a spam.

        Following program assuming like:
            - We have 10 spam documents and 10 ham documents in our hand.
            - Number of "I" in spam documents is 1 and that of ham documents is 5
            - Number of "have" in spam documents is 2 and that of ham documents is 6
            - Number of "a" in spam documents is 1 and that of ham documents is 2
            - Number of "pen" in spam documents is 5 and that of ham documents is 1

        By the way, "I have a pen" is an sentence the most of Japanese learn in the first English class.
        Enjoy!
    """

    #init class by giving the number of document
    RobinsonsBayes = RobinsonsBayes(10,10)

    #Add train data of words one by one
    print "I   : "+str(RobinsonsBayes.AddWord(1,5))
    print "have: "+str(RobinsonsBayes.AddWord(2,6))
    print "a   : "+str(RobinsonsBayes.AddWord(1,2))
    print "pen : "+str(RobinsonsBayes.AddWord(5,1))

    #calculate Indicater
    print "I (probability to be spam)"
    print RobinsonsBayes.CalcIndicator()
    print ""

Recommended Posts

I implemented Robinson's Bayesian Spam Filter in python
I tried using Bayesian Optimization in Python
I implemented Cousera's logistic regression in Python
Implemented in Python PRML Chapter 1 Bayesian Inference
Implemented SimRank in Python
Filter List in Python
I implemented Python Logging
I implemented the inverse gamma function in python
Implemented Shiritori in Python
I implemented breadth-first search in python (queue, drawing self-made)
I implemented a Vim-like replacement command in Slackbot #Python
Sudoku solver implemented in Python 3
I understand Python in Japanese!
What I learned in Python
6 Ball puzzle implemented in python
I implemented Donald Knuth's unbiased sequential calculation algorithm in Python
Implemented image segmentation in python (Union-Find)
Bayesian optimization package GPyOpt in Python
[Python] I implemented peripheral Gibbs sampling
Widrow-Hoff learning rules implemented in Python
I wrote Fizz Buzz in Python
Implemented label propagation method in Python
I learned about processes in Python
I can't install scikit-learn in Python
I wrote the queue in Python
Implemented Perceptron learning rules in Python
I tried Line notification in Python
Implemented in 1 minute! LINE Notify in Python
I wrote the stack in Python
I put Python 2.7 in Sakura VPS 1GB.
I tried to implement PLSA in Python
Duck book implemented in Python "Bayesian statistical modeling with Stan and R"
A simple HTTP client implemented in Python
I tried to implement permutation in Python
I made a payroll program in Python!
Implemented in Python PRML Chapter 7 Nonlinear SVM
I tried to implement PLSA in Python 2
I tried to implement Bayesian linear regression by Gibbs sampling in python
I can't debug python scripts in Eclipse
Rewrite SPSS Modeler filter nodes in Python
I tried to implement ADALINE in Python
I wanted to solve ABC159 in Python
I tried to implement PPO in Python
Implemented in Python PRML Chapter 5 Neural Networks
I searched for prime numbers in python
Implemented Stooge sort in Python3 (Bubble sort & Quicksort)
I created a password tool in Python.
CheckIO (Python)> Non-unique Elements> I implemented it
Implemented DQN in TensorFlow (I wanted to ...)
Why can't I install matplotlib in python! !!
I made a module in C language to filter images loaded by Python
I want to do Dunnett's test in Python
A memo that I wrote a quicksort in Python
I was able to recurse in Python: lambda
I want to create a window in Python
I tried playing a typing game in Python
When I try matplotlib in Python, it says'cairo.Context'
I tried simulating the "birthday paradox" in Python
I tried the least squares method in Python
Implemented memoization recursion and exploration in Python and Go
I investigated in detail about variable processing in python