[PYTHON] Categorize sites using b-Bit Min Hash

Motivation

I wanted to categorize URLs, so I tried b-Bit MinHash. Concretely, the motivation is the following: when estimating a user's CV (conversion) probability using site-browsing information extracted from cookies as features, the feature vector becomes quite sparse. I therefore wanted to group URLs into categories and make the feature vector somewhat denser, which is why I am experimenting with the method in the title.

What I wrote in this blog

Overview of b-Bit Min Hash

References

b-Bit Min Hash procedure

For an overview of the method, please see the SmartNews development blog post and the material by Mr. Okanohara of PFI that is linked from it. They explain it far better than I could.

So here I will only briefly describe the steps I followed when writing the code (a toy sketch of the core idea follows the list below).

First, prepare teacher data

Next, prepare the learner data

Implementation of classification
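
As a warm-up before the reference code, here is a minimal toy sketch of the core idea (the word sets, K, and B below are arbitrary values I chose for illustration): hash every word in a set with k seeded hash functions, keep only the lowest b bits of each per-seed minimum, and compare the resulting signatures position by position.

python


# coding: utf-8
# Toy sketch of b-bit MinHash (illustration only; the word sets, K, and B are arbitrary)
import mmh3

K = 128  # number of seeded hash functions
B = 1    # number of low bits kept from each minimum hash value

def signature(words, k=K, b=B):
    """k-dimensional b-bit min-hash signature of a word set."""
    mask = (1 << b) - 1
    return [min(mmh3.hash(w, seed) for w in words) & mask for seed in range(k)]

def raw_match_fraction(sig1, sig2):
    """Fraction of positions whose b-bit values agree."""
    return float(sum(1 for x, y in zip(sig1, sig2) if x == y)) / len(sig1)

set_a = set(['baseball', 'pitcher', 'stadium', 'home', 'run'])
set_b = set(['baseball', 'pitcher', 'stadium', 'golf', 'club'])

print(raw_match_fraction(signature(set_a), signature(set_b)))

Note that with b = 1 roughly half of the positions match even for unrelated sets; that is exactly what the C1/C2 correction in the reference code below accounts for.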

Reference code

The code I wrote is shown below. It was put together by cutting and pasting from various places, so the naming conventions are inconsistent. As usual, it was written by a non-engineer, so the code is fairly ugly, but I would appreciate it if you could read it with an open mind.

Get text data from teacher URL and store word set etc. in DB

python


# coding: utf-8

import urllib2
from BeautifulSoup import BeautifulSoup, NavigableString, \
                             Declaration, Comment, BeautifulStoneSoup
import unicodedata
import chardet
from urlparse import urljoin
import sqlite3 as sqlite
import pickle
import SeparateWords as spw

#List of words to ignore
ignorewords = set(['the', 'of', 'to', 'and', 'a', 'in', 'is', 'it', '-', '/', '#', "'", '_', ',', '(', ')', '[', ']', '〜', '!', '|', '♪', '...', '>', '<', ':', '!!', '&', '/', '+', '*'])
#List of tags to recognize as blocks
block_tags = frozenset(['p', 'div', 'table', 'dl', 'ul', 'ol', 'form', 'address',
                   'blockquote', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'fieldset',
                   'hr', 'pre', 'article', 'aside', 'dialog', 'figure',
                   'footer', 'header', 'legend', 'nav', 'section'])

class crawler:
    #Initialize the crawler with the name of the database
    def __init__(self, dbname):
        self.dbname = dbname
        self.con=sqlite.connect(dbname)
        self.con.text_factory = str # specify str so that utf-8 byte strings can be stored
    
    def __del__(self):
        self.con.close()

    def dbcommit(self):
        self.con.commit()

    #Index individual pages
    def addtoindex(self,url,category,wordlist):
        if self.isindexed(url): return
        print 'Indexing ' + url
                                
        #Store word list in DB for each url
        self.con.execute( "insert into wordvector values(?,?,?,?)" , \
                            (url, category, pickle.dumps(wordlist), '') )
        self.dbcommit()

    def getNavigableStrings(self,soup):
        #Determine if soup is a type that contains contents
        if isinstance(soup, NavigableString): #NavigableString object has all member variables except contents and string
            if type(soup) not in (Comment, Declaration):
                yield soup
        #Get text if it contains contents and is not a program code
        elif soup.name not in ('script', 'style'):
            is_block = soup.name in block_tags #Confirm the existence of the block tag
            if is_block:
                yield u'\n'
            for c in soup.contents:
                for g in self.getNavigableStrings(c):
                    yield replace_str(g)
            if is_block:
                yield u'\n'
    
    #Returns True if the URL is already indexed
    def isindexed(self,url):
        u=self.con.execute \
            ("select words from wordvector where url='%s'" % url).fetchone()
        if u!=None:
            return True
        return False    
    
    #Receives a list of pages and does a breadth-first search at a given depth
    #Index the page
    def crawl(self,pages,category,depth=2):
        for _ in xrange(depth):
            newpages=set()
            for page in pages:
                try:
                    response=urllib2.urlopen(page).read()
                except:
                    print "Could not open %s" % page
                    continue
                en = chardet.detect(response)['encoding']
                soup=BeautifulSoup(response.decode(en), #Convert character code according to the site
                        convertEntities = BeautifulStoneSoup.HTML_ENTITIES) #Character conversion of HTML special characters
                text = u''.join( self.getNavigableStrings(soup) )
                text = normalizeText(text) #String normalization
                words = spw.separateWords(text) #Get word set
                self.addtoindex(page,category,words) # url, category,Save word set in DB
                
                #Get the URL of the a tag for searching deeper than 2
                links=soup('a')
                for link in links:
                    if ('href' in dict(link.attrs)):
                        url=urljoin(page,link['href'])
                        if url.find("'")!=-1: continue
                        url=url.split('#')[0] #Remove fragment identifier
                        if url[0:4]=='http' and not self.isindexed(url):
                            newpages.add(url)
            pages=newpages #Crawl the newly found links at the next depth

    #Create a database table
    def createindextables(self):
        name = 'wordvector'
        sql = "SELECT name FROM sqlite_master WHERE type='table' AND name=?"
        res = self.con.execute(sql, (name,)).fetchone()
        if res is not None: #Drop and recreate the table if it already exists
            self.con.execute('drop table wordvector')
        self.con.execute('create table wordvector(url, category, words, hash_vector)')
        self.con.execute('create index urlidx on wordvector(url)')
        self.dbcommit()

def nonEmptyLines(text):
    """Removes unnecessary whitespace and returns non-blank lines"""
    for line in text.splitlines():
        line = u' '.join(line.split())
        if line:
            yield line

def normalizeText(text):
    """Remove unnecessary whitespace and line breaks after normalization"""
    text = unicodedata.normalize('NFKC', text) #NFKC normalization: half-width kana to full-width, full-width alphanumerics and symbols to half-width
    return u'\n'.join(nonEmptyLines(text))

def replace_str(line):
    return line.replace('&amp;','&')

if __name__=='__main__':    
    fname = 'teacherURL.txt'
    crawler=crawler('teacher.db')
    crawler.createindextables()
    with open(fname, 'r') as f:
        for line in f:
            s = line.strip().split('\t')
            print s[0]
            category = s[0]
            pagelist = [s[1]]
            crawler.crawl(pagelist, category, depth=1)
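
The main block above assumes an input file teacherURL.txt with one tab-separated category and seed URL per line; a hypothetical example of its contents:

finance	http://example.com/finance/
sports	http://example.com/sports/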

The part that extracts words from the text (spw.separateWords(text)) is the following code.

Extract words from text

python


# coding: utf-8
import MeCab

#List of words to ignore
ignorewords = set(['the', 'of', 'to', 'and', 'a', 'in', 'is', 'it'])
ignoresymbols = set(['-', '/', '#', "'", '_', ',', '(', ')', '[', ']',
                     '~', '!', '|', '♪', '...', '>', '<', ':', '!',
                      '&', '/', '+', '*', '【', '】', '(', ')', '!', ':', '-',
                      '_', '?','%', '「', '」','~','.', '{', '}','"',
                      '<', '>', '/'])

def separateWords(text):
    tagger = MeCab.Tagger('-Ochasen')
    node = tagger.parseToNode(text.encode('utf-8'))
    word_set = set([])
    while node:
        feat = node.feature.split(",")
        if feat[0] == "名詞" \
                and (feat[1] == "一般"
                or feat[1] == "固有名詞" or feat[1] == "形容動詞語幹"): #noun whose subcategory is general, proper noun, or adjectival noun stem (IPA dictionary labels)
            #word = node.surface.decode(en)
            word = node.surface
            if word not in ignorewords:
                word_set.add( word )
        node = node.next
    return word_set
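
As a quick standalone check of separateWords (assuming MeCab and the IPA dictionary are installed; the sample sentence is arbitrary):

python


# coding: utf-8
# Standalone check of SeparateWords (the sample sentence is arbitrary)
import SeparateWords as spw

text = u'プロ野球の試合が東京ドームで行われた'
for word in spw.separateWords(text):
    print(word)  # prints the extracted nouns; the exact segmentation depends on the MeCab dictionary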

Next is the part that creates the minimum hash list for the teacher data from the acquired word sets and updates the DB.

Create and save the minimum hash list

python


# coding: utf-8

import sqlite3 as sqlite
import pickle
import hashing

class CreateDB(object):
    def __init__(self,dbname):
        self.db = sqlite.connect(dbname)
        self.db.text_factory = str # specify str so that utf-8 byte strings can be stored
        self.hashing = hashing.Hashing()
    
    def __del__(self):
        self.db.close()
        
    def db_commit(self):
        self.db.commit()
        
    def create_hashtable(self, url, wordlist):
        hash_v = self.hashing.get_min_vector(wordlist)
        self.update_vector(url, pickle.dumps(hash_v))
        self.db_commit()

    def update_vector(self,url,vector):
        self.db.execute("update wordvector set hash_vector=? where url=?" , \
                         (vector, url))

if __name__=='__main__':
    dbname = 'teacher.db'
        
    c_db = CreateDB(dbname)
    
    con = sqlite.connect(dbname) #Call the DB where the word vector is stored
    
    itr = 1
    while True:
        sql = "select * from wordvector where rowid=='%d'"        
        res = con.execute(sql % itr).fetchone()
        print res

        if res is None:break
        
        url = res[0]; category = res[1]; wordlist = pickle.loads(res[2]) #The word set was stored with pickle.dumps, so load it back
        c_db.create_hashtable(url, wordlist)
        
        itr += 1
    con.close()

The minimum hash list is created as follows.

Creating a minimum hash list

python


# coding: utf-8

import numpy as np
import mmh3
from multiprocessing import *

class Hashing(object):
    ''' 
Hashing is applied to the word set obtained at each url to obtain the minimum value.
Get k-dimensional vector for each url
Judgment of similarity by comparing this k-dimensional vector
    '''
    def __init__(self,bbit=1,k=128):
        self.bbit = bbit
        self.hash_cnt = k
    
    def get_min_vector(self,feat_names):
        '''This function will be turned by url'''
        hash_vec = []
        #Take the minimum hash of the word set once for each of the k seeds
        for seed in xrange(self.hash_cnt): #Generate self.hash_cnt seeds
            pool = Pool(processes=8)
            hash_val = pool.map( get_hash_value, ((feat_name, seed) for feat_name in feat_names) ) #The k seeds act as k different hash functions
            pool.close()
            pool.join()
            min_val = min(hash_val) & ((1 << self.bbit) - 1) #Keep only the lowest b bits of the minimum hash value
            hash_vec.append(min_val)
        return hash_vec
        
def get_hash_value((feat_name,seed)):
    return mmh3.hash(feat_name,seed)
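
A usage sketch for the Hashing class above (the word lists are made up; depending on your platform, multiprocessing may require running this under an if __name__ == '__main__': guard):

python


# coding: utf-8
# Usage sketch for hashing.Hashing (illustrative word lists)
import numpy as np
import hashing

if __name__ == '__main__':
    h = hashing.Hashing(bbit=1, k=128)
    vec_a = h.get_min_vector(['baseball', 'pitcher', 'stadium', 'home', 'run'])
    vec_b = h.get_min_vector(['baseball', 'pitcher', 'stadium', 'golf', 'club'])
    #Raw fraction of matching b-bit values; the MinHash class below applies the C1/C2 correction
    print(float(np.sum(np.array(vec_a) == np.array(vec_b))) / len(vec_a))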

At this point, the teacher data is ready. Similarly, get the word set for the learner and store it in the DB.

Below, MinHash is run using the learner word sets obtained this way and the teacher hash lists created above. Each learner URL is then assigned a category, the DB is updated, and we are done. The code is as follows.

Implementation of b-Bit Min Hash

python


# coding: utf-8

import numpy as np
import sqlite3 as sqlite
import hashing
import pickle
from multiprocessing import *

class MinHash(object):
    def __init__(self,train_dbname,teacher_dbname):
        self.train_db = sqlite.connect(train_dbname)
        self.teacher_db = sqlite.connect(teacher_dbname)
        self.hashing = hashing.Hashing() # default values are bbit=1, k=128
        
    def __del__(self):
        self.train_db.close()
        self.teacher_db.close()
        
    def db_commit(self):
        self.train_db.commit()
        
    def get_category(self):
        learner_urls = self.get_urls(self.train_db)
        teacher_urls = self.get_urls(self.teacher_db)
        for lrnr_url in learner_urls:
            print "computing hash vector:", lrnr_url[0]
            learner_hash = self.calculate_hashvector( lrnr_url[0] )
            
            print "calculating similarity:", lrnr_url[0]
            learner_words =  self.get_wordslist( self.train_db,lrnr_url[0] )
            
            pool = Pool(processes=8)
            sim = pool.map(  similarity_value,
                        ( (learner_words, self.get_wordslist( self.teacher_db,tchr_url[0] ),
                             learner_hash, self.get_hash_vector(tchr_url[0]))
                                                for tchr_url in teacher_urls )  )  #Calculation of similarity
            pool.close()
            pool.join()
            
            sim = np.array(sim)
            print "sim: ",sim
            idx = np.where(sim==max(sim))[0][0]
            sim_url = teacher_urls[idx][0]
            
            category = self.get_similar_category( sim_url )
            print "similar category of this URL is: ", category
            self.update_category(category, lrnr_url[0]) #Assign the most similar teacher URL's category to the learner URL

    def calculate_hashvector(self, url):
        """Update the original DB while creating a hash vector"""
        hash_vector = self.hashing.get_min_vector( 
                            self.get_wordslist(self.train_db, url) )
        self.train_db.execute( "update wordvector set hash_vector=? where url=?" , 
                                    (pickle.dumps(hash_vector), url, ) ) #Save with pickle
        self.db_commit()
        return hash_vector
    
    def update_category(self,category,url):
        self.train_db.execute( "update wordvector set category=? where url=?" ,(category, url, ) )
        self.db_commit()
    
    def get_wordslist(self,db,url):
        wordlist = db.execute( "select words from wordvector where url=?" , (url,) ).fetchone()[0]
        return pickle.loads(wordlist)
        
    def get_urls(self,db):
        return db.execute("select url from wordvector").fetchall()
    
    def get_similar_category(self,url):
        return self.teacher_db.execute( "select category from wordvector where url=?" ,(url, ) ).fetchone()[0]

    def get_hash_vector(self,url):
        hash_v = self.teacher_db.execute( "select hash_vector from wordvector where url=?" , (url, ) ).fetchone()[0]
        return pickle.loads(hash_v)

def similarity_value( (lrnr_words, teacher_words, lrnr_hash, teacher_hash) ):
    try:
        feat_dim = 1 << 1 #2**b, where b is the number of bits kept per hash; must match Hashing(bbit=1)
        C1, C2 = calculate_distribution(lrnr_words, teacher_words, feat_dim)
        
        jaccard = float( sum( np.array(lrnr_hash)==np.array(teacher_hash) ) ) \
                                                    /len(lrnr_hash) #Calculation of similarity
        return (jaccard-C1)/(1-C2)
    
    except Exception, e:
        import sys
        import traceback
        sys.stderr.write(traceback.format_exc())

def calculate_distribution(lrnr_words, teacher_words, feat_dim):
    all_words = lrnr_words | teacher_words
    D = len(all_words); f1 = len(lrnr_words); f2 = len(teacher_words) #A simple combination of two word sets is regarded as a whole word set.
    r1 = float(f1)/D; r2 = float(f2)/D; sum_r = r1 + r2
        
    A1 = calc_A(r1, feat_dim); A2 = calc_A(r2,feat_dim)

    C1 = A1*r2/sum_r + A2*r1/sum_r
    C2 = A1*r1/sum_r + A2*r2/sum_r
    
    return C1, C2
    
def calc_A(r, feat_dim):
    A_num = r*(1-r)**(feat_dim-1)
    A_denom = 1-(1-r)**feat_dim
    return A_num/A_denom
    
if __name__=='__main__':
    traindbname = 'learner.db'
    teacherdbname = 'teacher.db'
     
    minhash = MinHash(traindbname,teacherdbname)
    minhash.get_category()

As you can see from the paper, the method assumes that D (the size of the whole word set) and k (the number of hash functions) are sufficiently large. If they are not, the approximation does not hold and you may get strange values, so be careful. In the code above, D is just the union of the two word sets being compared; I did this only to keep the experiment simple. In real use, D should be the size of the whole word set.
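
For reference, here is a small standalone calculation of the correction terms using the C1/C2 formulas from the b-bit MinHash paper (r1, r2, and the observed match fraction are made-up numbers, with b = 1):

python


# coding: utf-8
# Worked example of the b-bit MinHash correction terms (made-up numbers, b = 1)

def calc_A(r, two_pow_b):
    return r * (1 - r) ** (two_pow_b - 1) / (1 - (1 - r) ** two_pow_b)

b = 1
two_pow_b = 1 << b         # 2**b = 2
r1, r2 = 0.01, 0.02        # f1/D and f2/D: how large each set is relative to the whole word set
A1, A2 = calc_A(r1, two_pow_b), calc_A(r2, two_pow_b)
C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)

P = 0.7                    # observed fraction of matching 1-bit values
R = (P - C1) / (1 - C2)    # corrected resemblance estimate
print(C1, C2, R)           # with small r1 and r2, C1 and C2 are close to 0.5, so R is roughly 2*P - 1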

Experimental result

I experimented with the categories finance, economy, and sports (baseball, golf). Within sports, the vocabularies of baseball and golf are quite different, so URLs were not assigned well to the sports category. For practical use, I think the categories themselves need more careful design. Incidentally, I referred to the Open Directory Project (ODP) when creating the categories.

As for computation time, creating the lists of minimum hash values is by far the most time-consuming part. I do use parallel processing, but there is still a considerable amount of waiting. (I did not have time to measure average runtimes, so this is only an impression.)
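
Much of that waiting time probably comes from creating a new Pool for every one of the k seeds in get_min_vector. A variant I have not benchmarked, which creates the Pool once and parallelizes over the seeds instead of over the words, would look roughly like this:

python


# coding: utf-8
# Sketch of an alternative get_min_vector that creates the Pool only once (not benchmarked)
import mmh3
from multiprocessing import Pool

BBIT = 1 # number of low bits to keep, as in Hashing(bbit=1)

def min_hash_for_seed((feat_names, seed)):
    #Minimum hash over the whole word set for one seed, keeping only the lowest b bits
    return min(mmh3.hash(name, seed) for name in feat_names) & ((1 << BBIT) - 1)

def get_min_vector(feat_names, k=128, processes=8):
    feat_names = list(feat_names)
    pool = Pool(processes=processes)
    try:
        #One task per seed, so the Pool is created once instead of k times
        return pool.map(min_hash_for_seed, [(feat_names, seed) for seed in range(k)])
    finally:
        pool.close()
        pool.join()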

Also, although the code above uses a 32-bit hash, I recommend a 64-bit hash, since it has a lower probability of collisions.
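
mmh3 also provides wider variants; assuming your version of mmh3 has mmh3.hash64 (which returns a pair of 64-bit values), the hash helper could be swapped out like this:

python


# coding: utf-8
# Sketch of a 64-bit hash helper (assumes mmh3.hash64 is available in your mmh3 version)
import mmh3

def get_hash_value_64((feat_name, seed)):
    #mmh3.hash64 returns a pair of 64-bit values; use the first one as the minimum-hash candidate
    return mmh3.hash64(feat_name, seed)[0]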

Summary (or rather impression)

First of all, the approximation method used in MinHash is interesting in itself. The same kind of approximation is used in bayon, the lightweight data clustering tool introduced on mixi's engineering blog, which handles large amounts of data. It makes me feel that efficient computation at that scale is difficult without knowing (or inventing) approximation methods like this.

As a tool, b-Bit MinHash gives fairly accurate judgments even though it is easy to implement. As the paper also mentions, I think there are many situations where it can be applied in practice.

Sorry to trouble you, but I would appreciate it if you could point out any mistakes.
