I implemented a sample similarity calculation using MinHash.
Reference: Fast and space-saving similarity judgment by b-Bit Min Hash
Following the above reference, I used MurmurHash3 (via its Python wrapper, mmh3) for the hash calculation.
min_hash.py
# -*- coding: utf-8 -*-
import mmh3

# mmh3.hash128 returns an unsigned 128-bit value, so this is larger than any possible hash
MIN_HASH_VALUE = 2 ** 128


def min_hash(words, seed):
    """Return the word whose MurmurHash3 value is smallest for the given seed."""
    min_hash_word = None
    min_hash_value = MIN_HASH_VALUE
    for word in words:
        hash_ = mmh3.hash128(word, seed)
        if hash_ < min_hash_value:
            min_hash_word = word
            min_hash_value = hash_
    return min_hash_word


def get_score(s1, s2, k):
    """Get the similarity between s1 and s2 using the MinHash algorithm.

    k: the number of hash functions (each seed acts as one hash function)
    """
    num_match = 0
    for seed in xrange(k):
        if min_hash(s1, seed) == min_hash(s2, seed):
            num_match += 1
    return float(num_match) / k


def main():
    s1 = ['a', 'b']
    s2 = ['a']
    s3 = ['b']
    # a-z
    s4 = [chr(ascii) for ascii in xrange(97, 123)]

    k = 2 ** 10
    # Jaccard similarity is 2/26, so this should be ~0.077
    print get_score(s1, s4, k)
    # Jaccard similarity is 1/26, so this should be ~0.038
    print get_score(s2, s4, k)
    # Jaccard similarity is 1/26, so this should be ~0.038
    print get_score(s3, s4, k)


if __name__ == '__main__':
    main()
[yamaneko]$ time python min_hash.py
0.078125
0.046875
0.03125
real 0m0.112s
user 0m0.076s
sys 0m0.015s
When I increased the value of k, the estimates converged toward the true Jaccard similarities, but the computation time also increased, since it grows linearly with k.
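For reference, here is a minimal sketch of how one could observe this trade-off, assuming the get_score function from the script above is available; the benchmark helper and the particular k values are my own choices for illustration.

import time

def benchmark(s1, s2):
    # Larger k gives a tighter estimate (the standard error shrinks roughly as 1/sqrt(k)),
    # while the running time grows linearly in k.
    for k in [2 ** 6, 2 ** 8, 2 ** 10, 2 ** 12]:
        start = time.time()
        score = get_score(s1, s2, k)
        elapsed = time.time() - start
        print 'k=%5d  score=%.4f  time=%.3fs' % (k, score, elapsed)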
Next, I want to implement b-bit Min Hash.
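As a rough preview, a minimal sketch of the b-bit idea could look like the following: keep only the lowest b bits of each min hash, count matches as before, and then correct for the probability (about 1/2**b) that two unrelated sets collide on those bits by chance. The function name and the simple correction formula are my own assumptions for illustration, not the implementation from the reference.

def get_score_b_bit(s1, s2, k, b):
    """Estimate the Jaccard similarity keeping only the lowest b bits of each min hash."""
    mask = (1 << b) - 1
    num_match = 0
    for seed in xrange(k):
        h1 = min(mmh3.hash128(word, seed) for word in s1)
        h2 = min(mmh3.hash128(word, seed) for word in s2)
        if (h1 & mask) == (h2 & mask):
            num_match += 1
    raw = float(num_match) / k
    # Even unrelated sets match on b bits with probability ~1/2**b,
    # so subtract that chance-collision rate and rescale.
    collision = 1.0 / (2 ** b)
    return (raw - collision) / (1 - collision)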