[PYTHON] Calculation of similarity by MinHash

I implemented a sample of similarity calculation by MinHash.

Reference Fast and space-saving similarity judgment by b-Bit Min Hash

According to the above literature, MurmurHash3 (python wrapper) for hash calculation. ) Was used.

minhash.py


# -*- coding: utf-8


import mmh3


MIN_HASH_VALUE = 2 ** 128


def min_hash(words, seed):
    min_hash_word = None
    min_hash_value = MIN_HASH_VALUE

    for word in words:
        hash_ = mmh3.hash128(word, seed)
        if hash_ < min_hash_value:
            min_hash_word = word
            min_hash_value = hash_

    return min_hash_word


def get_score(s1, s2, k):
    """Get the similarity between s1 and s2 using min-hash algorithm

    k: the number of hash functions
    """
    num_match = 0

    for seed in xrange(k):
        if min_hash(s1, seed) == min_hash(s2, seed):
            num_match += 1

    return float(num_match) / k


def main():
    s1 = ['a', 'b']
    s2 = ['a']
    s3 = ['b']
    # a-z
    s4 = [chr(ascii) for ascii in xrange(97, 123)]

    k = 2 ** 10

    # Should be ~0.074
    print get_score(s1, s4, k)
    # Should be ~0.370
    print get_score(s2, s4, k)
    # Should be ~0.370
    print get_score(s3, s4, k)


if __name__ == '__main__':
    main()
[yamaneko]$ time python min_hash.py 
0.078125
0.046875
0.03125

real    0m0.112s
user    0m0.076s
sys     0m0.015s

When the value of k was increased, the value converged, but the calculation time increased.

Next, I want to implement b-bit Min Hash.

Recommended Posts

Calculation of similarity by MinHash
[Python] Calculation of image similarity (Dice coefficient)
Calculation of technical indicators by TA-Lib and pandas
Numerical calculation of compressible fluid by finite volume method
Calculation of similarity between sentences using Word2Vec (simplified version)
Calculation of homography matrix by least squares method (DLT method)
Visualization of data by prefecture
[Scientific / technical calculation by Python] Basic operation of arrays, numpy
About cost calculation of MeCab
Comparison of calculation speed by implementation of python mpmath (arbitrary precision calculation) (Note)
Visualization of matrix created by numpy
Error-free calculation with big.Float of golang
Calculation of time series customer loyalty
Error divided by 0 Processing of ZeroDivisionError 2
Play with numerical calculation of magnetohydrodynamics
Expansion by argument of python dictionary
Calculation of normal vector using convolution
Error divided by 0 Handling of ZeroDivisionError
Judgment of if by list comprehension
Summary of basic implementation by PyTorch
Train_test_split of features held by dict
Calculation of DC electric circuit current
Calculation of the number of Klamer correlations
Calculation of homebrew class and existing class
List of packages installed by conda
Plot of regression line by residual plot
10 selections of data extraction by pandas.DataFrame.query
Behavior of python3 by Sakura's server
Animation of geographic data by geopandas
[Python] Calculation of Kappa (k) coefficient
Speed improvement by self-implementation of numpy.random.multivariate_normal
Example of code rewriting by ast.NodeTransformer
Calculation of Spearman's rank correlation coefficient
Story of power approximation by Python
Project Euler 9 Retention of calculation results
Summary of restrictions by file system
Similarity calculation between episodes of Precure using live timeline and topic model
[Scientific / technical calculation by Python] Fitting by nonlinear function, equation of state, scipy
[Scientific / technical calculation by Python] Calculation of matrix product by @ operator, python3.5 or later, numpy
[Control engineering] Calculation of transfer functions and state space models by Python