[PYTHON] Calculation of similarity by MinHash

I implemented a sample of similarity calculation by MinHash.

Reference Fast and space-saving similarity judgment by b-Bit Min Hash

According to the above literature, MurmurHash3 (python wrapper) for hash calculation. ) Was used.

`minhash.py`


# -*- coding: utf-8


import mmh3


MIN_HASH_VALUE = 2 ** 128


def min_hash(words, seed):
    min_hash_word = None
    min_hash_value = MIN_HASH_VALUE

    for word in words:
        hash_ = mmh3.hash128(word, seed)
        if hash_ < min_hash_value:
            min_hash_word = word
            min_hash_value = hash_

    return min_hash_word


def get_score(s1, s2, k):
    """Get the similarity between s1 and s2 using min-hash algorithm

    k: the number of hash functions
    """
    num_match = 0

    for seed in xrange(k):
        if min_hash(s1, seed) == min_hash(s2, seed):
            num_match += 1

    return float(num_match) / k


def main():
    s1 = ['a', 'b']
    s2 = ['a']
    s3 = ['b']
    # a-z
    s4 = [chr(ascii) for ascii in xrange(97, 123)]

    k = 2 ** 10

    # Should be ~0.074
    print get_score(s1, s4, k)
    # Should be ~0.370
    print get_score(s2, s4, k)
    # Should be ~0.370
    print get_score(s3, s4, k)


if __name__ == '__main__':
    main()

[yamaneko]$ time python min_hash.py 
0.078125
0.046875
0.03125

real    0m0.112s
user    0m0.076s
sys     0m0.015s

When the value of k was increased, the value converged, but the calculation time increased.

Next, I want to implement b-bit Min Hash.

Recommended Posts

Calculation of similarity by MinHash

[Python] Calculation of image similarity (Dice coefficient)

Calculation of technical indicators by TA-Lib and pandas

Numerical calculation of compressible fluid by finite volume method

Calculation of similarity between sentences using Word2Vec (simplified version)

Calculation of homography matrix by least squares method (DLT method)

Visualization of data by prefecture

[Scientific / technical calculation by Python] Basic operation of arrays, numpy

About cost calculation of MeCab

Comparison of calculation speed by implementation of python mpmath (arbitrary precision calculation) (Note)

Visualization of matrix created by numpy

Error-free calculation with big.Float of golang

Calculation of time series customer loyalty

Error divided by 0 Processing of ZeroDivisionError 2

Play with numerical calculation of magnetohydrodynamics

Expansion by argument of python dictionary

Calculation of normal vector using convolution

Error divided by 0 Handling of ZeroDivisionError

Judgment of if by list comprehension

Summary of basic implementation by PyTorch

Train_test_split of features held by dict

Calculation of DC electric circuit current

Calculation of the number of Klamer correlations

Calculation of homebrew class and existing class

List of packages installed by conda

Plot of regression line by residual plot

10 selections of data extraction by pandas.DataFrame.query

Behavior of python3 by Sakura's server

Animation of geographic data by geopandas

[Python] Calculation of Kappa (k) coefficient

Speed improvement by self-implementation of numpy.random.multivariate_normal

Example of code rewriting by ast.NodeTransformer

Calculation of Spearman's rank correlation coefficient

Story of power approximation by Python

Project Euler 9 Retention of calculation results

Summary of restrictions by file system

[Scientific / technical calculation by Python] Fitting by nonlinear function, equation of state, scipy

[Scientific / technical calculation by Python] Calculation of matrix product by @ operator, python3.5 or later, numpy

[Control engineering] Calculation of transfer functions and state space models by Python