[PYTHON] Multiplicative hash is not perfect (proven)

Overview

note

Introduction

On Mercari's blog, there is an entry [metal_unk 17] that proves that the Knuth multiplicative hash is the least complete hash function, showing the proof that the Knuth multiplicative hash is the least complete.

I read that entry with great interest because practical hash functions are usually described as causing index collisions [Weiss 14]. So far, I've found three things:

Regarding the first point, in the Hatena bookmark comment,

What I'm doing is the same, but it seems that "prime and max are relatively prime, so there is an inverse element of prime under the law max. Therefore, bijection".

[smoking186 17]

It looks like f (n) = ((n * PRIME) mod MAX) + 1. Bijection is self-evident because there is an inverse element of PRIME in mod MAX. Also, even if restoreability is a requirement (because if the inverse element is not in the domain, you can play it), it does not have to be surjective.

Some people, like [mather314 17], understand that it's self-explanatory. At first, I couldn't understand it from the blog entry or this bookmark comment alone, so I'd like to understand it by myself. I wrote a proof that doesn't use. But after that, I was able to write a concise proof based on these comments. Thank you.

Regarding the second point, the definition of multiplicative hash is roughly divided into two types: procedures that use real numbers (hereinafter referred to as multiplication hashes (real numbers)) and procedures that use integer operations and bit shift operations (hereinafter, multiplication hashes). (Called (integer value)). Both of them, unlike the hash function of Mercari, have been confirmed by computer experiments to return the result of index collision. However, refer to Knuth's original book [Knuth 73]. Since there was no way to do this, I supplemented the multiplicative hash with lecture materials from different authors. I read those parts in a patchwork, which may contain incorrect explanations due to inconsistent understanding. Please be aware of the sex.

For the third point, using the implemented Python code and execution results, we show that the multiplicative hash is not perfect from the concrete example of collision. This makes Mercari's hash function essentially a multiplicative hash. You can see that it is different.

If you find any mistakes in the description, please point it out. Thank you.

Mercari hash function

** Definition 1 ** $ p $ is a prime number greater than or equal to 3, $ m $ is a power of 2 representing the table size of the hash function ($ m = 2 ^ r $ using a positive integer $ r $), respectively. At this time, the key value $ n $ is converted to an index ** Mercari's hash function ** $ f (n) $ is defined as follows.

f(n) = (np\mod m) + 1.

The range of values for the $ f $ key (domain) and the range of values for the index (range) are both [1, ..., m].

** Lemma 1 ** Let $ a and m $ be coprime positive integers. For $ 0 \ leq n \ le m-1 $, let $ g (n) = an \ mod m $. For a pair of $ n, n'$ ($ 0 \ leq n <n'\ leq m-1 $), it becomes $ g (n) \ neq g (n') $.

** Proposition 1 ** For two arbitrarily given different keys $ n, n'(1 \ leq n, n'\ leq m, n \ neq n') $, the hash function of Mercari is It becomes $ f (n) \ neq f (n') $.

** Series 1 ** In Mercari's hash function $ f $, $ f $ is bijective as long as $ a and m $ are relatively prime.

And the hash function $ f (n) $ of Mercari has the following features. ** Proposition 2 ** When the key $ n $ is odd, $ f (n) $ is even, and when $ n $ is even, $ f (n) $ is odd.

Explain the meaning of Proposition 2. When copying from a set of indexes to a set of keys, Mercari's hash function separates the two subsets of indexes into two subsets of each key. The following code outputs the index of Mercari's hash function for $ a = 40507 $, table size $ m = 16 $, and the index of the multiplication hash (integer value) under the same conditions for comparison.

hashsample.py


import math

def kmh2(n, a, r, w): # Knuth's multiplicative hash
    """
    defined in a handout of lecture 5 Hashing 1
    http://courses.csail.mit.edu/6.006/fall09/lecture_notes/lecture05.pdf
    """
    return ((a * n) % 2**w) >> (w-r)

def kmh(n, phi, tablesize): # Knuth's multiplicative hash
    """
    defined in a handout on hash functions 
    classes.engineering.wustl.edu/cse241/handouts/hash-functions.pdf
    """
    return int(math.floor(tablesize * (phi * n - math.floor(phi * n))))

def mmh(n, prime, tablesize): # Mercari's multiplicative hash
    return ((n * prime) % tablesize) + 1

def main():
    phis = [0.5*(math.sqrt(5)-1.0)]
    w = 16
    for phi in phis:
        a = 40507 # fixed prime
        # a = 2*int(2**(w-1)*phi)+1
        for v in range(4, 5):
            print "======="
            tablesize = 2**v
            print "tablesize", tablesize
            mmhvals = [mmh(x+1, a, tablesize) for x in range(tablesize)]
            kmh2vals = [kmh2(x, a, v, w) for x in range(tablesize)]
            print "mercali ", mmhvals
            print "multi   ", kmh2vals

main()

The execution result is as follows.

hashsample.txt


=======
tablesize 16
mercali  [1, 12, 7, 2, 13, 8, 3, 14, 9, 4, 15, 10, 5, 16, 11, 6]
multi    [0, 9, 3, 13, 7, 1, 11, 5, 15, 9, 2, 12, 6, 0, 10, 4]

The results of this experiment are shown in the following figure. However, the keys and indexes are arranged separately by odd numbers and even numbers. The keys and indexes in which the collision occurred are filled in the same color (red, green). The arrow that represents is represented by a thick line. The index that is not converted from any key is filled in light blue.

メルカリのハッシュ関数はインデックスの偶奇でキーの偶奇が決まる

Proposition 2 shows that there is no key value that exceeds the boundary between odd and even indexes. Proposition 2 does not hold for the multiplication hash (integer value) described later, and the keys are scattered. You can see from the figure.

In addition, Mercari's hash function has the same odd and even pattern for index to key correspondence. The pattern from the previous example is shown in the following figure. The index and key pair is filled with the same color.

メルカリのハッシュ関数にはパターンがある

Mercari's hash function always increases the index value by $ 2p $ if the key value increases by 2, so such a pattern always appears, and by properly aligning the first value of the index, from the index to the key at odd numbers. The correspondence of and the correspondence of even numbers can be matched.

multiplicative hash

Multiplication hash (real number)

First, a multiplication hash (real value) is defined based on the lecture material [Burler 15].

** Definition 2 ** $ m $ is the table size of the hash function, $ \ phi $ is a real number between 0 and 1. ** Multiply hash (real number) for key $ n $ ** $ h_1 (n) ) $ Is defined as follows.

h(n) = \lfloor m(\phi n - \lfloor \phi n\rfloor)\rfloor.

The value of $ \ phi $ is Knuth's recommended golden ratio $ (\ sqrt {5} -1) / 2 $, $ \ sqrt {2} / 2 $, $ \ sqrt {3} -1 $ Irrational numbers such as [Burler 15] are often used.

The range of values for the key of $ h_1 $ (domain) and the range of values for the index (range) are both [0, ..., m-1].

The calculation procedure of this function is also expressed as follows [FG 00].

s = phi * n
x = fractional part of s
h_1(n) = floor(m*x)

Neither of them includes $ + 1 $ added in Mercari's hash function, but in the following, we consider it as $ h_1 (n) $ without adding 1.

The Python implementation of the multiplication hash (real number) is shown below.

multihash1.py


import math

def kmh(n, phi, tablesize): # Knuth's multiplicative hash
    """
    defined in a handout on hash functions 
    classes.engineering.wustl.edu/cse241/handouts/hash-functions.pdf
    """
    return int(math.floor(tablesize * (phi * n - math.floor(phi * n))))

#def mmh(n, prime, tablesize): # Mercari's multiplicative hash
#    return ((n * prime) % tablesize) + 1

def main():
    phis = [0.5*(math.sqrt(5)-1.0), 0.5*math.sqrt(2), math.sqrt(3)-1.0] # 1st: the golden ratio
    for phi in phis:
        for v in range(2, 6): # tablesize = 4, 8, 16, 32
            print 
            tablesize = 2**v
            kmhvals = [kmh(x, phi, tablesize) for x in range(tablesize)]
            print kmhvals
            if (len(list(set(kmhvals))) != len(kmhvals)):
                print "collision found (phi:", phi, "tablesize:", tablesize, ")"
                #kmhvals.sort()
                kmhset = set(kmhvals)
                print "indices", kmhvals
                missed = sorted(list(set([x for x in range(tablesize)]).difference(kmhset)))
                print "missed indices", missed
            else: print "perfect (phi:", phi, "tablesize:", tablesize, ")"

main()

Output for each combination of table size $ m = 4, 8, 16, 32, \ phi = (\ sqrt {5} -1) / 2, \ sqrt {2} / 2, \ sqrt {3} -1 $ The results show some of the index collisions.

multihash1.txt


indices [0, 9, 3, 13, 7, 1, 11, 5, 15, 8, 2, 12, 6, 0, 10, 4]
collision found (phi: 0.61803398875 tablesize: 16 )
missed indices [14]

indices [0, 19, 7, 27, 15, 2, 22, 10, 30, 17, 5, 25, 13, 1, 20, 8, 28, 16, 3, 23, 11, 31, 19, 6, 26, 14, 2, 21, 9, 29, 17, 5]
collision found (phi: 0.61803398875 tablesize: 32 )
missed indices [4, 12, 18, 24]

indices [0, 2, 1, 0]
collision found (phi: 0.707106781187 tablesize: 4 )
missed indices [3]

indices [0, 2, 1, 0]
collision found (phi: 0.732050807569 tablesize: 4 )
missed indices [3]

multihash1.py iscollision found (phi, tablesize), list [0, ..., m-1] when indexes collide with a certain combination of $ m $ and $ \ phi $. Lists the indexes for, and displays the values that did not appear in the index due to the collision.

Multiplying hashes (real values) caused index collisions, but it is important to note that rounding errors can lead to incorrect results in these calculations using real values. Integer value) is implemented and compared.

Multiplication hash (integer value)

Based on page 6 of [DD 09], we define a multiplicative hash that calculates only with integer values.

** Definition 3 ** Given table size $ 2 ^ r = m $, odd $ a $ satisfying word bit length $ w $, $ 2 ^ {w-1} <a <2 ^ {w} $, respectively ** Multiply hash (integer value) ** is $ h_2 (n) = (an \ mod 2 ^ w) \ gg (w − r) $. However, $ t \ gg s $ is $ Represents an operation that shifts t $ to the right by $ s $.

$ a $ corresponds to $ \ phi $ in the multiplicative hash (real number), and odd numbers that are not very close to either $ 2 ^ {w−1} $ or $ 2 ^ w $ are recommended [DD 09] ..

The relationship between the multiplication hash (real value) and the multiplication hash (integer value) is explained from page 5 of [Weiss 14].

Here, the multiplication hash (real value) and the multiplication hash (integer value) are compared by a computer experiment. Indexes under the same conditions are displayed side by side, and it is confirmed whether an error has occurred in the multiplication hash (real value). The Python code is shown below.

multihash2.py


import math

def kmh2(n, a, r, w): # Knuth's multiplicative hash
    """
    defined in a handout of lecture 5 Hashing 1
    http://courses.csail.mit.edu/6.006/fall09/lecture_notes/lecture05.pdf
    """
    return ((a * n) % 2**w) >> (w-r)

def kmh(n, phi, tablesize): # Knuth's multiplicative hash
    """
    defined in a handout on hash functions 
    classes.engineering.wustl.edu/cse241/handouts/hash-functions.pdf
    """
    return int(math.floor(tablesize * (phi * n - math.floor(phi * n))))

def mmh(n, prime, tablesize): # Mercari's multiplicative hash
    return ((n * prime) % tablesize) + 1

def main():
    phis = [0.5*(math.sqrt(5)-1.0), 0.5*math.sqrt(2), math.sqrt(3)-1.0] # 1st: the golden ratio
    w = 16
    for phi in phis:
        a = 2*int(2**(w-1)*phi)+1
        for v in range(2, 6):
            print "======="
            tablesize = 2**v
            kmhvals = [kmh(x, phi, tablesize) for x in range(tablesize)]
            kmh2vals = [kmh2(x, a, v, w) for x in range(tablesize)]
            print kmhvals
            print kmh2vals

main()

The result of executing this code is shown below.

multihash2.txt


[0, 2, 0, 3]
[0, 2, 0, 3]
=======
[0, 4, 1, 6, 3, 0, 5, 2]
[0, 4, 1, 6, 3, 0, 5, 2]
=======
[0, 9, 3, 13, 7, 1, 11, 5, 15, 8, 2, 12, 6, 0, 10, 4]
[0, 9, 3, 13, 7, 1, 11, 5, 15, 8, 2, 12, 6, 0, 10, 4]
=======
[0, 19, 7, 27, 15, 2, 22, 10, 30, 17, 5, 25, 13, 1, 20, 8, 28, 16, 3, 23, 11, 31, 19, 6, 26, 14, 2, 21, 9, 29, 17, 5]
[0, 19, 7, 27, 15, 2, 22, 10, 30, 17, 5, 25, 13, 1, 20, 8, 28, 16, 3, 23, 11, 31, 19, 6, 26, 14, 2, 21, 9, 29, 17, 5]
=======
[0, 2, 1, 0]
[0, 2, 1, 0]
=======
[0, 5, 3, 0, 6, 4, 1, 7]
[0, 5, 3, 0, 6, 4, 1, 7]
=======
[0, 11, 6, 1, 13, 8, 3, 15, 10, 5, 1, 12, 7, 3, 14, 9]
[0, 11, 6, 1, 13, 8, 3, 15, 10, 5, 1, 12, 7, 3, 14, 9]
=======
[0, 22, 13, 3, 26, 17, 7, 30, 21, 11, 2, 24, 15, 6, 28, 19, 10, 0, 23, 13, 4, 27, 17, 8, 31, 21, 12, 2, 25, 16, 6, 29]
[0, 22, 13, 3, 26, 17, 7, 30, 21, 11, 2, 24, 15, 6, 28, 19, 10, 0, 23, 13, 4, 27, 17, 8, 31, 21, 12, 2, 25, 16, 6, 29]
=======
[0, 2, 1, 0]
[0, 2, 1, 0]
=======
[0, 5, 3, 1, 7, 5, 3, 0]
[0, 5, 3, 1, 7, 5, 3, 0]
=======
[0, 11, 7, 3, 14, 10, 6, 1, 13, 9, 5, 0, 12, 8, 3, 15]
[0, 11, 7, 3, 14, 10, 6, 1, 13, 9, 5, 0, 12, 8, 3, 15]
=======
[0, 23, 14, 6, 29, 21, 12, 3, 27, 18, 10, 1, 25, 16, 7, 31, 22, 14, 5, 29, 20, 11, 3, 26, 18, 9, 1, 24, 15, 7, 30, 22]
[0, 23, 14, 6, 29, 21, 12, 3, 27, 18, 10, 1, 25, 16, 7, 31, 22, 14, 5, 29, 20, 11, 3, 26, 18, 9, 1, 24, 15, 7, 30, 22]

In the variable settings I tried with this code, the multiplication hash (real number) and the multiplication hash (integer value) returned the same index and caused the same collision.

** Proposition 3 ** Multiplicative hash is not perfect.

Furthermore, looking at the list of indexes multihash2.txt, there are places where odd numbers and even numbers appear in succession, and Proposition 2 does not hold for $ h_1 (n) and h_2 (n) $. $ H_1 (n), h_2 (n) $ is perceived as sacrificing completeness while increasing the scatter of the index.

The differences between Mercari's hash function $ f (n) $ and the multiplication hash (integer value) $ h_2 (n) $ can be summarized in the following two points.

Finally

appendix

** Proof of Lemma 1 ** Proof by reductio ad absurdum. For some $ n, n'$, assume $ g (n) = g (n') $.

\begin{eqnarray*}
g(n')-g(n)
&=&(an' - an)\mod m\\
&=&a(n'-n)\mod m\\
&=&0
\end{eqnarray*}

Since $ a and m $ are relatively prime, $ n'-n $ must be a multiple of $ m $ to make this difference 0. But $ n, n'$ From the definition of $ 0 <n'-n <m $. Therefore, there is no such pair of $ n, n'$. $ \ Box $

** Another proof of Lemma 1 ** The inverse element of $ p $ under the method $ m $ is written as $ p ^ {-1} $. $ N \ neq n'(0 \ leq n, n'\ For leq m-1) $, assume that $ f (n) = f (n') $ holds. At this time,

\begin{eqnarray*}
n\mod m
&=&
(p^{-1}np)\mod m\\
&=&
(p^{-1}f(n)\mod m)\\
&=&
(p^{-1}f(n')\mod m)\\
&=&
(p^{-1}n'p)\mod m\\
&=&
n'\mod m
\end{eqnarray*}

From $ n = n'$ holds, which is contrary to the choice of $ n, n'$. Therefore, the assumption is incorrect. $ \ Box $

** Proof of Proposition 1 ** Using the function $ g (n) $ of Lemma 1, we can write $ f (n) = g (n) + 1 $. However, because the domain is different, $ f (m) = g (0) $. There is a one-to-one correspondence between $ f $ and $ g $, with different $ g (n) $ values for different $ n (0 \ leq n \ leq m-1) $. For different $ n (1 \ leq n \ leq m) $, $ f (n) $ also takes different values, as we know from Lemma 1 to return. $ \ Box $

** Proof of system 1 ** Clear from the conditions of Lemma 1. $ \ Box $

** Proof of Proposition 2 ** Let $ q $ be a positive integer that satisfies $ p = 2q + 1 $.

(1) When $ n $ is odd. $ k $ is an integer that satisfies $ n = 2k + 1 $ ($ 0 \ leq k <m / 2 $). The following holds for $ f (2k + 1) $.

\begin{eqnarray*}
f(2k+1) 
&=& ((2k+1)(2q+1)) \mod m + 1\\
&=& (4kq \mod m)+(2(k+q) \mod m) + (1 \mod m) + 1\\
&=& (4kq \mod m)+(2(k+q) \mod m) + 2.
\end{eqnarray*}

Since both $ 4kq $ and $ 2 (k + q) $ are even, the remainder of $ m $ for them is also even, and $ f (2k + 1) $, which is the sum of even numbers, is even.

(2) When $ n $ is an even number. $ k $ is an integer that satisfies $ n = 2k $ ($ 0 <k \ leq m / 2 $). The following holds for $ f (2k) $.

\begin{eqnarray*}
f(2k) 
&=& (2k(2q+1)) \mod m + 1\\
&=&  (4kq \mod m)+(2k \mod m)  + 1.
\end{eqnarray*}

Since both $ 4kq and 2k $ are even, the remainder of their $ m $ is also even, and $ f (2k) $ plus 1 is odd. $ \ Box $

** Proof of Proposition 3 ** Prove by the counterexample obtained by the above computer experiment. $ R = 2, m = 4, w = 16, a = 2 \ cdot \ floorloor 2 ^ {w-1} (\ sqrt {5} -1) / 2) \ rfloor + 1 = 40503 $. At this time, $ h_2 (0) = h_2 (2) = 0 $ and the indexes collide. Also, $ a $ is a prime number. Even if you replace it with $ 40507 $, $ h_2 (0) = h_2 (2) = 0 $. $ \ Box $

References

All the literature on the net was browsed from August 29th to 31st, 2017. Also, [Knuth 73] was not browsed.

[Buhler 15] Buhler, J. Choosing hash functions, handout of CSE 241 (Algorithms and Data Structures), Dept. Computer Science and Engineering, Washington University in St. Louis, https://classes.engineering.wustl.edu/cse241/handouts/hash-functions.pdf, 2015.

[DD 09] Devadas, S. and Daskalakis, K. Hashing I: Chaining, Hash Functions, Handout of 6.006 (Introduction to Algorithms), Dept Electrical Engineering and Computer Science, Massachusetts Institute of Technology, https://courses.csail.mit.edu/6.006/fall09/lecture_notes/lecture05.pdf, 2009.

[FG 00] Fleck, M. and Kuenning, G. Hash Functions, Note for CS 70 (Data Structures and Program Development), Dept. Computer Science, Harvey Mudd College, https://www.cs.hmc.edu/~geoff/classes/hmc.cs070.200101/homework10/hashfuncs.html, 2000.

[Knuth 73] Knuth, D. The Art of Computer Programming, Volume 3: Sorting and Searching, Addison-Wesley, 1973.

[metal_unk 07] metal_unk, Proving that Knuth multiplicative hash is the smallest complete hash function, Mercari Tech Blog, http://tech.mercari.com/entry/2017/08/29/115047.

[mather314 17] mather314, Hatena Bookmark Comments, http://b.hatena.ne.jp/mather314/20170829#bookmark-344131614, 2017.

[smoking186 17] smoking186, Hatena Bookmark Comment, http://b.hatena.ne.jp/smoking186/20170829#bookmark-344131614, 2017.

[Weiss 14] Weiss, S. Chapter 5: Hashing and Hash Tables, Handout of CSci 335 (Software Design and Analysis III), Dept. Computer Science, Hunter College of the City University of New York, http://compsci.hunter.cuny.edu/~sweiss/course_materials/csci335/lecture_notes/chapter05.pdf, 2014.

Recommended Posts

Multiplicative hash is not perfect (proven)
Python round is not strictly round
Is time.time () not very accurate?
TypeError:'int' object is not subscriptable
Python list is not a list
NameError: name'__file__' is not defined