BloomFilter description and implementation sample (JAVA)

1. Bloom Filter description and example

1-1. Prerequisite use case

It's a little difficult, so I'll give you an example. It's neither a proper use case nor a reality.

Create an app that allows you to enter the other party's account number on your smartphone and send money
You cannot send money to a non-existent account number

In this case

Query the database of the server for the account number of the other party
If it does not exist, an error will be returned, and if it exists, the deposit will be processed there.

Will be.

Here, as follows If the app has a list of whether the account number exists or not There is no need to bother to inquire the database for the existence of the account number via the network.

account number
111-111-111
222-222-222
333-333-333
.........

** However, if the account number is huge, it is difficult for the smartphone app to retain that much information. ** ** ** In such a case, if you use Bloom Filter, you can have very compact information that "almost" tells you whether or not the account number exists. ** **

1-2. Mechanism

For more information, see [Wiki](https://en.wikipedia.org/wiki/%E3%83%96%E3%83%AB%E3%83%BC%E3%83%A0%E3%83%95% E3% 82% A3% E3% 83% AB% E3% 82% BF).

Prepare a record of a certain length that can have bit information of 0 and 1.
Apply a hash function to the account number and convert the resulting hash to a bit string of record length 01.
Set a bit at the corresponding position on the record

At this time, there are a plurality of positions where 1 stands.

0	0	0	0	0	1	1	0	0	0	0	0	1	0	0	0	0

If this is repeated for all account numbers, the number of positions where 1 stands will increase.

If you have this record when you enter the account number in the above app ** Exactly the same hash ⇒ Perform 01 bit string conversion and match the part where 1 stands with the record ** ** If all 1's are set, it means that the probability that the account number exists is extremely high. ** **

--Upper: Record with information on all account numbers --Lower: Hash ⇒ 01 conversion result of account number "111-111-111"

0	0	1	0	0	1	1	0	0	0	0	0	1	0	1	1	0
0	0	0	0	0	1	1	0	0	0	0	0	1	0	0	0	0

⇒ Since 1 is set in all the target positions, it is very likely that the record exists.

** "Extremely expensive" is not absolute. ** ** Because the position where 1 stands conflicts with other account numbers There is a possibility that 1 is set at that position by the bit string of other account number information.

--First line: Record with the following account number information --Second line: Bit information of the existing account number "111-111-111" --Third line: Bit information of existing account number "222-222-222"

0	0	1	0	0	1	1	0	0	0	0	0	1	0	1	1	0
0	0	0	0	0	1	1	0	0	0	0	0	1	0	0	0	0
0	0	1	0	0	0	0	0	0	0	0	0	0	0	1	1	0

Here, enter the non-existent account number "999-999-999" If the result of hash ⇒ 01 bit conversion is as follows, it is determined that the data exists by mistake.

--First line: Record with the following account number information (same as before) --Second line: Conversion result of non-existent account number "999-999-999"

0	0	1	0	0	1	1	0	0	0	0	0	1	0	1	1	0
0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	1	0

If there is even one 0 in the position where 1 stands, it can be 100% definitely determined that there is no data. Even if there is a 1 in the position where all the 1s stand, it cannot be 100% confirmed that there is data. ** We say that there are no false negatives, but there are false positives. ** **

Taking a long enough record will reduce false positives, If you take a record that is too long, the capacity will be squeezed.

** However, for 1 million account numbers (any number of digits) ** ** False positives will converge to about 1% if there is an area of about 1.2MB. ** ** ** If a region of about 1.8 MB can be prepared, false positives will converge to about 0.1%. ** **

So, with an app like the one above, you only need to keep a few records It will be possible to significantly suppress network access and database access.

In this case, every time a new account is added *
There is a point to download the updated record. *
When data is stored in multiple storages or segments *
It can also be used to efficiently search for data storage locations, but it is difficult to convey that *
I couldn't immediately think of any other good use cases that could be easily explained, so please forgive me. *

2. Bloom Filter implementation sample

I will post the implementation in Java as it is.

`MyBloomFilter.java`



import java.io.Serializable;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

/**
 *It has only the most basic functions of Bloom Filter
 * @author owl
 * @param <E>
 */
public class MyBasicBloomFilter<E> implements Serializable {

    private final BitSet bitFilter;                                             //Filter body
    private final int bitFilterSize;                                            //Filter size
    private int numberOfAddedElements;                                          //Current number of elements
    private final int countOfHashNumCreatePerOneElement;                        //Number of True bits per element (hashing ⇒ number of number conversions)

    static final Charset DEFAULT_CHARSET = Charset.forName("UTF-8");            

    /**
     *False positive level
     * High:High false positives, but compact filter size and good performance
     *Middle: Balance false positives with filter size and performance
     * Low:Keep false positives as low as possible even if the filter size is increased or the processing time is long.
     */
    public static enum FalsePositiveProbability{ High, Middle, Low }
    
    /**
     *Run message digest
     */
    static final MessageDigest digestFunction;
    static final String hashName = "MD5";                                       
    static {                                                                    
        MessageDigest tmp;
        try { tmp = java.security.MessageDigest.getInstance(hashName); } catch (NoSuchAlgorithmException e) { tmp = null; }
        digestFunction = tmp;
    }    

    /**
     *constructor
     * @param c Number of bits used per element
     * @param n (maximum expected number of elements)
     * @Number of True bits per param k element (hashing ⇒ number of number conversions)
     */
    public MyBasicBloomFilter(double c, int n, int k) {

        this.countOfHashNumCreatePerOneElement = k;
        
        //Element count*The bit size of Filter is determined by the number of bits used per element.
        this.bitFilterSize = (int)Math.ceil(c * n);
        
        numberOfAddedElements = 0;
        this.bitFilter = new BitSet(bitFilterSize);
        
    }


    /**
     * 
     * @param expectedItemCount
     * @param fpp 
     */
    public MyBasicBloomFilter(int expectedItemCount, FalsePositiveProbability fpp) {
        
        switch(fpp){
            case High : {
                this.bitFilterSize = (int)Math.ceil(expectedItemCount * 4.8); 
                break;
            }
            case Low : {
                this.bitFilterSize = (int)Math.ceil(expectedItemCount * (9.6 + 4.8));
                break;
            }
            default : {
                this.bitFilterSize = (int)Math.ceil(expectedItemCount * 9.6);
            }
        }
        this.countOfHashNumCreatePerOneElement
                = (int) Math.round((this.bitFilterSize / (double)expectedItemCount) * Math.log(2.0));
        numberOfAddedElements = 0;
        this.bitFilter = new BitSet(bitFilterSize);        
        
    }
    public MyBasicBloomFilter(int expectedItemCount) {
        this(expectedItemCount, FalsePositiveProbability.Middle);
    }
    
    /**
     *Hashing
     * @param data input data
     * @param hashCount Count
     * @return Numerical value obtained by hashing (bit standing position)
     */
    public static int[] createHashes(byte[] data, int hashCount) {
        int[] result = new int[hashCount];

        int k = 0;
        byte seed = 0;
        while (k < hashCount) {
            
            //Hashing
            byte[] digest;
            synchronized (digestFunction) {
                digestFunction.update(++seed);
                digest = digestFunction.digest(data);                
            }
        
            //Convert to int with 4Byte delimiter and pack in array
            for (int i = 0; i < digest.length/4 && k < hashCount; i++) {
                int h = 0;
                for (int j = (i*4); j < (i*4)+4; j++) {
                    h <<= 8;
                    h |= ((int) digest[j]) & 0xFF;
                }
                result[k] = h;
                k++;
            }
        }
        return result;
    }    
    
    /**
     *Add element
     * @param element 
     */
    public void add(E element) {
       add(element.toString().getBytes(DEFAULT_CHARSET));
    }    
    
    /**
     *Add element
     * @param bytes 
     */
    public void add(byte[] bytes) {
       int[] hashes = createHashes(bytes, countOfHashNumCreatePerOneElement);
       for (int hash : hashes) bitFilter.set(Math.abs(hash % bitFilterSize), true);
       numberOfAddedElements ++;
    }

    /**
     *Judgment
     * @param element element
     * @return 
     */
    public boolean contains(E element) {
        return contains(element.toString().getBytes(DEFAULT_CHARSET));
    }
    
    /**
     *Judgment
     * @Byte array of param bytes element
     * @return 
     */
    public boolean contains(byte[] bytes) {
        int[] hashes = createHashes(bytes, countOfHashNumCreatePerOneElement);
        for (int hash : hashes) {
            if (!bitFilter.get(Math.abs(hash % bitFilterSize))) {
                return false;
            }
        }
        return true;
    }    
    
    /**
     *Get filter
     * @return filter
     */
    public BitSet getBitSet() {
        return bitFilter;
    }

    /**
     *Get the number of added items
     * @number of return items
     */
    public int getAddItemCount(){
        return this.numberOfAddedElements;
    }
    
    public double getFalsePositiveProbability() {
        // (1 - e^(-k * n / m)) ^ k
        return 
            Math.pow((1 - Math.exp(-countOfHashNumCreatePerOneElement * (double) this.numberOfAddedElements
                / (double) bitFilterSize)), countOfHashNumCreatePerOneElement);
    }    
    
}

`Call sample`


import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang.RandomStringUtils;

/**
 *Try Bloom Filter
 * @author owl
 */
public class ExecBloomFilterSample {
    
    public static void main(String argv[]){
        MyBasicBloomFilter<String> bf = new MyBasicBloomFilter<>(1000000, MyBasicBloomFilter.FalsePositiveProbability.Low);

        List<String> addList = new ArrayList<>();           //Data to add to BloomFilter
        List<String> notAddList = new ArrayList<>();        //Data not added to Bloom Filter
        
        //Added 10,000 data to Bloom Filter
        //At the same time, generate 10,000 verification data that is not added to Bloom Filter.
        for(int i=0;i<2000000;i++){
            if(i % 2 == 0) {
                String s = RandomStringUtils.randomAlphabetic(24);
                bf.add(s);
                addList.add(s);
            } else {
                notAddList.add(RandomStringUtils.randomAlphabetic(24));
            }
        }

        //Output Bloom Filter status
        for(int i=0; i < bf.getBitSet().size(); i++ ){
            System.out.print(bf.getBitSet().get(i) ? "1" : "0");
            if(i>1000) break;
        }
        
        //Output Bloom Filter data size and calculated false positive probability
        System.out.println();
        System.out.println("Data size--> " + bf.getBitSet().size() + " | K-Byte number of bytes" + (double)((double)(bf.getBitSet().size())/8/1024));
        System.out.println("False positive probability--> " + String.format("%.5f", bf.getFalsePositiveProbability()));

        //Check if all the data added to Bloom Filter is positive
        //(If there is no problem, nothing is displayed)
        System.out.println();
        System.out.println("■■■■■ Start checking if all additional data is positive ■■■■■");
        addList
            .stream()
            .sequential()
            .filter(e -> !bf.contains(e))
            .forEach(e-> System.out.println("Missing judgment:" + e));
        
        //Outputs the count that data not added to Bloom Filter is falsely positive
        System.out.println("■■■■■ Start checking if non-additional data is judged as false positive ■■■■■");
        System.out.println("In 1000000 data..." +
            notAddList
                .stream()
                .sequential()
                .filter(e -> bf.contains(e))
                .count() + "A false positive was detected.");
        
    }
    
}

`Execution result`


000011100100001110011011100010101001100110111000100011001(abridgement)
Data size--> 14400000 | K-Byte number of bytes 1757.8125
False positive probability--> 0.00099

■■■■■ Start checking if all additional data is positive ■■■■■
■■■■■ Start checking if non-additional data is judged as false positive ■■■■■
In 1000000 data...994 false positives were detected.