It's a little difficult, so I'll give you an example. It's neither a proper use case nor a reality.
In this case
Will be.
Here, as follows If the app has a list of whether the account number exists or not There is no need to bother to inquire the database for the existence of the account number via the network.
| account number |
|---|
| 111-111-111 |
| 222-222-222 |
| 333-333-333 |
| ......... |
** However, if the account number is huge, it is difficult for the smartphone app to retain that much information. ** ** ** In such a case, if you use Bloom Filter, you can have very compact information that "almost" tells you whether or not the account number exists. ** **
For more information, see [Wiki](https://en.wikipedia.org/wiki/%E3%83%96%E3%83%AB%E3%83%BC%E3%83%A0%E3%83%95% E3% 82% A3% E3% 83% AB% E3% 82% BF).
At this time, there are a plurality of positions where 1 stands.
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
|---|
If this is repeated for all account numbers, the number of positions where 1 stands will increase.
If you have this record when you enter the account number in the above app ** Exactly the same hash ⇒ Perform 01 bit string conversion and match the part where 1 stands with the record ** ** If all 1's are set, it means that the probability that the account number exists is extremely high. ** **
--Upper: Record with information on all account numbers --Lower: Hash ⇒ 01 conversion result of account number "111-111-111"
| 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
⇒ Since 1 is set in all the target positions, it is very likely that the record exists.
** "Extremely expensive" is not absolute. ** ** Because the position where 1 stands conflicts with other account numbers There is a possibility that 1 is set at that position by the bit string of other account number information.
--First line: Record with the following account number information --Second line: Bit information of the existing account number "111-111-111" --Third line: Bit information of existing account number "222-222-222"
| 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
Here, enter the non-existent account number "999-999-999" If the result of hash ⇒ 01 bit conversion is as follows, it is determined that the data exists by mistake.
--First line: Record with the following account number information (same as before) --Second line: Conversion result of non-existent account number "999-999-999"
| 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
If there is even one 0 in the position where 1 stands, it can be 100% definitely determined that there is no data. Even if there is a 1 in the position where all the 1s stand, it cannot be 100% confirmed that there is data. ** We say that there are no false negatives, but there are false positives. ** **
Taking a long enough record will reduce false positives, If you take a record that is too long, the capacity will be squeezed.
** However, for 1 million account numbers (any number of digits) ** ** False positives will converge to about 1% if there is an area of about 1.2MB. ** ** ** If a region of about 1.8 MB can be prepared, false positives will converge to about 0.1%. ** **
So, with an app like the one above, you only need to keep a few records It will be possible to significantly suppress network access and database access.
In this case, every time a new account is added *
There is a point to download the updated record. *
When data is stored in multiple storages or segments *
It can also be used to efficiently search for data storage locations, but it is difficult to convey that *
I couldn't immediately think of any other good use cases that could be easily explained, so please forgive me. *
I will post the implementation in Java as it is.
MyBloomFilter.java
import java.io.Serializable;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
/**
*It has only the most basic functions of Bloom Filter
* @author owl
* @param <E>
*/
public class MyBasicBloomFilter<E> implements Serializable {
private final BitSet bitFilter; //Filter body
private final int bitFilterSize; //Filter size
private int numberOfAddedElements; //Current number of elements
private final int countOfHashNumCreatePerOneElement; //Number of True bits per element (hashing ⇒ number of number conversions)
static final Charset DEFAULT_CHARSET = Charset.forName("UTF-8");
/**
*False positive level
* High:High false positives, but compact filter size and good performance
*Middle: Balance false positives with filter size and performance
* Low:Keep false positives as low as possible even if the filter size is increased or the processing time is long.
*/
public static enum FalsePositiveProbability{ High, Middle, Low }
/**
*Run message digest
*/
static final MessageDigest digestFunction;
static final String hashName = "MD5";
static {
MessageDigest tmp;
try { tmp = java.security.MessageDigest.getInstance(hashName); } catch (NoSuchAlgorithmException e) { tmp = null; }
digestFunction = tmp;
}
/**
*constructor
* @param c Number of bits used per element
* @param n (maximum expected number of elements)
* @Number of True bits per param k element (hashing ⇒ number of number conversions)
*/
public MyBasicBloomFilter(double c, int n, int k) {
this.countOfHashNumCreatePerOneElement = k;
//Element count*The bit size of Filter is determined by the number of bits used per element.
this.bitFilterSize = (int)Math.ceil(c * n);
numberOfAddedElements = 0;
this.bitFilter = new BitSet(bitFilterSize);
}
/**
*
* @param expectedItemCount
* @param fpp
*/
public MyBasicBloomFilter(int expectedItemCount, FalsePositiveProbability fpp) {
switch(fpp){
case High : {
this.bitFilterSize = (int)Math.ceil(expectedItemCount * 4.8);
break;
}
case Low : {
this.bitFilterSize = (int)Math.ceil(expectedItemCount * (9.6 + 4.8));
break;
}
default : {
this.bitFilterSize = (int)Math.ceil(expectedItemCount * 9.6);
}
}
this.countOfHashNumCreatePerOneElement
= (int) Math.round((this.bitFilterSize / (double)expectedItemCount) * Math.log(2.0));
numberOfAddedElements = 0;
this.bitFilter = new BitSet(bitFilterSize);
}
public MyBasicBloomFilter(int expectedItemCount) {
this(expectedItemCount, FalsePositiveProbability.Middle);
}
/**
*Hashing
* @param data input data
* @param hashCount Count
* @return Numerical value obtained by hashing (bit standing position)
*/
public static int[] createHashes(byte[] data, int hashCount) {
int[] result = new int[hashCount];
int k = 0;
byte seed = 0;
while (k < hashCount) {
//Hashing
byte[] digest;
synchronized (digestFunction) {
digestFunction.update(++seed);
digest = digestFunction.digest(data);
}
//Convert to int with 4Byte delimiter and pack in array
for (int i = 0; i < digest.length/4 && k < hashCount; i++) {
int h = 0;
for (int j = (i*4); j < (i*4)+4; j++) {
h <<= 8;
h |= ((int) digest[j]) & 0xFF;
}
result[k] = h;
k++;
}
}
return result;
}
/**
*Add element
* @param element
*/
public void add(E element) {
add(element.toString().getBytes(DEFAULT_CHARSET));
}
/**
*Add element
* @param bytes
*/
public void add(byte[] bytes) {
int[] hashes = createHashes(bytes, countOfHashNumCreatePerOneElement);
for (int hash : hashes) bitFilter.set(Math.abs(hash % bitFilterSize), true);
numberOfAddedElements ++;
}
/**
*Judgment
* @param element element
* @return
*/
public boolean contains(E element) {
return contains(element.toString().getBytes(DEFAULT_CHARSET));
}
/**
*Judgment
* @Byte array of param bytes element
* @return
*/
public boolean contains(byte[] bytes) {
int[] hashes = createHashes(bytes, countOfHashNumCreatePerOneElement);
for (int hash : hashes) {
if (!bitFilter.get(Math.abs(hash % bitFilterSize))) {
return false;
}
}
return true;
}
/**
*Get filter
* @return filter
*/
public BitSet getBitSet() {
return bitFilter;
}
/**
*Get the number of added items
* @number of return items
*/
public int getAddItemCount(){
return this.numberOfAddedElements;
}
public double getFalsePositiveProbability() {
// (1 - e^(-k * n / m)) ^ k
return
Math.pow((1 - Math.exp(-countOfHashNumCreatePerOneElement * (double) this.numberOfAddedElements
/ (double) bitFilterSize)), countOfHashNumCreatePerOneElement);
}
}
Call sample
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang.RandomStringUtils;
/**
*Try Bloom Filter
* @author owl
*/
public class ExecBloomFilterSample {
public static void main(String argv[]){
MyBasicBloomFilter<String> bf = new MyBasicBloomFilter<>(1000000, MyBasicBloomFilter.FalsePositiveProbability.Low);
List<String> addList = new ArrayList<>(); //Data to add to BloomFilter
List<String> notAddList = new ArrayList<>(); //Data not added to Bloom Filter
//Added 10,000 data to Bloom Filter
//At the same time, generate 10,000 verification data that is not added to Bloom Filter.
for(int i=0;i<2000000;i++){
if(i % 2 == 0) {
String s = RandomStringUtils.randomAlphabetic(24);
bf.add(s);
addList.add(s);
} else {
notAddList.add(RandomStringUtils.randomAlphabetic(24));
}
}
//Output Bloom Filter status
for(int i=0; i < bf.getBitSet().size(); i++ ){
System.out.print(bf.getBitSet().get(i) ? "1" : "0");
if(i>1000) break;
}
//Output Bloom Filter data size and calculated false positive probability
System.out.println();
System.out.println("Data size--> " + bf.getBitSet().size() + " | K-Byte number of bytes" + (double)((double)(bf.getBitSet().size())/8/1024));
System.out.println("False positive probability--> " + String.format("%.5f", bf.getFalsePositiveProbability()));
//Check if all the data added to Bloom Filter is positive
//(If there is no problem, nothing is displayed)
System.out.println();
System.out.println("■■■■■ Start checking if all additional data is positive ■■■■■");
addList
.stream()
.sequential()
.filter(e -> !bf.contains(e))
.forEach(e-> System.out.println("Missing judgment:" + e));
//Outputs the count that data not added to Bloom Filter is falsely positive
System.out.println("■■■■■ Start checking if non-additional data is judged as false positive ■■■■■");
System.out.println("In 1000000 data..." +
notAddList
.stream()
.sequential()
.filter(e -> bf.contains(e))
.count() + "A false positive was detected.");
}
}
Execution result
000011100100001110011011100010101001100110111000100011001(abridgement)
Data size--> 14400000 | K-Byte number of bytes 1757.8125
False positive probability--> 0.00099
■■■■■ Start checking if all additional data is positive ■■■■■
■■■■■ Start checking if non-additional data is judged as false positive ■■■■■
In 1000000 data...994 false positives were detected.
Recommended Posts