It's a little difficult, so I'll give you an example. It's neither a proper use case nor a reality.
In this case
Will be.
Here, as follows If the app has a list of whether the account number exists or not There is no need to bother to inquire the database for the existence of the account number via the network.
account number |
---|
111-111-111 |
222-222-222 |
333-333-333 |
......... |
** However, if the account number is huge, it is difficult for the smartphone app to retain that much information. ** ** ** In such a case, if you use Bloom Filter, you can have very compact information that "almost" tells you whether or not the account number exists. ** **
For more information, see [Wiki](https://en.wikipedia.org/wiki/%E3%83%96%E3%83%AB%E3%83%BC%E3%83%A0%E3%83%95% E3% 82% A3% E3% 83% AB% E3% 82% BF).
At this time, there are a plurality of positions where 1 stands.
0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
---|
If this is repeated for all account numbers, the number of positions where 1 stands will increase.
If you have this record when you enter the account number in the above app ** Exactly the same hash ⇒ Perform 01 bit string conversion and match the part where 1 stands with the record ** ** If all 1's are set, it means that the probability that the account number exists is extremely high. ** **
--Upper: Record with information on all account numbers --Lower: Hash ⇒ 01 conversion result of account number "111-111-111"
0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
⇒ Since 1 is set in all the target positions, it is very likely that the record exists.
** "Extremely expensive" is not absolute. ** ** Because the position where 1 stands conflicts with other account numbers There is a possibility that 1 is set at that position by the bit string of other account number information.
--First line: Record with the following account number information --Second line: Bit information of the existing account number "111-111-111" --Third line: Bit information of existing account number "222-222-222"
0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
Here, enter the non-existent account number "999-999-999" If the result of hash ⇒ 01 bit conversion is as follows, it is determined that the data exists by mistake.
--First line: Record with the following account number information (same as before) --Second line: Conversion result of non-existent account number "999-999-999"
0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
If there is even one 0 in the position where 1 stands, it can be 100% definitely determined that there is no data. Even if there is a 1 in the position where all the 1s stand, it cannot be 100% confirmed that there is data. ** We say that there are no false negatives, but there are false positives. ** **
Taking a long enough record will reduce false positives, If you take a record that is too long, the capacity will be squeezed.
** However, for 1 million account numbers (any number of digits) ** ** False positives will converge to about 1% if there is an area of about 1.2MB. ** ** ** If a region of about 1.8 MB can be prepared, false positives will converge to about 0.1%. ** **
So, with an app like the one above, you only need to keep a few records It will be possible to significantly suppress network access and database access.
In this case, every time a new account is added *
There is a point to download the updated record. *
When data is stored in multiple storages or segments *
It can also be used to efficiently search for data storage locations, but it is difficult to convey that *
I couldn't immediately think of any other good use cases that could be easily explained, so please forgive me. *
I will post the implementation in Java as it is.
MyBloomFilter.java
import java.io.Serializable;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
/**
*It has only the most basic functions of Bloom Filter
* @author owl
* @param <E>
*/
public class MyBasicBloomFilter<E> implements Serializable {
private final BitSet bitFilter; //Filter body
private final int bitFilterSize; //Filter size
private int numberOfAddedElements; //Current number of elements
private final int countOfHashNumCreatePerOneElement; //Number of True bits per element (hashing ⇒ number of number conversions)
static final Charset DEFAULT_CHARSET = Charset.forName("UTF-8");
/**
*False positive level
* High:High false positives, but compact filter size and good performance
*Middle: Balance false positives with filter size and performance
* Low:Keep false positives as low as possible even if the filter size is increased or the processing time is long.
*/
public static enum FalsePositiveProbability{ High, Middle, Low }
/**
*Run message digest
*/
static final MessageDigest digestFunction;
static final String hashName = "MD5";
static {
MessageDigest tmp;
try { tmp = java.security.MessageDigest.getInstance(hashName); } catch (NoSuchAlgorithmException e) { tmp = null; }
digestFunction = tmp;
}
/**
*constructor
* @param c Number of bits used per element
* @param n (maximum expected number of elements)
* @Number of True bits per param k element (hashing ⇒ number of number conversions)
*/
public MyBasicBloomFilter(double c, int n, int k) {
this.countOfHashNumCreatePerOneElement = k;
//Element count*The bit size of Filter is determined by the number of bits used per element.
this.bitFilterSize = (int)Math.ceil(c * n);
numberOfAddedElements = 0;
this.bitFilter = new BitSet(bitFilterSize);
}
/**
*
* @param expectedItemCount
* @param fpp
*/
public MyBasicBloomFilter(int expectedItemCount, FalsePositiveProbability fpp) {
switch(fpp){
case High : {
this.bitFilterSize = (int)Math.ceil(expectedItemCount * 4.8);
break;
}
case Low : {
this.bitFilterSize = (int)Math.ceil(expectedItemCount * (9.6 + 4.8));
break;
}
default : {
this.bitFilterSize = (int)Math.ceil(expectedItemCount * 9.6);
}
}
this.countOfHashNumCreatePerOneElement
= (int) Math.round((this.bitFilterSize / (double)expectedItemCount) * Math.log(2.0));
numberOfAddedElements = 0;
this.bitFilter = new BitSet(bitFilterSize);
}
public MyBasicBloomFilter(int expectedItemCount) {
this(expectedItemCount, FalsePositiveProbability.Middle);
}
/**
*Hashing
* @param data input data
* @param hashCount Count
* @return Numerical value obtained by hashing (bit standing position)
*/
public static int[] createHashes(byte[] data, int hashCount) {
int[] result = new int[hashCount];
int k = 0;
byte seed = 0;
while (k < hashCount) {
//Hashing
byte[] digest;
synchronized (digestFunction) {
digestFunction.update(++seed);
digest = digestFunction.digest(data);
}
//Convert to int with 4Byte delimiter and pack in array
for (int i = 0; i < digest.length/4 && k < hashCount; i++) {
int h = 0;
for (int j = (i*4); j < (i*4)+4; j++) {
h <<= 8;
h |= ((int) digest[j]) & 0xFF;
}
result[k] = h;
k++;
}
}
return result;
}
/**
*Add element
* @param element
*/
public void add(E element) {
add(element.toString().getBytes(DEFAULT_CHARSET));
}
/**
*Add element
* @param bytes
*/
public void add(byte[] bytes) {
int[] hashes = createHashes(bytes, countOfHashNumCreatePerOneElement);
for (int hash : hashes) bitFilter.set(Math.abs(hash % bitFilterSize), true);
numberOfAddedElements ++;
}
/**
*Judgment
* @param element element
* @return
*/
public boolean contains(E element) {
return contains(element.toString().getBytes(DEFAULT_CHARSET));
}
/**
*Judgment
* @Byte array of param bytes element
* @return
*/
public boolean contains(byte[] bytes) {
int[] hashes = createHashes(bytes, countOfHashNumCreatePerOneElement);
for (int hash : hashes) {
if (!bitFilter.get(Math.abs(hash % bitFilterSize))) {
return false;
}
}
return true;
}
/**
*Get filter
* @return filter
*/
public BitSet getBitSet() {
return bitFilter;
}
/**
*Get the number of added items
* @number of return items
*/
public int getAddItemCount(){
return this.numberOfAddedElements;
}
public double getFalsePositiveProbability() {
// (1 - e^(-k * n / m)) ^ k
return
Math.pow((1 - Math.exp(-countOfHashNumCreatePerOneElement * (double) this.numberOfAddedElements
/ (double) bitFilterSize)), countOfHashNumCreatePerOneElement);
}
}
Call sample
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang.RandomStringUtils;
/**
*Try Bloom Filter
* @author owl
*/
public class ExecBloomFilterSample {
public static void main(String argv[]){
MyBasicBloomFilter<String> bf = new MyBasicBloomFilter<>(1000000, MyBasicBloomFilter.FalsePositiveProbability.Low);
List<String> addList = new ArrayList<>(); //Data to add to BloomFilter
List<String> notAddList = new ArrayList<>(); //Data not added to Bloom Filter
//Added 10,000 data to Bloom Filter
//At the same time, generate 10,000 verification data that is not added to Bloom Filter.
for(int i=0;i<2000000;i++){
if(i % 2 == 0) {
String s = RandomStringUtils.randomAlphabetic(24);
bf.add(s);
addList.add(s);
} else {
notAddList.add(RandomStringUtils.randomAlphabetic(24));
}
}
//Output Bloom Filter status
for(int i=0; i < bf.getBitSet().size(); i++ ){
System.out.print(bf.getBitSet().get(i) ? "1" : "0");
if(i>1000) break;
}
//Output Bloom Filter data size and calculated false positive probability
System.out.println();
System.out.println("Data size--> " + bf.getBitSet().size() + " | K-Byte number of bytes" + (double)((double)(bf.getBitSet().size())/8/1024));
System.out.println("False positive probability--> " + String.format("%.5f", bf.getFalsePositiveProbability()));
//Check if all the data added to Bloom Filter is positive
//(If there is no problem, nothing is displayed)
System.out.println();
System.out.println("■■■■■ Start checking if all additional data is positive ■■■■■");
addList
.stream()
.sequential()
.filter(e -> !bf.contains(e))
.forEach(e-> System.out.println("Missing judgment:" + e));
//Outputs the count that data not added to Bloom Filter is falsely positive
System.out.println("■■■■■ Start checking if non-additional data is judged as false positive ■■■■■");
System.out.println("In 1000000 data..." +
notAddList
.stream()
.sequential()
.filter(e -> bf.contains(e))
.count() + "A false positive was detected.");
}
}
Execution result
000011100100001110011011100010101001100110111000100011001(abridgement)
Data size--> 14400000 | K-Byte number of bytes 1757.8125
False positive probability--> 0.00099
■■■■■ Start checking if all additional data is positive ■■■■■
■■■■■ Start checking if non-additional data is judged as false positive ■■■■■
In 1000000 data...994 false positives were detected.
Recommended Posts