In order to study algorithms and Java grammar, we have implemented the Boyer-Moore method, which is one of the string search algorithms! !!
Boyer-Moore Act (1976) Proposed by R.S.Boyer and J.S.Moore Algorithm faster than simple search algorithm Change the position to shift depending on the comparison result of characters Time complexity is less than the simple search algorithm
There are various other character string search algorithms such as the simple search method and the Rabin-Karp method.
Create a skip table before searching.
Search pattern: ighi
If so
pattern | i | g | h | i |
---|---|---|---|---|
Shift width | 3 | 2 | 1 | 0 |
However, if there are duplicate character strings, the smaller one has priority.
Therefore,
pattern | g | h | i | Default |
---|---|---|---|---|
Shift width | 2 | 1 | 0 | 4 |
It will be.
The default is provided to skip to the point where character comparisons do not overlap. This is the value when the pattern is counted as a natural number.
Search target | a | b | c | d | e | f | g | i | g | h | i |
---|---|---|---|---|---|---|---|---|---|---|---|
pattern | i | g | h | i |
The search starts from the beginning and compares the search target with the end of the pattern. Since the character string to be searched and the end of the search target are "d" and "i", skip the default "4".
Search target | a | b | c | d | e | i | g | i | g | h | i |
---|---|---|---|---|---|---|---|---|---|---|---|
pattern | i | g | h | i |
With "i" and "i", the Skip table is "0", so compare the adjacent characters. They are different because they are "g" and "h". Looking at the Skip table, when it is "g", Skip is "1". Skip only one.
Search target | a | b | c | d | e | i | g | i | g | h | i |
---|---|---|---|---|---|---|---|---|---|---|---|
pattern | i | g | h | i |
With "g" and "i", the Skip table is "2", so skip two.
Search target | a | b | c | d | e | i | g | i | g | h | i |
---|---|---|---|---|---|---|---|---|---|---|---|
pattern | i | g | h | i |
With "i" and "i", the Skip table is "0", so compare the adjacent strings. Since "h" and "h" are the same, compare the adjacent strings. Since "g" and "g" are the same, compare the adjacent strings. Since "i" and "i" are the same, the search is complete.
Now that you have an image, I'd like to implement it in code from here.
python
public final static int bmTableSize = 1_114_111;
public final static String txt = "Aiue Oppao"; //Search target
public final static char[] txtChar = txt.toCharArray(); //Split into char type
public final static String pattern = "Oppao"; //pattern
public final static char[] patternChar = pattern.toCharArray();
It was a little boring to be able to search only ASCII, so I'd like to support Japanese as well.
Looking at the Java reference Character type,
The Java SE API documentation uses Unicode code points for character values in the range U + 0000-U + 10FFFF and Unicode code units for 16-bit char values, which are UTF-16 encoding code units. For more information on Unicode terminology, see Unicode Glossary.
Character set: Unicode
And
Encoding scheme: UTF-16
You can see that
So I converted 10FFFF to decimal and set the table size (bmTableSize) to 1,114,111
.
BoyerMooreTableInit
public static int[] BoyerMooreTableInit(int[] table, char[] patternChar, int ptnLen) {
int count = 0;
/*For characters that are not in the pattern, shift the pattern character string length to the width.*/・ ・ ・ ①
for(count = 0; count < bmTableSize; count++){
table[count] = ptnLen;
}
/*Determine the shift width of the characters included in the pattern*/・ ・ ・ ②
for(count = 0; count < ptnLen; count++){
table[(int)patternChar[count]] = ptnLen - count - 1;
}
/*Debug output*/
System.out.printf("[table] : default: step=%d\n", ptnLen);
for(count = 0; count < bmTableSize; count++){
if(table[count] != ptnLen)
System.out.printf(" : char=%c: table[%03d]: step=%d\n",
(char)count,count,(int)table[count]);
}
return table;
}
Skip table,
pattern | O | Tsu | Pa | O |
---|---|---|---|---|
Shift width | 3 | 2 | 1 | 0 |
If there are duplicate characters, the smaller one will be given priority.
pattern | Tsu | Pa | O | Default |
---|---|---|---|---|
Shift width | 2 | 1 | 0 | 4 |
Eventually it will be like this.
Regarding (1), the default "4" is put in all the tables.
Regarding ②, table [(int) patternChar [count]] = ptnLen --count --1;
is
[O, pa, tsu, o] is included in patternChar [count]. If you cast this to an int type, the value converted from UTF-16 to decimal will be returned.
"Tsu" ... table[12387] = 2
"Pa" ... table[12401] = 1
"O" ... table[12362] = 0
"Default" ... table[Places other than the above three] = 4
BoyerMooreSearch
public static int BoyerMooreSearch(char[] txtChar, char[] patternChar) {
int table[] = new int[bmTableSize];
int txtLen = 0;
int ptnLen = 0;
int i = 0; /*Text comparison position*/
int j = 0; /*Pattern comparison position*/
txtLen = txtChar.length;
ptnLen = patternChar.length;
/*Create a shift table*/
table = BoyerMooreTableInit(table, patternChar, ptnLen);
/*Comparison processing*/
i = j = ptnLen - 1; /*Make the comparison position the end of the pattern*/
while((i < txtLen) && (j >= 0)){
PrintCompareProcess(txt, pattern, i, j);
if(txtChar[i] != patternChar[j]){
/*Refer to the shift table and set the next comparison position*/
i += next_step(table, txtChar[i], (ptnLen - j));
j = ptnLen - 1; /*Make the comparison position the end of the pattern*/
}else{
/*Since the characters match, collate the previous characters*/
j--;
i--;
}
}
if(j < 0) {
return i + 2;
}else {
return 0;
}
}
}
・ First comparison
Search target | Ah | I | U | e | Pa | Tsu | O | Tsu | Pa | O |
---|---|---|---|---|---|---|---|---|---|---|
pattern | O | Tsu | Pa | O |
Since the end "e" and "o" are different and the character string does not exist in the pattern, Skip 4 characters.
・ Second comparison
Search target | Ah | I | U | e | Pa | Tsu | O | Tsu | Pa | O |
---|---|---|---|---|---|---|---|---|---|---|
pattern | O | Tsu | Pa | O |
Since the end is the character that exists in the pattern and "tsu" = 2, Two-letter sentence is skipped.
・ Third comparison
Search target | Ah | I | U | e | Pa | Tsu | O | Tsu | Pa | O |
---|---|---|---|---|---|---|---|---|---|---|
pattern | O | Tsu | Pa | O |
Since the end is the character that exists in the pattern and "pa" = 1, One character sentence is skipped.
・ Fourth comparison
Search target | Ah | I | U | e | Pa | Tsu | O | Tsu | Pa | O |
---|---|---|---|---|---|---|---|---|---|---|
pattern | O | Tsu | Pa | O |
Since the end is "O" = 0, search for the next character.
・ ・ ・ Repeat below
In the end, it looks like this.
Main.java
import java.io.UnsupportedEncodingException;
public class Main {
public final static int bmTableSize = 1_114_111;
public final static String txt = "Aiue Pappa";
public final static char[] txtChar = txt.toCharArray();
public final static String pattern = "Oppao";
public final static char[] patternChar = pattern.toCharArray();
public static void main(String[] args) throws UnsupportedEncodingException {
int result;
System.out.printf("[text] :%s\n", txt);
System.out.printf("[pattern]:%s\n", pattern);
result = BoyerMooreSearch(txtChar, patternChar);
if (0 < result) {
System.out.println(result + "It was second! !!");
}else {
System.out.println("I couldn't find it");
}
}
public static int[] BoyerMooreTableInit(int[] table, char[] patternChar, int ptnLen) {
int count = 0;
/*For characters that are not in the pattern, shift the pattern character string length to the width.*/
for(count = 0; count < bmTableSize; count++){
table[count] = ptnLen;
}
/*Determine the shift width of the characters included in the pattern*/
for(count = 0; count < ptnLen; count++){
table[(int)patternChar[count]] = ptnLen - count - 1;
}
/*Debug output*/
System.out.printf("[table] : default: step=%d\n", ptnLen);
for(count = 0; count < bmTableSize; count++){
if(table[count] != ptnLen)
System.out.printf(" : char=%c: table[%03d]: step=%d\n",
(char)count,count,(int)table[count]);
}
return table;
}
public static void PrintCompareProcess(String txt, String pattern, int i, int j) {
int count = 0;
System.out.printf("-----------------------------------\n");
System.out.printf("[compare]:(text i=%d)(pattern j=%d)\n", i+1, j+1);
System.out.printf(" text :%s\n", txt);
/*Overlay patterns at comparison positions*/
System.out.printf(" pattern :");
for(count = 0; count < (i - j); count++) System.out.print(" "); //It shifts in full-width and half-width.
System.out.printf("%s\n", pattern);
/*Mark the comparison point*/
System.out.printf(" :");
for(count = 0; count < i; count++) System.out.printf(" ");
System.out.printf("^\n");
}
public static int next_step(int[] table, char target, int remain) {
/*Check to prevent loops*/
if(table[(int)target] > remain){
/*Get the value from the staggered table*/
return(table[(int)target]);
}else{
/*Advance to the next character at the point where the collation started*/
return(remain);
}
}
public static int BoyerMooreSearch(char[] txtChar, char[] patternChar) {
int table[] = new int[bmTableSize];
int txtLen = 0;
int ptnLen = 0;
int i = 0; /*Text comparison position*/
int j = 0; /*Pattern comparison position*/
txtLen = txtChar.length;
ptnLen = patternChar.length;
/*Create a shift table*/
table = BoyerMooreTableInit(table, patternChar, ptnLen);
/*Comparison processing*/
i = j = ptnLen - 1; /*Make the comparison position the end of the pattern*/
while((i < txtLen) && (j >= 0)){
PrintCompareProcess(txt, pattern, i, j);
if(txtChar[i] != patternChar[j]){
/*Refer to the shift table and set the next comparison position*/
i += next_step(table, txtChar[i], (ptnLen - j));
j = ptnLen - 1; /*Make the comparison position the end of the pattern*/
}else{
/*Since the characters match, collate the previous characters*/
j--;
i--;
}
}
if(j < 0) {
return i + 2;
}else {
return 0;
}
}
}
[text] :Aiue Pappa
[pattern]:Oppao
[table] : default: step=4
: char=O: table[12362]: step=0
: char=Tsu: table[12387]: step=2
: char=Pa: table[12401]: step=1
-----------------------------------
[compare]:(text i=4)(pattern j=4)
text :Aiue Pappa
pattern :Oppao
: ^
-----------------------------------
[compare]:(text i=8)(pattern j=4)
text :Aiue Pappa
pattern :Oppao
: ^
-----------------------------------
[compare]:(text i=10)(pattern j=4)
text :Aiue Pappa
pattern :Oppao
: ^
-----------------------------------
[compare]:(text i=9)(pattern j=3)
text :Aiue Pappa
pattern :Oppao
: ^
-----------------------------------
[compare]:(text i=8)(pattern j=2)
text :Aiue Pappa
pattern :Oppao
: ^
-----------------------------------
[compare]:(text i=7)(pattern j=1)
text :Aiue Pappa
pattern :Oppao
: ^
It was 7th! !!
I hope you find it helpful.
-Java Character type reference ・ [C Language Algorithm-BM Method](http://capm-network.com/?tag=C%E8%A8%80%E8%AA%9E%E3%82%A2%E3%83%AB%E3 % 82% B4% E3% 83% AA% E3% 82% BA% E3% 83% A0-BM% E6% B3% 95) -Unicode-compatible character code table -[Java] char type can be cast to int type
Recommended Posts