In order to study algorithms and Java grammar, we have implemented the Boyer-Moore method, which is one of the string search algorithms! !!

What is the Boyer-Moore method?

Boyer-Moore Act (1976) Proposed by R.S.Boyer and J.S.Moore Algorithm faster than simple search algorithm Change the position to shift depending on the comparison result of characters Time complexity is less than the simple search algorithm

There are various other character string search algorithms such as the simple search method and the Rabin-Karp method.

Skip table

Create a skip table before searching.

Search pattern: ighi

If so

pattern	i	g	h	i
Shift width	3	2	1	0

However, if there are duplicate character strings, the smaller one has priority.

Therefore,

pattern	g	h	i	Default
Shift width	2	1	0	4

It will be.

The default is provided to skip to the point where character comparisons do not overlap. This is the value when the pattern is counted as a natural number.

Search example

First time

Search target	a	b	c	d	e	f	g	i	g	h	i
pattern	i	g	h	i

The search starts from the beginning and compares the search target with the end of the pattern. Since the character string to be searched and the end of the search target are "d" and "i", skip the default "4".

Second time

Search target	a	b	c	d	e	i	g	i	g	h	i
pattern					i	g	h	i

With "i" and "i", the Skip table is "0", so compare the adjacent characters. They are different because they are "g" and "h". Looking at the Skip table, when it is "g", Skip is "1". Skip only one.

Third time

Search target	a	b	c	d	e	i	g	i	g	h	i
pattern						i	g	h	i

With "g" and "i", the Skip table is "2", so skip two.

4th

Search target	a	b	c	d	e	i	g	i	g	h	i
pattern								i	g	h	i

With "i" and "i", the Skip table is "0", so compare the adjacent strings. Since "h" and "h" are the same, compare the adjacent strings. Since "g" and "g" are the same, compare the adjacent strings. Since "i" and "i" are the same, the search is complete.

Now that you have an image, I'd like to implement it in code from here.

Variable definition

`python`


public final static int bmTableSize = 1_114_111;
public final static String txt = "Aiue Oppao"; //Search target
public final static char[] txtChar = txt.toCharArray(); //Split into char type
public final static String pattern = "Oppao"; //pattern
public final static char[] patternChar = pattern.toCharArray();

It was a little boring to be able to search only ASCII, so I'd like to support Japanese as well.

Looking at the Java reference Character type,

The Java SE API documentation uses Unicode code points for character values in the range U + 0000-U + 10FFFF and Unicode code units for 16-bit char values, which are UTF-16 encoding code units. For more information on Unicode terminology, see Unicode Glossary.

Character set: Unicode And Encoding scheme: UTF-16 You can see that

So I converted 10FFFF to decimal and set the table size (bmTableSize) to 1,114,111.

Create table

`BoyerMooreTableInit`


public static int[] BoyerMooreTableInit(int[] table, char[] patternChar, int ptnLen) {
        int count = 0;

        /*For characters that are not in the pattern, shift the pattern character string length to the width.*/・ ・ ・ ①
        for(count = 0; count < bmTableSize; count++){
            table[count] = ptnLen;
        }

        /*Determine the shift width of the characters included in the pattern*/・ ・ ・ ②
        for(count = 0; count < ptnLen; count++){
            table[(int)patternChar[count]] = ptnLen - count - 1;
        }

        /*Debug output*/
        System.out.printf("[table]  : default: step=%d\n", ptnLen);
        for(count = 0; count < bmTableSize; count++){
            if(table[count] != ptnLen)
                System.out.printf("         : char=%c: table[%03d]: step=%d\n",
                                 (char)count,count,(int)table[count]);
        }
   return table;
}

Skip table,

pattern	O	Tsu	Pa	O
Shift width	3	2	1	0

If there are duplicate characters, the smaller one will be given priority.

pattern	Tsu	Pa	O	Default
Shift width	2	1	0	4

Eventually it will be like this.

Regarding (1), the default "4" is put in all the tables.

Regarding ②, table [(int) patternChar [count]] = ptnLen --count --1; is

[O, pa, tsu, o] is included in patternChar [count]. If you cast this to an int type, the value converted from UTF-16 to decimal will be returned.

"Tsu" ... table[12387] = 2
"Pa" ... table[12401] = 1
"O" ... table[12362] = 0
"Default" ... table[Places other than the above three] = 4

Search implementation

`BoyerMooreSearch`


public static int BoyerMooreSearch(char[] txtChar, char[] patternChar) {
        int table[] = new int[bmTableSize];
        int txtLen = 0;
        int ptnLen = 0;
        int i = 0; /*Text comparison position*/
        int j = 0; /*Pattern comparison position*/

        txtLen = txtChar.length;
        ptnLen = patternChar.length;

        /*Create a shift table*/
        table = BoyerMooreTableInit(table, patternChar, ptnLen);

        /*Comparison processing*/
        i = j = ptnLen - 1; /*Make the comparison position the end of the pattern*/
        while((i < txtLen) && (j >= 0)){
            PrintCompareProcess(txt, pattern, i, j);

            if(txtChar[i] != patternChar[j]){
                /*Refer to the shift table and set the next comparison position*/
                i += next_step(table, txtChar[i], (ptnLen - j));
                j = ptnLen - 1;   /*Make the comparison position the end of the pattern*/
            }else{
                /*Since the characters match, collate the previous characters*/
                j--;
                i--;
            }
        }

        if(j < 0) {
            return i + 2;
        }else {
            return 0;
        }
    }
}

・ First comparison

Search target	Ah	I	U	e	Pa	Tsu	O	Tsu	Pa	O
pattern	O	Tsu	Pa	O

Since the end "e" and "o" are different and the character string does not exist in the pattern, Skip 4 characters.

・ Second comparison

Search target	Ah	I	U	e	Pa	Tsu	O	Tsu	Pa	O
pattern					O	Tsu	Pa	O

Since the end is the character that exists in the pattern and "tsu" = 2, Two-letter sentence is skipped.

・ Third comparison

Search target	Ah	I	U	e	Pa	Tsu	O	Tsu	Pa	O
pattern							O	Tsu	Pa	O

Since the end is the character that exists in the pattern and "pa" = 1, One character sentence is skipped.

・ Fourth comparison

Search target	Ah	I	U	e	Pa	Tsu	O	Tsu	Pa	O
pattern							O	Tsu	Pa	O

Since the end is "O" = 0, search for the next character.

・・・ Repeat below

In the end, it looks like this.

code

`Main.java`


import java.io.UnsupportedEncodingException;

public class Main {
	public final static int bmTableSize = 1_114_111;
	public final static String txt = "Aiue Pappa";
	public final static char[] txtChar = txt.toCharArray();
	public final static String pattern = "Oppao";
	public final static char[] patternChar = pattern.toCharArray();

	public static void main(String[] args) throws UnsupportedEncodingException {
		int result;

		System.out.printf("[text]   :%s\n", txt);
		System.out.printf("[pattern]:%s\n", pattern);

		result = BoyerMooreSearch(txtChar, patternChar);

		if (0 < result) {
			System.out.println(result + "It was second! !!");
		}else {
			System.out.println("I couldn't find it");
		}

	}

	public static int[] BoyerMooreTableInit(int[] table, char[] patternChar, int ptnLen) {
		int count = 0;

	    /*For characters that are not in the pattern, shift the pattern character string length to the width.*/
	    for(count = 0; count < bmTableSize; count++){
	    	table[count] = ptnLen;
	    }

	    /*Determine the shift width of the characters included in the pattern*/
	    for(count = 0; count < ptnLen; count++){
	    	table[(int)patternChar[count]] = ptnLen - count - 1;
	    }

	    /*Debug output*/
	    System.out.printf("[table]  : default: step=%d\n", ptnLen);
	    for(count = 0; count < bmTableSize; count++){
	        if(table[count] != ptnLen)
	        	System.out.printf("         : char=%c: table[%03d]: step=%d\n",
	        					 (char)count,count,(int)table[count]);
	    }

	    return table;
	}

	public static void PrintCompareProcess(String txt, String pattern, int i, int j) {
	    int count = 0;

	    System.out.printf("-----------------------------------\n");
	    System.out.printf("[compare]:(text i=%d)(pattern j=%d)\n", i+1, j+1);
	    System.out.printf(" text    :%s\n", txt);

	    /*Overlay patterns at comparison positions*/
	    System.out.printf(" pattern :");
	    for(count = 0; count < (i - j); count++) System.out.print("　"); //It shifts in full-width and half-width.
	    System.out.printf("%s\n", pattern);

	    /*Mark the comparison point*/
	    System.out.printf("         :");
	    for(count = 0; count < i; count++) System.out.printf("　");
	    System.out.printf("^\n");
	}

	public static int next_step(int[] table, char target, int remain) {
	    /*Check to prevent loops*/
	    if(table[(int)target] > remain){
	        /*Get the value from the staggered table*/
	        return(table[(int)target]);
	    }else{
	        /*Advance to the next character at the point where the collation started*/
	        return(remain);
	    }
	}

	public static int BoyerMooreSearch(char[] txtChar, char[] patternChar) {
		int table[] = new int[bmTableSize];
		int txtLen = 0;
		int ptnLen = 0;
		int i = 0; /*Text comparison position*/
		int j = 0; /*Pattern comparison position*/

		txtLen = txtChar.length;
		ptnLen = patternChar.length;

	    /*Create a shift table*/
		table = BoyerMooreTableInit(table, patternChar, ptnLen);

	    /*Comparison processing*/
	    i = j = ptnLen - 1; /*Make the comparison position the end of the pattern*/
	    while((i < txtLen) && (j >= 0)){
	    	PrintCompareProcess(txt, pattern, i, j);

	        if(txtChar[i] != patternChar[j]){
	            /*Refer to the shift table and set the next comparison position*/
	            i += next_step(table, txtChar[i], (ptnLen - j));
	            j = ptnLen - 1;   /*Make the comparison position the end of the pattern*/
	        }else{
	            /*Since the characters match, collate the previous characters*/
	            j--;
	            i--;
	        }
	    }

	    if(j < 0) {
	    	return i + 2;
	    }else {
		    return 0;
	    }
	}
}

Output result

[text]   :Aiue Pappa
[pattern]:Oppao
[table]  : default: step=4
         : char=O: table[12362]: step=0
         : char=Tsu: table[12387]: step=2
         : char=Pa: table[12401]: step=1
-----------------------------------
[compare]:(text i=4)(pattern j=4)
 text    :Aiue Pappa
 pattern :Oppao
         :　　　^
-----------------------------------
[compare]:(text i=8)(pattern j=4)
 text    :Aiue Pappa
 pattern :Oppao
         :　　　　　　　^
-----------------------------------
[compare]:(text i=10)(pattern j=4)
 text    :Aiue Pappa
 pattern :Oppao
         :　　　　　　　　　^
-----------------------------------
[compare]:(text i=9)(pattern j=3)
 text    :Aiue Pappa
 pattern :Oppao
         :　　　　　　　　^
-----------------------------------
[compare]:(text i=8)(pattern j=2)
 text    :Aiue Pappa
 pattern :Oppao
         :　　　　　　　^
-----------------------------------
[compare]:(text i=7)(pattern j=1)
 text    :Aiue Pappa
 pattern :Oppao
         :　　　　　　^
It was 7th! !!

I hope you find it helpful.

References

-Java Character type reference ・ [C Language Algorithm-BM Method](http://capm-network.com/?tag=C%E8%A8%80%E8%AA%9E%E3%82%A2%E3%83%AB%E3 % 82% B4% E3% 83% AA% E3% 82% BA% E3% 83% A0-BM% E6% B3% 95) -Unicode-compatible character code table -[Java] char type can be cast to int type

Boyer-Moore implementation in Java