In UTF-16, the encoding Java uses for strings, a character is normally represented by one 16-bit code unit (2 bytes). As the number of characters covered by Unicode grew, the 65,536 values that fit in 2 bytes became insufficient, so characters outside that range are encoded with two code units (4 bytes). Such a 4-byte encoding is called a surrogate pair.
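As a rough illustration of the encoding described above, the following sketch splits a code point above U+FFFF into its two UTF-16 code units. The class name SurrogateDemo and the helper codeUnitsFor are made up for this example; U+20B9F is used only as a sample supplementary character.

```java
public class SurrogateDemo {
    // Hypothetical helper: number of UTF-16 code units (Java char values)
    // needed to represent the given code point.
    static int codeUnitsFor(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int cp = 0x20B9F; // a code point outside the 2-byte range

        // Character.toChars encodes a code point as UTF-16 code units.
        char[] units = Character.toChars(cp);
        System.out.println(units.length); // 2
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D842 DF9F

        // The first unit is the high surrogate, the second the low surrogate.
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true

        // An ordinary character such as 'A' needs only one code unit.
        System.out.println(codeUnitsFor('A')); // 1
    }
}
```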
The character "𠮟" (U+20B9F) is encoded as a surrogate pair, so the length method, which counts char values rather than characters, reports it as two characters.
Therefore, to count characters correctly in a string that contains surrogate pairs, use the codePointCount method instead of the length method.
var str1 = "Hello";
System.out.println(str1.length()); // Result: 5
// "𠮟る": U+20B9F (a surrogate pair) followed by U+308B
var str2 = "\uD842\uDF9F\u308B";
System.out.println(str2.length()); // Result: 3
// codePointCount returns the correct number of characters
System.out.println(str2.codePointCount(0, str2.length())); // Result: 2
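codePointCount method
/**
 * @param beginIndex index of the first char of the range (inclusive)
 * @param endIndex index after the last char of the range (exclusive)
 * @return the number of Unicode code points in the specified range
 */
public int codePointCount(int beginIndex, int endIndex)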
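The indices passed to codePointCount are char positions, not character positions, which is easy to get wrong. The sketch below counts code points over a whole string and over a subrange, and compares the result with String.codePoints(). The class name CodePointCountDemo and the helper countCodePoints are assumptions made for this example.

```java
public class CodePointCountDemo {
    // Hypothetical helper: counts the code points in the whole string.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a', then U+20B9F as a surrogate pair, then 'b'
        String s = "a\uD842\uDF9Fb";

        System.out.println(s.length());         // 4 (char values)
        System.out.println(countCodePoints(s)); // 3 (code points)

        // Indices 1 and 2 hold the surrogate pair; endIndex 3 is exclusive.
        System.out.println(s.codePointCount(1, 3)); // 1

        // codePoints() streams the same code points one by one.
        System.out.println(s.codePoints().count()); // 3
        s.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
    }
}
```

The last loop prints 61, 20b9f, and 62, one per line.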