While brushing up on character encodings, I realized that I had never paid attention to Unicode surrogate pairs. So I practiced building strings that contain surrogate pairs and counting their characters in Java, which I also use at work.
Up to Java 1.4, the standard API did not take surrogate pairs into account; Java 1.5 added methods that do. The test code below therefore calls both the pre-1.5 methods and the methods added in 1.5 and compares their behavior.
First, let's express surrogate pairs with the char type. The high surrogate and the low surrogate of each pair are held in separate char variables and then combined into an array.
char c1 = '\u3042'; // HIRAGANA LETTER A, cp=12354
char c2 = '\uD842'; // tuchi-yoshi (high), cp=134071
char c3 = '\uDFB7'; // tuchi-yoshi (low), cp=134071
char c4 = '\u30D5'; // katakana fu, cp=12501
char c5 = '\u309A'; // handakuten, cp=12442
char c6 = '\uD842'; // kuchi + shichi (high), cp=134047
char c7 = '\uDF9F'; // kuchi + shichi (low), cp=134047
String s = new String(new char[] { c1, c2, c3, c4, c5, c6, c7 });
assertEquals(s, "\u3042\uD842\uDFB7\u30D5\u309A\uD842\uDF9F");
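To make the relationship between a code point and its two char values concrete, here is a minimal sketch of the UTF-16 arithmetic. The class name `SurrogateSplit` and the method `split` are hypothetical names introduced only for this illustration; `Character.toChars` and `Character.toCodePoint` are the standard library methods (added in Java 1.5) that do the same conversion.

```java
// Sketch: derive the UTF-16 surrogate pair for a supplementary code point.
// SurrogateSplit and split() are illustrative names, not part of any library.
final class SurrogateSplit {

    // UTF-16 rule: subtract 0x10000, then split the remaining 20 bits
    // into two 10-bit halves.
    static char[] split(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] manual = split(0x20BB7);
        char[] library = Character.toChars(0x20BB7);
        System.out.printf("manual:  %04X %04X%n", (int) manual[0], (int) manual[1]);
        System.out.printf("library: %04X %04X%n", (int) library[0], (int) library[1]);
        // The reverse direction: combine a pair back into one code point.
        System.out.printf("U+%X%n", Character.toCodePoint(manual[0], manual[1]));
    }
}
```

For U+20BB7 both approaches should yield the pair D842 DFB7, which matches c2 and c3 above.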
Next, copy the string using `String.length()` and `String.charAt()`, which do not take surrogate pairs into account. The final `assertEquals()` shows that the result matches an `int[]` in which each surrogate pair is split into two elements. You can see that the high surrogate and the low surrogate are treated as independent characters and copied separately.
int len = s.length();
assertEquals(len, 7); // ignores surrogate pair :P
int[] actualCps = new int[len];
for (int i = 0; i < len; i++) {
    char c = s.charAt(i);
    actualCps[i] = (int) c;
}
// Ignores surrogate pairs... :(
// BUT JavaScript unicode escape in browser accepts this format...:(
assertEquals(actualCps, new int[] { 0x3042, 0xD842, 0xDFB7, 0x30D5, 0x309A, 0xD842, 0xDF9F });
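When code has to stay char-based, it can at least detect which char values are halves of a pair. The following is a rough sketch using `Character.isHighSurrogate` and `Character.isLowSurrogate` (both added in Java 1.5); the class `SurrogateScan` and its methods are hypothetical helpers written for this example only.

```java
import java.util.Arrays;

// Sketch: label each UTF-16 code unit so split pairs become visible.
// SurrogateScan, classify() and endsWithDanglingHigh() are illustrative names.
final class SurrogateScan {

    // Labels every char of s as HIGH, LOW, or BMP.
    static String[] classify(String s) {
        String[] kinds = new String[s.length()];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                kinds[i] = "HIGH";
            } else if (Character.isLowSurrogate(c)) {
                kinds[i] = "LOW";
            } else {
                kinds[i] = "BMP";
            }
        }
        return kinds;
    }

    // True if s ends with a high surrogate that has lost its low half,
    // which is what a char-index substring() can produce.
    static boolean endsWithDanglingHigh(String s) {
        return !s.isEmpty() && Character.isHighSurrogate(s.charAt(s.length() - 1));
    }

    public static void main(String[] args) {
        String s = "\u3042\uD842\uDFB7\u30D5\u309A\uD842\uDF9F";
        System.out.println(Arrays.toString(classify(s)));
        // Cutting at char index 2 separates the first pair.
        System.out.println(endsWithDanglingHigh(s.substring(0, 2)));
    }
}
```

A check like this is one way to notice that a char-index cut has landed in the middle of a pair before the broken string is passed on.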
Now use `String.codePointCount()` and `String.codePointAt()`, which do take surrogate pairs into account. The final `assertEquals()` shows that each surrogate pair comes back as a single value equal to the character's Unicode code point in hexadecimal. This confirms that a surrogate pair is counted and handled as one character.
int countOfCp = s.codePointCount(0, len);
assertEquals(countOfCp, 5); // GOOD.
actualCps = new int[countOfCp];
for (int i = 0, j = 0, cp; i < len; i += Character.charCount(cp)) {
    cp = s.codePointAt(i);
    actualCps[j++] = cp;
}
// GOOD.
assertEquals(actualCps, new int[] { 0x3042, 0x20BB7, 0x30D5, 0x309A, 0x20B9F });
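As a possible shorter variant of the loop above, assuming Java 8 or later is available (the article itself only covers the 1.5 API), `String.codePoints()` returns the code points as an `IntStream`, and `String.offsetByCodePoints()` (available since 1.5) converts a code point count into a char index. The class `CodePointStream` and its method names are hypothetical, chosen just for this sketch.

```java
import java.util.Arrays;

// Sketch: code point handling without a manual index loop.
// CodePointStream and its methods are illustrative names; codePoints()
// requires Java 8+, which is an assumption beyond the original article.
final class CodePointStream {

    // Collects all code points of s; a surrogate pair becomes one element.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Returns the first n code points of s without cutting a surrogate pair.
    // Throws IndexOutOfBoundsException if s has fewer than n code points.
    static String prefixByCodePoints(String s, int n) {
        int end = s.offsetByCodePoints(0, n); // char index just past n code points
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "\u3042\uD842\uDFB7\u30D5\u309A\uD842\uDF9F";
        for (int cp : toCodePoints(s)) {
            System.out.printf("U+%X%n", cp);
        }
        // Two code points occupy three chars here because the second is a pair.
        System.out.println(prefixByCodePoints(s, 2).length());
    }
}
```

The explicit `codePointAt` loop remains the option that works on Java 1.5 through 7; the stream form is only a convenience on newer runtimes.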