Garbled umlaut characters

Symptom

A single letter with a diacritic, such as the umlaut used in German, is displayed as two garbled characters. For example, the place name Kärnten becomes KÃ¤rnten.

Most other letters are not garbled, so the problem is easy to miss (in fact, if you google "KÃ¤rnten" you will find plenty of garbled sites).

I ran into this problem when reading and writing the Exif metadata of an image in Java.

Cause

A string that was saved as UTF-8 has been read as ISO-8859-1.

Below is an example run in a Java REPL.

java> String s = new String("Kärnten")

java> byte[] iso = s.getBytes("ISO-8859-1")
byte[] iso = [75, -28, 114, 110, 116, 101, 110]

java> byte[] utf8 = s.getBytes("UTF-8")
byte[] utf8 = [75, -61, -92, 114, 110, 116, 101, 110]

Thus, "ä" is represented by 1 byte (`-28`) in ISO-8859-1 and by 2 bytes (`-61`, `-92`) in UTF-8. If you save the string as UTF-8 and then read the bytes as ISO-8859-1, `-61` is interpreted as "Ã" and `-92` is interpreted as "¤". Decoding the UTF-8 bytes that way therefore gives:

java> new String(utf8, "ISO-8859-1")
KÃ¤rnten

The two bytes of "ä" come out as the two characters "Ã¤".
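The byte-by-byte mapping can be checked with a small sketch like the one below. The class name `ByteView` and the helper `latin1Char` are made up for illustration. The letter is written as the escape `\u00e4` so that the example does not depend on the encoding of the source file itself.

```java
import java.nio.charset.StandardCharsets;

class ByteView {
    // Decode a single byte as ISO-8859-1, where every byte maps to exactly one character.
    static String latin1Char(byte b) {
        return new String(new byte[] { b }, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u00e4" is the letter a-umlaut.
        byte[] utf8 = "\u00e4".getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            // b & 0xFF turns Java's signed byte (e.g. -61) into its unsigned value (195 = 0xC3).
            System.out.printf("%4d  0x%02X  %s%n", b, b & 0xFF, latin1Char(b));
        }
    }
}
```

Whether the last column prints correctly depends on the console's own encoding, which is another place where the same mismatch can occur.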

The same applies to other letters with diacritics. For example, "ö" becomes "Ã¶" and "ü" becomes "Ã¼".
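To see how other umlauts are affected, the mix-up can be reproduced for several letters at once. This is only a sketch: `UmlautMojibake` and `garble` are hypothetical names, and `garble` simply encodes as UTF-8 and then deliberately decodes with the wrong charset.

```java
import java.nio.charset.StandardCharsets;

class UmlautMojibake {
    // Encode as UTF-8, then (wrongly) decode as ISO-8859-1, reproducing the mix-up.
    static String garble(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // a-umlaut, o-umlaut and u-umlaut, written as Unicode escapes.
        String[] letters = { "\u00e4", "\u00f6", "\u00fc" };
        for (String letter : letters) {
            System.out.println(letter + " -> " + garble(letter));
        }
    }
}
```

Each of these letters takes two bytes in UTF-8 and the first byte is always 0xC3, which is why the garbled form always starts with "Ã".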

Fix

The fix is simply to specify the same, correct character encoding for both reading and writing.

java> new String(utf8, "UTF-8");
Kärnten

java> new String(iso, "ISO-8859-1");
Kärnten
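If a string has already been garbled, the mix-up can often be undone by reversing it. The sketch below assumes the bytes really were UTF-8, that they were decoded as ISO-8859-1 (not a similar charset such as windows-1252), and that the text has not been altered since. `MojibakeRepair` and `repair` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Re-encode with the charset that was wrongly used for decoding,
    // then decode with the charset the bytes were actually written in.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "K\u00c3\u00a4rnten" is the garbled form of Kärnten shown earlier.
        String garbled = "K\u00c3\u00a4rnten";
        System.out.println(repair(garbled));
    }
}
```

Using the `StandardCharsets` constants instead of charset names as strings avoids typos in the name and the checked `UnsupportedEncodingException`.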

Reference

https://forum.httrack.com/readmsg/18923/indexhtml
