Å (Angstrom) and NFC @ Python

NFC here is not Near Field Communication; it is Normalization Form C (Canonical Composition).

Unicode Normalization @ Wikipedia

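In Unicode, the Angstrom sign Å (U+212B) and the letter Å (U+00C5, A with ring above) are different characters, even though they look identical.

Going by its code point, the latter looks like it comes from Latin-1, and in fact it does (U+00C5 corresponds to byte 0xC5 in Latin-1).

Under NFC normalization, the Angstrom sign becomes A with ring above. Let's check this with Python.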

>>> import unicodedata
>>> ord(unicodedata.normalize('NFC', '\N{ANGSTROM SIGN}'))
197
>>> unicodedata.name(unicodedata.normalize('NFC', '\N{ANGSTROM SIGN}'))
'LATIN CAPITAL LETTER A WITH RING ABOVE'

unicodedata is a standard library module. As an aside, NFD normalization, which often comes up in connection with macOS, turns the sign into 2 characters.

>>> len(unicodedata.normalize('NFD', '\N{ANGSTROM SIGN}'))
2
>>> [ord(ch) for ch in unicodedata.normalize('NFD', '\N{ANGSTROM SIGN}')]
[65, 778]
>>> [unicodedata.name(ch) for ch in unicodedata.normalize('NFD', '\N{ANGSTROM SIGN}')]
['LATIN CAPITAL LETTER A', 'COMBINING RING ABOVE']
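For comparison, here is a small sketch that runs the same Angstrom sign through all four normalization forms. The variable names `angstrom` and `forms` are just for illustration.

```python
import unicodedata

angstrom = "\N{ANGSTROM SIGN}"

# Map each normalization form to the code points it produces.
forms = {
    form: [f"U+{ord(ch):04X}" for ch in unicodedata.normalize(form, angstrom)]
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, code_points in forms.items():
    print(form, code_points)
# NFC ['U+00C5']
# NFD ['U+0041', 'U+030A']
# NFKC ['U+00C5']
# NFKD ['U+0041', 'U+030A']
```

None of the four forms keeps U+212B.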

In theory, this conversion can cause problems. For example, Shift_JIS can represent the Angstrom sign but not "A with ring above". If you read text from a file saved as Shift_JIS, apply NFC normalization, and then try to save it as Shift_JIS again, the write can fail.

>>> with open('from.txt', encoding='shift_jis') as fr:
...    with open('to.txt', 'w', encoding='shift_jis') as fw:
...        fw.write(unicodedata.normalize('NFC', fr.read()))
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'shift_jis' codec can't encode character '\xc5' in position 0: illegal multibyte sequence

Leaving out the file reading, which is not the essential part:

>>> '\N{ANGSTROM SIGN}'.encode('shift_jis')
b'\x81\xf0'
>>> unicodedata.normalize('NFC', '\N{ANGSTROM SIGN}').encode('shift_jis')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'shift_jis' codec can't encode character '\xc5' in position 0: illegal multibyte sequence

More directly:

>>> '\N{LATIN CAPITAL LETTER A WITH RING ABOVE}'.encode('shift_jis')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'shift_jis' codec can't encode character '\xc5' in position 0: illegal multibyte sequence
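If you want to know in advance whether a text contains characters like this, something along the following lines might work. This is only a rough sketch: `broken_by_nfc` is a made-up helper name, and it normalizes one character at a time, which ignores compositions that span several characters.

```python
import unicodedata


def broken_by_nfc(text, encoding="shift_jis"):
    """Return characters that encode as-is but fail to encode after NFC."""

    def encodable(s):
        try:
            s.encode(encoding)
        except UnicodeEncodeError:
            return False
        return True

    # Simplification: each character is normalized on its own.
    return [
        ch
        for ch in text
        if encodable(ch) and not encodable(unicodedata.normalize("NFC", ch))
    ]


sample = "10 \N{ANGSTROM SIGN} = 1 nm"
print([unicodedata.name(ch) for ch in broken_by_nfc(sample)])
# ['ANGSTROM SIGN']
```

An empty result means NFC normalization should not introduce new encoding errors for that text, at least under the per-character assumption above.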

I learned this example from "Introduction to Character Code Technology for Programmers", but the book itself does not give the reason ("For some reason it is not", p. 353).

Then I happened to come across the following description on Wikipedia.

The unit symbol of Angstrom is this character, but Unicode and JIS X 0213 define it as a character different from the original character. However, the Unicode angstrom symbol U+212B is a compatibility character that exists only to maintain backward compatibility with older standards, and its use is not recommended. (From Wikipedia)

So there is a reason after all: the sign is meant to be used only for backward compatibility.
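The mapping behind this can be inspected with `unicodedata.decomposition`. A minimal sketch follows; the variable names are only for illustration.

```python
import unicodedata

sign = "\N{ANGSTROM SIGN}"
letter = "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}"

# The decomposition mapping is returned as a string of hex code points.
sign_mapping = unicodedata.decomposition(sign)
letter_mapping = unicodedata.decomposition(letter)

print(repr(sign_mapping))    # '00C5'
print(repr(letter_mapping))  # '0041 030A'
```

Neither string carries a tag in angle brackets such as `<compat>`, so as far as I can tell both are canonical mappings. That would explain why NFC and NFD, and not only NFKC and NFKD, replace the sign.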

Still, the fact that every Unicode normalization form makes this replacement is quite worrying. Characters are hard.

As an aside, if you search for either character in a browser, both are found. I assume the browser searches after normalizing with one of the four forms. I'm not sure whether this search behavior follows a specification common to all browsers.
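The same idea, normalizing both sides before comparing, can be sketched in Python. This is not a claim about how any browser implements its search; `same_after_nfc` is a made-up helper, and NFC is just one possible choice of form.

```python
import unicodedata


def same_after_nfc(a, b):
    """Compare two strings after bringing both into NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


sign = "\N{ANGSTROM SIGN}"
letter = "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}"

print(sign == letter)                # False
print(same_after_nfc(sign, letter))  # True
```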

When an end user's simple complaint, "This character is garbled," turns out to come from something like this, it makes me want to yelp. This is not somebody else's problem for me, because I work with a system on Windows where CP932, shift_jis, and UTF-8 are all mixed together.
