[PYTHON] The return value of len or unichr may change depending on whether the build is UCS-2 or UCS-4.

The standard Python 2 on Mac OS X is built with UCS-2, so the standard functions len and unichr return different values than on a UCS-4 build, which is what most Linux distributions use.

Behavior in UCS-2 build options

On a UCS-2 build, you can't just use len to find the number of characters in a string that contains U+10000 or later characters. Python installed with Homebrew is also built with UCS-2.

Check the value of sys.maxunicode to see whether UCS-2 was specified as the build option.

>>> import sys
>>> 0xFFFF == sys.maxunicode
True

Applying len to the following string (U+20BB7 U+91CE U+5BB6) gives a return value of 4.

>>> str = u'𠮷野家'
>>> 4 == len(str)
True

The internal representation of U+20BB7 is the surrogate pair U+D842 U+DFB7.

>>> 0xD842 == ord(str[0])
True
>>> 0xDFB7 == ord(str[1])
True
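As a cross-check, the code point can be recomputed from the two code units with plain integer arithmetic. The following is a minimal sketch; the helper name from_surrogate_pair is hypothetical and is not part of the standard library.

```python
def from_surrogate_pair(high, low):
    # Hypothetical helper: combine a high/low surrogate pair into a code point.
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")

    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(from_surrogate_pair(0xD842, 0xDFB7)))
```

Because it only uses integers, this behaves the same on UCS-2 and UCS-4 builds.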

Find the number of characters with UCS-2 in mind

Let's find the number of characters, using the fact that high surrogates range from U+D800 to U+DBFF. To keep the code simple, the case where a high or low surrogate is isolated is not considered. On a UCS-4 build, a plain for loop is enough.

# -*- coding: utf-8 -*-

import sys

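def utf8_len(str):
    # Count code points rather than UTF-16 code units.
    length = 0

    # UCS-4 build: every element is already one code point.
    if sys.maxunicode > 0xFFFF:
        for c in str:
            length += 1

        return length

    # UCS-2 build: step over each surrogate pair as one character.
    code_units = len(str)
    pos = 0

    while pos < code_units:

        cp = ord(str[pos])
        length += 1

        # A high surrogate (U+D800 to U+DBFF) is assumed to be followed
        # by its low surrogate; isolated surrogates are not handled.
        if 0xD800 <= cp <= 0xDBFF:
            pos += 2
        else:
            pos += 1

    return length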

Let's try the previous string again.

str = u'𠮷野家'
print(3 == utf8_len(str))
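As an alternative to walking the string by hand, the count can also be taken from an encoded form: UTF-32 uses exactly four bytes per code point, so the byte length divided by 4 gives the number of characters. The following is a sketch, not the method used above. The name count_code_points is hypothetical, and the approach assumes the UTF-32 codec joins surrogate pairs on a UCS-2 build.

```python
def count_code_points(text):
    # Hypothetical helper: 'utf-32-le' writes no byte order mark,
    # so the output is exactly 4 bytes per code point.
    return len(text.encode('utf-32-le')) // 4


# U+20BB7 U+91CE U+5BB6, written with escapes so the file stays ASCII.
sample = u'\U00020BB7\u91ce\u5bb6'
print(3 == count_code_points(sample))
```

The trade-off is a temporary copy of the whole string, and an isolated surrogate may make the encoder raise an error instead of being counted.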

As an exercise, let's modify the code a bit and define a function that applies a callback to each character.

# -*- coding: utf-8 -*-

import sys

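def utf8_each_char(str, func):

    if sys.maxunicode > 0xFFFF:
        # UCS-4 build: each element is already one character.
        for c in str:
            func(c)
    else:
        # UCS-2 build: pass each surrogate pair to func as one character.
        code_units = len(str)
        pos = 0
        buf = u''
        cp = -1

        # The loop must stay inside this branch, because code_units
        # and pos are only defined here.
        while pos < code_units:
            buf = str[pos]
            cp = ord(buf)

            # Isolated surrogates are not handled.
            if 0xD800 <= cp <= 0xDBFF:
                buf += str[pos+1]
                func(buf)
                pos += 2
            else:
                func(buf)
                pos += 1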

Let's display one character at a time. To use print inside a lambda expression, you need to import print_function from __future__ at the beginning of the file.

from __future__ import print_function

str = u'𠮷野家'
f = lambda c: print(c)
utf8_each_char(str, f)
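If a callback feels awkward, the same walk can be written as a generator, so that the caller can use an ordinary for loop. The following is a sketch with a hypothetical name, iter_chars. Unlike the function above, it does not check sys.maxunicode, and it yields an isolated surrogate on its own instead of assuming that a low surrogate always follows.

```python
def iter_chars(text):
    # Hypothetical helper: yield one character (1 or 2 code units) at a time.
    pos = 0
    code_units = len(text)

    while pos < code_units:
        unit = text[pos]
        is_high = 0xD800 <= ord(unit) <= 0xDBFF

        # Join a high surrogate with the low surrogate that follows it.
        if (is_high and pos + 1 < code_units
                and 0xDC00 <= ord(text[pos + 1]) <= 0xDFFF):
            yield text[pos:pos + 2]
            pos += 2
        else:
            yield unit
            pos += 1


# U+20BB7 U+91CE U+5BB6, written with escapes so the file stays ASCII.
for ch in iter_chars(u'\U00020BB7\u91ce\u5bb6'):
    print(repr(ch))
```

On a UCS-4 build the string contains no surrogate pairs, so every element is simply yielded as is.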

Generate characters from code points with UCS-2 in mind

The UCS-2 restriction also affects unichr, which generates a character from a code point integer: on a UCS-2 build it does not accept integers 0x10000 and beyond.

>>> unichr(0x20BB7)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Unicode escape sequences are not affected by the UCS-2 restriction.

>>> print(u"\U00020BB7")
𠮷
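Since escape sequences work on either build, one way to get a character from an integer is to build the escape text at run time and decode it with the unicode-escape codec. The following is a sketch under the assumption that the codec treats \U escapes the same way as the literal syntax; the name chr_via_escape is hypothetical.

```python
def chr_via_escape(cp):
    # Hypothetical helper: format the code point as a \UXXXXXXXX escape
    # and let the 'unicode-escape' codec turn it into a character.
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")

    escaped = '\\U%08x' % cp
    return escaped.encode('ascii').decode('unicode-escape')


print(chr_via_escape(0x20BB7) == u'\U00020BB7')
```

This is slower than doing the arithmetic directly, but it avoids writing the surrogate calculation by hand.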

The following is a user-defined function that takes the UCS-2 restriction into account.

# -*- coding: utf-8 -*-

import sys

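def utf8_chr(cp):
    # unichr works as is on a UCS-4 build, or for code points below 0x10000.
    if 0xFFFF < sys.maxunicode or cp < 0x10000:
        return unichr(cp)

    # UCS-2 build: split the code point into a surrogate pair.
    cp -= 0x10000
    high = (cp >> 10) | 0xD800    # upper 10 bits
    low = (cp & 0x3FF) | 0xDC00   # lower 10 bits

    return unichr(high) + unichr(low)

print(utf8_chr(0x20BB7))
print(utf8_chr(0x91CE))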
