How many bits is wchar_t? Wide character handling memo

A sudden question:

How many bits is wchar_t, the type defined in wchar.h in C?

Did you think it was 16 bits? (So did I.) In fact, it depends on the environment, and it is not always 16 bits.

(For the sake of simplicity, surrogate pairs are not considered here.)

Verification

main.c


#include <stdio.h>
#include <wchar.h>

int main() {
  wchar_t *s = L"ABCD";
  printf("%d %d\n", wcslen(s), sizeof(s[0]));
  return 0;
}

Compiling this with gcc on Linux, for example, gives

4 4

The wide string literal L"" and wcslen(), which counts the characters, are designed to go together, but the underlying wchar_t is not always the 16 bits you might expect: it is 16 bits on Windows but 32 bits with gcc on Linux.

What's wrong

Suppose that Unicode string data given from outside is represented as a null-terminated string with 16 bits per character (UTF-16). A program that prints the string and the number of characters might look like this:

#include <stdio.h>
#include <wchar.h>

int main() {
  char data[] = {0x40, 0x00, 0x41, 0x00, 0x42, 0x00, 0x00, 0x00}; // suppose this UTF-16LE data ("@AB" + terminator) is given from outside
  wchar_t *s = (wchar_t *)data;
  printf("%ls %zu\n", s, wcslen(s));
  return 0;
}

However, it is only in an environment where wchar_t is 16 bits that

@AB 3

is displayed as expected. In an environment where wchar_t is 32 bits, however, it behaves unexpectedly.
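
As a rough illustration, the following sketch models what a 32-bit wchar_t would see in that buffer, assuming a little-endian machine; it uses uint32_t and memcpy instead of calling wcslen(), so it does not overrun the buffer itself.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main() {
  // the same UTF-16LE bytes as above: '@', 'A', 'B', plus the 16-bit terminator
  unsigned char data[] = {0x40, 0x00, 0x41, 0x00, 0x42, 0x00, 0x00, 0x00};

  // reinterpreted as 32-bit units, the eight bytes yield only two elements
  uint32_t w[2];
  memcpy(w, data, sizeof w);

  // prints 0x00410040 0x00000042: neither element is zero, so wcslen()
  // applied to this data would keep scanning past the end of the buffer
  // looking for a 32-bit terminator (undefined behavior)
  printf("0x%08X 0x%08X\n", (unsigned)w[0], (unsigned)w[1]);
  return 0;
}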

What to do

In C++11, a type called char16_t was introduced to represent one UTF-16 code unit (surrogate pairs aside), and likewise char32_t for 32-bit characters. String literal prefixes for UTF-16 and UTF-32 (u"..." and U"...") were added as well.

The example at the beginning can be rewritten as follows (it becomes C++, though).

main.cpp


#include <stdio.h>
#include <string>

using namespace std;

int main() {
  char16_t s[] = u"ABCD"; // UTF-16 string literal
  printf("%d %d\n", char_traits<char16_t>::length(s), sizeof(s[0]));
  return 0;
}

Specify C++11 in the compile options:

terminal


g++ -std=c++11 main.cpp

4 2 is output regardless of the size of wchar_t.

Unlike wchar_t, printf() has no conversion specifier for char16_t strings. Output can be done like this instead (save the source code in UTF-8):

#include <string>
#include <codecvt>
#include <locale>
#include <iostream>

using namespace std;

int main() {
  char16_t s[] = u"AIUEO";
  wstring_convert<codecvt_utf8<char16_t>, char16_t> cv;
  cout << cv.to_bytes(s) << endl;
  return 0;
}

However, note the following about codecvt_utf8 (see: codecvt_utf8 - cpprefjp - C++ Japanese Reference):

- It is deprecated as of C++17.
- It requires GCC 5.1 or later.

wchar_t trap: For Java (JNI)

Java's char type, which represents a single character, is 16 bits. In JNI (Java Native Interface), when you want to return null-terminated UTF-16 string data as a Java String (jstring), it is easy to get burned by counting the characters with wcslen.

C++Code


// a jbyteArray (byte[]) argument arg is given
jbyte *arg_ptr = env->GetByteArrayElements(arg, NULL);
// wcslen() may give an unexpected result when wchar_t is not 16 bits
jstring ret_string = env->NewString((jchar *)arg_ptr, wcslen((wchar_t *)arg_ptr));
env->ReleaseByteArrayElements(arg, arg_ptr, 0);

A countermeasure might look like this (add the required header declarations and using namespace std; yourself):

jstring ret_string = env->NewString((jchar *)arg_ptr, char_traits<char16_t>::length((char16_t *)arg_ptr));

If you just need the length of a null-terminated UTF-16 string, you can also write the loop yourself, as sketched below.
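
As a minimal sketch, such a loop just counts 16-bit units up to (but not including) the terminating zero, which is the same thing char_traits<char16_t>::length() computes. The name utf16_len is only illustrative, not an existing API.

#include <stddef.h>

// counts 16-bit units up to (but not including) the terminating zero;
// a hand-rolled stand-in for char_traits<char16_t>::length()
static size_t utf16_len(const char16_t *s) {
  size_t n = 0;
  while (s[n] != 0) {
    ++n;
  }
  return n;
}

The JNI call above could then pass utf16_len((const char16_t *)arg_ptr) as the length.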

wchar_t trap: for Python (ctypes)

Python has a library called ctypes for calling C/C++ shared libraries (.dll, .so). Here, too, you can get burned when creating an array of Unicode characters from a byte string, manipulating it, or passing it to another function.

Python


import ctypes
wstr = ctypes.create_unicode_buffer(u"AIUEO")  # buffer of 6 wchar_t: 5 characters + terminating NUL
print(ctypes.sizeof(wstr))                     # 6 * sizeof(wchar_t)

In an environment where wchar_t is 16 bits, 12 is output; where it is 32 bits, 24 is output (six units including the terminating NUL, times 2 or 4 bytes each). The "environment" here appears to be that of the compiler used to build Python.

In practice, you may not use create_unicode_buffer() itself that often; it comes in handy for wchar_t * arguments when calling the Windows API.

Summary

wchar_t is scary. wcslen is scary. I hope fewer people underestimate them and get burned the way I did.
