Terminal: MacBook Air
OS: macOS High Sierra
Swift: 4.0.3 (swiftlang-900.0.74.1 clang-900.0.39.2)
Clang: Apple LLVM version 9.0.0 (clang-900.0.39.2)
I think everyone casually uses the following method to convert C strings (char *) to Swift or Objective-C strings (String, NSString).
String.swift
init?(cString: UnsafePointer<CChar>) //typealias CChar = Int8
First, I would like to introduce the cString passed to this method, that is, the C string. Next, on the NSString side the above method has in fact been deprecated, and a method that also takes an encoding is provided instead, as shown below. I would then like to consider what encoding value should be passed to it.
String.swift
init?(cString: UnsafePointer<CChar>, encoding enc: String.Encoding)
C provides the char type to represent characters. A char variable can store a 1-byte value.
A C character is written by enclosing it in single quotes. A value enclosed in single quotes is called a character literal.
C letter.c
#include <stdio.h>

int main(void) {
    char c = '*'; // Enclosing a character in ' marks converts it to its 1-byte ASCII value
    printf("%c\n", c); // *
    printf("%zu\n", sizeof(c)); // 1
    return 0;
}
As mentioned in the comment, the above is actually syntactic sugar for the following.
letter c.c
#include <stdio.h>

int main(void) {
    char c = 42; // 42 is the ASCII code of '*'
    printf("%c\n", c); // *
    printf("%zu\n", sizeof(c)); // 1
    return 0;
}
In other words, the substance of a character literal is "just a number".
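Because a character literal is just a number, ordinary integer comparison and arithmetic work on it. The following is a minimal sketch that assumes an ASCII-compatible character set; the helper names `star_is_42` and `next_char` are hypothetical and introduced only for illustration.

```c
#include <assert.h>

/* Returns 1 when the character literal '*' and the number 42 are the same value.
   Assumes an ASCII-compatible character set. */
int star_is_42(void) {
    return '*' == 42;
}

/* Hypothetical helper: returns the character whose code is one larger.
   Relies on letters being consecutive, as they are in ASCII. */
char next_char(char c) {
    assert(c >= 0 && c < 127); /* stay inside the 7-bit ASCII range */
    return (char)(c + 1);
}
```

Under these assumptions, `next_char('a')` gives `'b'`, and `next_char(41)` gives `'*'`.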
Also, if you enclose a multibyte character in single quotes as shown below, the generated number (*1) exceeds the size of char (1 byte), resulting in a compile error.
(*1: For the generated value, see the section below, "[When multibyte characters are stored in the C string](https://qiita.com/ysn551/items/446074b22103233edd95#c%E3%81%AE%E6%96%87%E5%AD%97%E5%88%97%E3%81%AB%E3%83%9E%E3%83%AB%E3%83%81%E3%83%90%E3%82%A4%E3%83%88%E6%96%87%E5%AD%97%E3%82%92%E6%A0%BC%E7%B4%8D%E3%81%97%E3%81%9F%E5%A0%B4%E5%90%88)".)
C letter.c
// This source file is saved in UTF-8
int main(void) {
    char c = 'あ'; // error: character too large for enclosing character literal type
}
In other words, multi-byte characters cannot be stored as they are in char type variables.
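To hold such a character, several char elements are needed. The sketch below writes the UTF-8 bytes of "あ" (e3 81 82) as hex escapes so that the result does not depend on how the file is saved; the names `A_UTF8` and `byte_length` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The three UTF-8 bytes of "あ", written as escapes so the array contents
   do not depend on the encoding in which this file is saved. */
static const char A_UTF8[] = "\xe3\x81\x82";

/* Hypothetical helper: number of char elements a C string occupies,
   not counting the terminating null character. */
size_t byte_length(const char *s) {
    assert(s != NULL);
    return strlen(s);
}
```

Here `byte_length(A_UTF8)` is 3, so one visible character occupies three char elements, which is why it cannot fit into a single char.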
That each character in a character literal is a 1-byte number can also be seen by storing a literal of several characters directly in an int (4 bytes), as shown below.
The substance of a character literal is a 1-byte number.c
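#include <stdio.h>

int main(void) {
    int num = 'defg'; // multi-character constant; its value is implementation-defined (result shown for clang)
    printf("%x\n", num); // 64656667
    return 0;
}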
Printing the value of num in hexadecimal gives "64656667"; reading it one byte at a time, you can see that it decomposes into "64, 65, 66, 67", one byte per character.
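The byte-by-byte reading described above can be sketched with shifts and masks. This works on the numeric value itself, so it does not depend on how a compiler evaluates a multi-character constant; the helper name `split_bytes` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: splits a 32-bit value into its 4 bytes,
   most significant byte first. */
void split_bytes(uint32_t value, unsigned char out[4]) {
    assert(out != NULL);
    out[0] = (unsigned char)((value >> 24) & 0xFF);
    out[1] = (unsigned char)((value >> 16) & 0xFF);
    out[2] = (unsigned char)((value >> 8) & 0xFF);
    out[3] = (unsigned char)(value & 0xFF);
}
```

Applied to 0x64656667, this yields the four bytes 0x64, 0x65, 0x66, and 0x67.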
The char type is a box for storing a 1-byte value.
A char contains a number (an encoding value).

A C string is an array of char. In other words, it is represented by an array for storing 1-byte data.
A C string is written by enclosing characters in double quotes. A value enclosed in double quotes is called a string literal.
C string.c
#include <stdio.h>

int main(void) {
    char str[] = "Hello"; // The number of elements can be omitted when the array is initialized at declaration
    printf("sizeof(str)/sizeof(char) = %zu\n", sizeof(str)/sizeof(char)); // 6
    return 0;
}
The string Hello used for initialization above has 5 characters, but the array has 6 elements. In fact, it is syntactic sugar for the following.
About the C string.c
#include <stdio.h>

int main(void) {
    char str[] = {'H','e','l','l','o','\0'}; // the null character terminates the string
    printf("sizeof(str)/sizeof(char) = %zu\n", sizeof(str)/sizeof(char)); // 6
    return 0;
}
In other words, the string literal "Hello" produces a char array of 6 elements that contains the null character at the end.
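The difference between the array size and the visible length can be checked with sizeof and strlen. This is a small sketch; the function names `hello_array_size` and `hello_length` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of char elements stored for "Hello", including the null character. */
size_t hello_array_size(void) {
    char str[] = "Hello";
    assert(str[sizeof(str) - 1] == '\0'); /* the last element is the terminator */
    return sizeof(str);
}

/* Number of characters before the null character. */
size_t hello_length(void) {
    char str[] = "Hello";
    return strlen(str);
}
```

`hello_array_size()` is 6 while `hello_length()` is 5; the one-element difference is the terminating null character.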
A C string is an array of char, with the last value containing the null character (the number 0).

Although not mentioned in the section above, let us look at what happens when a string is initialized with multibyte characters as shown below, with the source code saved once as a UTF-8 file and once as a Shift-JIS file.
About the C string.c
#include <stdio.h>

int main(void) {
    char str[] = "あ"; // The number of elements can be omitted when the array is initialized at declaration
    int size = sizeof(str);
    for (int i = 0; i < size; i++) {
        printf("%hhx ", str[i]); // Check this output for each file encoding
    }
    return 0;
}
Verification method: save the source file in each encoding, then compile and run it.
Output when saved in UTF8:
case_utf8_result.txt
e3 81 82 0
Output when saving in Shift-JIS:
case_shift_jis_result.txt
82 a0 0
The values above can be checked against a character code table: "あ" is E3 81 82 in UTF-8 and 82 A0 in Shift-JIS.
In other words, the bytes produced by a C string literal of multibyte characters match the encoding in which the text editor saved the file.
This is a very natural result because we are passing the "file" in which the source code is written to the compiler, not the "source code".
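One way to observe this from inside a program is to compare the bytes of a string against the two byte sequences seen above. The following is a rough sketch that only knows about "あ"; the helper name `encoding_of_a` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: reports which encoding of "あ" the given bytes match.
   The byte sequences are the ones printed in the verification above. */
const char *encoding_of_a(const char *s) {
    assert(s != NULL);
    if (strcmp(s, "\xe3\x81\x82") == 0) return "UTF-8";
    if (strcmp(s, "\x82\xa0") == 0) return "Shift-JIS";
    return "unknown";
}
```

Passing the literal "あ" from a source file to this function would report the encoding that file was saved in, assuming the compiler copies the bytes of the literal unchanged, as in the results above.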
So in UTF-8, char str[] = "あ" can be said to be syntactic sugar for the following.
About the C string.c
#include <stdio.h>

int main(void) {
    // e3 81 82 0
    char str[] = {(char)0xe3, (char)0x81, (char)0x82, 0x0};
    printf("%s \n", str); // If the terminal encoding is set to UTF-8, "あ" is displayed.
    return 0;
}
When the above is run with the terminal encoding set to UTF-8, "あ" is displayed. With Shift-JIS, the characters are garbled. (The setting is under Settings → Profiles → Advanced tab.)
In the results, the top is the output when Shift-JIS is set, and the bottom is the output when UTF-8 is set.
A char array can be initialized with a string literal of multibyte characters.

I would like to use the Swift API below to convert a string passed from a C API to a Swift String. What should the encoding value specified at this time be?
String.swift
init?(cString: UnsafePointer<CChar>, encoding enc: String.Encoding) //CChar = Int8
The C program code to be verified is as follows.
libc.c
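#include <stdio.h>
#include <stdlib.h>

// Returns a string literal; its bytes follow the encoding of this source file.
char* file_name() {
    return "hello.txt";
}

// Returns the first line of the file in a heap buffer; the caller must free() it.
char* new_file_header_str() {
    FILE *f = fopen(file_name(), "r");
    if (f == NULL) return NULL;
    char *str = calloc(256, sizeof(char));
    if (str != NULL) fgets(str, 256, f); // Only one line
    fclose(f);
    return str;
}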
If you call the above from Swift, the C char * type is passed as UnsafeMutablePointer<Int8> (wrapped in an Optional).

First, consider converting the C string obtained from the file_name function to a Swift string. This function returns a string literal as it is. In other words, the encoding value used when converting it to a Swift string must be the same as the encoding of the libc.c file.

Next, what about the encoding value used to convert the C string obtained from the new_file_header_str function to a Swift string? Here, the first line of the hello.txt file is returned. So the encoding value specified here must be the same as the encoding in which the hello.txt file is saved.
Below is sample source code that calls each function from Swift, with the libc.c file saved in UTF-8 and the hello.txt file saved in Shift-JIS.
get_str_from_c.swift
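import Foundation // provides String(cString:encoding:) and String.Encoding

let name = file_name() // Optional<UnsafeMutablePointer<Int8>>
if let name = name,
    let converted = String(cString: name, encoding: .utf8) { // libc.c is saved in UTF-8
    print(converted)
}

let header = new_file_header_str() // Optional<UnsafeMutablePointer<Int8>>
if let header = header {
    if let converted = String(cString: header, encoding: .shiftJIS) { // hello.txt is saved in Shift-JIS
        print(converted)
    }
    free(header) // the buffer was allocated with calloc on the C side
}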
Please refer to the following for calling the C library. https://qiita.com/ysn551/items/83e06cf74ae628cb573c
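Choosing the wrong encoding matters because bytes written in one encoding are often not even well-formed in another, in which case a conversion can fail or produce garbled text. The sketch below is a simplified structural check for UTF-8 (it does not reject every overlong or surrogate sequence); the helper name `is_valid_utf8` is hypothetical and is not how Swift itself performs the conversion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: simplified check that a C string is structurally
   valid UTF-8 (lead byte followed by the right number of continuation bytes). */
bool is_valid_utf8(const char *s) {
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    while (*p != 0) {
        size_t extra; /* number of continuation bytes expected */
        if (*p < 0x80) extra = 0;
        else if (*p >= 0xC2 && *p <= 0xDF) extra = 1;
        else if (*p >= 0xE0 && *p <= 0xEF) extra = 2;
        else if (*p >= 0xF0 && *p <= 0xF4) extra = 3;
        else return false; /* a continuation byte or invalid value in lead position */
        p++;
        for (size_t i = 0; i < extra; i++, p++) {
            /* also stops safely at the null character, which is not a continuation byte */
            if ((*p & 0xC0) != 0x80) return false;
        }
    }
    return true;
}
```

The UTF-8 bytes e3 81 82 pass this check, while the Shift-JIS bytes 82 a0 do not, because 0x82 cannot start a UTF-8 sequence.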
In this way, C string literals store encoding values directly, so they depend on the development environment. By the way, the Swift compiler can only compile UTF-8 files.
In Python 3, on the other hand, the value generated by a string literal is also a number, but it is a Unicode value rather than an encoding value. Therefore, there is no need to consider encoding when exchanging string literals between files.
Python 2, by contrast, uses the encoding value, so comparisons fail when the source files are saved in different encodings. The verification is as follows.
Save the following shift_jis.py file in Shift-JIS encoding.
shift_jis.py
#! coding=shift-jis
word = "はじめまして"  # multibyte string literal ("Nice to meet you")
Save the following utf8.py file in UTF-8 and execute it.
utf8.py
#! coding=utf-8
import shift_jis as sh
if sh.word == "はじめまして":  # same characters, different bytes on disk
    print("true")
else:
    print("false")
When I run the above in python3, true is displayed, but in python2, false is displayed.
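The difference between an encoding value and a Unicode value can be sketched in C by decoding the UTF-8 bytes of "あ" into its code point. This is a minimal sketch that handles only 3-byte UTF-8 sequences; the helper name `decode_utf8_3byte` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one 3-byte UTF-8 sequence into its Unicode
   code point. Returns 0 if the bytes are not a 3-byte UTF-8 sequence. */
uint32_t decode_utf8_3byte(const unsigned char b[3]) {
    assert(b != NULL);
    if ((b[0] & 0xF0) != 0xE0) return 0; /* lead byte must be 1110xxxx */
    if ((b[1] & 0xC0) != 0x80) return 0; /* continuation must be 10xxxxxx */
    if ((b[2] & 0xC0) != 0x80) return 0;
    return ((uint32_t)(b[0] & 0x0F) << 12)
         | ((uint32_t)(b[1] & 0x3F) << 6)
         | (uint32_t)(b[2] & 0x3F);
}
```

The bytes e3 81 82 decode to U+3042, the Unicode value of "あ". The same character saved in Shift-JIS has different bytes (82 a0) but the same Unicode value once decoded with the right encoding, which is why comparing Unicode values works across files.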
I look forward to working with you in 2018. m (__) m