Verification environment

Machine: MacBook Air
OS: macOS High Sierra
Swift: 4.0.3 (swiftlang-900.0.74.1 clang-900.0.39.2)
Clang: Apple LLVM version 9.0.0 (clang-900.0.39.2)

Overview

I think many people casually use the following method to convert a C string (char *) to a Swift or Objective-C string (String, NSString).

`String.swift`


init(cString: UnsafePointer<CChar>) // typealias CChar = Int8

First, I would like to introduce the cString passed to this method, that is, the C string. Next, on the NSString side the method above has actually been deprecated, and a method that also takes an encoding is provided instead, as shown below. I would like to think about which encoding value should be passed to it.

`String.swift`


init?(cString: UnsafePointer<CChar>, encoding enc: String.Encoding)

About C characters

C provides the char type to represent characters. A char variable stores a 1-byte value. A C character is written by enclosing it in single quotes, and a value enclosed in single quotes is called a character literal.

`C character.c`


#include <stdio.h>

int main(void) {
    char c = '*'; // Enclosing a character in single quotes (') converts it to its 1-byte ASCII value.
    printf("%c\n", c); // *
    printf("%zu\n", sizeof(c)); // 1
    return 0;
}

As the comment says, the above is really syntactic sugar for the following.

`C character.c`


#include <stdio.h>

int main(void) {
    char c = 42; // 42 is the ASCII code of '*'
    printf("%c\n", c); // *
    printf("%zu\n", sizeof(c)); // 1
    return 0;
}

In other words, a character literal is really "just a number".
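Because a character literal is a plain number, it can be compared and used in arithmetic like any other integer. The following is a minimal sketch, assuming an ASCII-compatible environment; the helper names is_ascii_digit and digit_value are hypothetical and are not part of any library.

```c
#include <stdbool.h>

/* Hypothetical helper: true when c is an ASCII decimal digit.
   This works only because character literals are plain numbers
   ('0' is 48 and '9' is 57 in ASCII). */
bool is_ascii_digit(char c) {
    return c >= '0' && c <= '9';
}

/* Hypothetical helper: converts an ASCII digit to its numeric value
   by subtracting the code of '0'. The caller must pass a digit. */
int digit_value(char c) {
    return c - '0';
}
```

Calling digit_value('7') from main subtracts 48 from 55, which is ordinary integer arithmetic on the two character codes.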

Also, if you enclose a multibyte character in single quotes as shown below, the generated number (*1) exceeds the size of char (1 byte), resulting in a compile error.

(*1: For the generated value, see the section "[When multibyte characters are stored in the C string](https://qiita.com/ysn551/items/446074b22103233edd95#c%E3%81%AE%E6%96%87%E5%AD%97%E5%88%97%E3%81%AB%E3%83%9E%E3%83%AB%E3%83%81%E3%83%90%E3%82%A4%E3%83%88%E6%96%87%E5%AD%97%E3%82%92%E6%A0%BC%E7%B4%8D%E3%81%97%E3%81%9F%E5%A0%B4%E5%90%88)" below.)

`C character.c`


// This source file is saved as UTF-8
int main(void) {
    char c = 'あ'; // error: character too large for enclosing character literal type
}

In other words, a multibyte character cannot be stored as-is in a char variable.

That a character literal boils down to 1-byte numbers can also be confirmed by storing a multi-character literal directly in an int (4 bytes), as shown below.

`The substance of a character literal is a 1-byte number.c`


#include <stdio.h>

int main(void) {
    int num = 'abcd'; // multi-character literal; its value is implementation-defined
    printf("%x\n", num); // 61626364
    return 0;
}

Printing num in hexadecimal gives "61626364". Read byte by byte, it decomposes into "61, 62, 63, 64", which are the ASCII codes of a, b, c, and d.
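The byte-by-byte reading can be sketched with shifts. This is only an illustration: byte_at is a hypothetical helper, and the value is written as an explicit hexadecimal constant because the value of a multi-character literal is implementation-defined.

```c
#include <stdint.h>

/* Hypothetical helper: returns the byte at position index of a 32-bit value,
   counting from the most significant byte (index 0) to the least
   significant byte (index 3). */
uint8_t byte_at(uint32_t value, int index) {
    return (uint8_t)(value >> (8 * (3 - index)));
}
```

For example, byte_at(0x61626364u, 0) yields 0x61, the ASCII code of a, and byte_at(0x61626364u, 3) yields 0x64, the ASCII code of d.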

Summary so far

A C character is represented by the char type.
The char type is a box that stores a 1-byte value.
A value enclosed in single quotes is called a character literal.
A character literal yields a number (the character's encoded value), which can be stored in a char.

About C strings

A C string is an array of type char. In other words, it is represented by an array that stores 1-byte values. A C string is written by enclosing characters in double quotes, and a value enclosed in double quotes is called a string literal.

`C string.c`


#include <stdio.h>

int main(void) {
    char str[] = "Hello"; // The element count can be omitted when the array is initialized in its declaration.
    printf("sizeof(str)/sizeof(char) = %zu\n", sizeof(str)/sizeof(char)); // 6
    return 0;
}

The string Hello used for initialization above has 5 characters, but the array has 6 elements. That is because the initializer is syntactic sugar for the following.

`About the C string.c`


#include <stdio.h>

int main(void) {
    char str[] = {'H','e','l','l','o','\0'}; // the trailing '\0' is the null character
    printf("sizeof(str)/sizeof(char) = %zu\n", sizeof(str)/sizeof(char)); // 6
    return 0;
}

In other words, the string literal "Hello" yields a char array of 6 elements, with the null character at the end.
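The difference between the visible length and the storage size can be checked with strlen, which counts the bytes before the null character. The sketch below is illustrative only; storage_size is a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes needed to store s as a C string,
   i.e. the bytes counted by strlen plus the terminating '\0'. */
size_t storage_size(const char *s) {
    return strlen(s) + 1;
}
```

For "Hello", strlen reports 5 while storage_size reports 6, the same value that sizeof gives for the array.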

Summary so far

A C string is an array of type char.
A value enclosed in double quotes is called a string literal.
A string literal yields a char array whose last element is the null character (the number 0).

When multibyte characters are stored in the C string

The previous section did not cover what happens when the array is initialized with a multibyte character, as shown below. Let's look at the results when the source file is saved as UTF-8 and when it is saved as Shift-JIS.

`About the C string.c`


#include <stdio.h>

int main(void) {
    char str[] = "あ"; // The element count can be omitted when the array is initialized in its declaration.
    int size = sizeof(str);
    for (int i = 0; i < size; i++) {
        printf("%hhx ", str[i]); // Check this output for each encoding
    }
    return 0;
}

Verification steps:

Open the editor
Change the encoding setting of the editor to Shift-JIS or UTF-8
Paste the source code and save
Compile with the clang compiler ($ cc file.c)
Execute ($ ./a.out)

Output when saved as UTF-8:

`case_utf8_result.txt`


e3 81 82 0

Output when saving in Shift-JIS:

`case_shift_jis_result.txt`


82 a0 0

You can check each of the values above by entering "あ" on a character-code lookup site and viewing the result.

In other words, the bytes produced by a C string literal containing multibyte characters match the encoding the text editor used to save the file.

This is a very natural result, because what we pass to the compiler is the "file" in which the source code is written, not the "source code" itself.

So when the file is saved as UTF-8, char str[] = "あ" can be regarded as syntactic sugar for the following.

`About the C string.c`


#include <stdio.h>

int main(void) {
    // e3 81 82 0
    char str[] = {0xe3, 0x81, 0x82, 0x0};
    printf("%s \n", str); // If the terminal encoding is UTF-8, "あ" is displayed.
    return 0;
}

When the terminal encoding is set to UTF-8 and the program above is run, "あ" is displayed. With Shift-JIS, the output is garbled. (The terminal encoding is set under Settings → Profiles → Advanced tab.)

In the result screenshot, the top is the output with Shift-JIS set and the bottom is the output with UTF-8 set.
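To make the comparison independent of how a source file is saved, the two byte sequences observed above can be written with explicit hex escapes. This is a sketch for illustration; the names a_utf8, a_sjis, and same_bytes are hypothetical.

```c
#include <string.h>

/* The same character "あ" written as explicit bytes, so the result does not
   depend on the encoding of this source file. */
const char a_utf8[] = "\xe3\x81\x82"; /* bytes observed when saved as UTF-8 */
const char a_sjis[] = "\x82\xa0";     /* bytes observed when saved as Shift-JIS */

/* Hypothetical helper: true (1) when two C strings hold identical bytes. */
int same_bytes(const char *lhs, const char *rhs) {
    return strcmp(lhs, rhs) == 0;
}
```

Here strlen(a_utf8) is 3 and strlen(a_sjis) is 2, and same_bytes(a_utf8, a_sjis) is 0: to C these are simply two different byte arrays, and only the reader of the bytes (the terminal, in this case) decides which character they represent.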

Summary so far

A char array can be initialized with a string literal containing multibyte characters.
The bytes generated by the string literal are those of the encoding the text editor used to save the file.
The compiler receives the file in which the source code is written, not the source code itself.

About the encoding specified when converting to a Swift string

I would like to use the Swift API below to convert a string returned from a C API to a Swift String. What should the encoding value specified here be?

`String.swift`


init?(cString: UnsafePointer<CChar>, encoding enc: String.Encoding) //CChar = Int8

The C program code to be verified is as follows.

`libc.c`


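#include <stdio.h>
#include <stdlib.h>

// Returns a string literal, so its bytes follow the encoding of this source file.
char* file_name(void) {
    return "hello.txt";
}

// Returns the first line of the file in a heap buffer; the caller must free() it.
char* new_file_header_str(void) {
    FILE *f = fopen(file_name(), "r");
    if (f == NULL) return NULL;

    char *str = calloc(256, sizeof(char));
    if (str != NULL) {
        fgets(str, 256, f); // Only the first line (at most 255 bytes)
    }
    fclose(f);
    return str;
}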

If you call the above from Swift, the C char * type is passed as the UnsafeMutablePointer<Int8> type (wrapped in an Optional).

First, consider converting the C string obtained from the file_name function to a Swift string. This function returns a string literal as-is. In other words, the encoding value used when converting it to a Swift string must be the same as the encoding of the libc.c file.

Next, what about the encoding value used to convert the C string obtained from the new_file_header_str function to a Swift string? Here, the contents of the hello.txt file are returned. So the encoding value specified here must be the same as the encoding in which the hello.txt file was saved.

Below is sample source code that saves the libc.c file as UTF-8 and the hello.txt file as Shift-JIS, and calls each function from Swift.

`get_str_from_c.swift`


import Foundation // provides String(cString:encoding:)

let name = file_name() // Optional<UnsafeMutablePointer<Int8>>
if let name = name,
    let converted = String(cString: name, encoding: .utf8) {
    print(converted)
}

let header = new_file_header_str() // Optional<UnsafeMutablePointer<Int8>>
if let header = header {
    defer { free(header) } // the buffer was allocated with calloc on the C side
    if let converted = String(cString: header, encoding: .shiftJIS) {
        print(converted)
    }
}
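Both conversions above go through a failable initializer, and one plausible reason for a failure is that the bytes are not valid in the specified encoding. The following C sketch illustrates the idea with a simplified structural UTF-8 check. It is an assumption-laden illustration, not how Swift or Foundation implements the conversion: looks_like_utf8 is a hypothetical helper, and it does not reject overlong forms or surrogate code points.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: simplified structural check that a C string
   consists of well-formed UTF-8 lead and continuation bytes.
   It does NOT reject overlong encodings or surrogates. */
bool looks_like_utf8(const char *s) {
    const unsigned char *p = (const unsigned char *)s;
    while (*p != 0) {
        size_t extra; /* number of continuation bytes expected */
        if (*p < 0x80)                extra = 0; /* ASCII */
        else if ((*p & 0xE0) == 0xC0) extra = 1; /* lead byte of 2-byte sequence */
        else if ((*p & 0xF0) == 0xE0) extra = 2; /* lead byte of 3-byte sequence */
        else if ((*p & 0xF8) == 0xF0) extra = 3; /* lead byte of 4-byte sequence */
        else return false; /* stray continuation byte or invalid lead byte */
        p++;
        for (size_t i = 0; i < extra; i++, p++) {
            /* Continuation bytes look like 10xxxxxx; '\0' fails this test,
               so the loop never reads past the terminator. */
            if ((*p & 0xC0) != 0x80) return false;
        }
    }
    return true;
}
```

With this check, the UTF-8 bytes e3 81 82 pass, while the Shift-JIS bytes 82 a0 fail because 0x82 cannot start a UTF-8 sequence. This is why a Shift-JIS string cannot simply be read as UTF-8.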

Please refer to the following for calling the C library. https://qiita.com/ysn551/items/83e06cf74ae628cb573c

Summary so far

When converting a C string to a Swift string, specify the encoding of the C source file if the string returned from C is a string literal. If the string was read from a file, specify the encoding in which that file was saved.

Python 3 string literals

As shown above, C string literals store encoded bytes directly, so they depend on the development environment. By the way, the Swift compiler can only compile UTF-8 files.

In Python 3, on the other hand, the value generated by a string literal is also a number, but it is a Unicode value (code point) rather than an encoded byte sequence. Therefore, there is no need to consider encoding when exchanging string literals between files.
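The relationship between encoded bytes and a Unicode value can be sketched in C by decoding UTF-8 by hand. This is a simplified illustration that assumes well-formed input and does no validation; first_code_point is a hypothetical helper.

```c
#include <stdint.h>

/* Hypothetical helper: decodes the first code point of a UTF-8 string.
   Simplified sketch: assumes well-formed input and does no validation. */
uint32_t first_code_point(const char *s) {
    const unsigned char *p = (const unsigned char *)s;
    if (p[0] < 0x80)                       /* 1 byte: 0xxxxxxx */
        return p[0];
    if ((p[0] & 0xE0) == 0xC0)             /* 2 bytes: 110xxxxx 10xxxxxx */
        return ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
    if ((p[0] & 0xF0) == 0xE0)             /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        return ((uint32_t)(p[0] & 0x0F) << 12) |
               ((uint32_t)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
    /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    return ((uint32_t)(p[0] & 0x07) << 18) |
           ((uint32_t)(p[1] & 0x3F) << 12) |
           ((uint32_t)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
}
```

For the UTF-8 bytes e3 81 82, this yields 0x3042, the Unicode code point of "あ". That number stays the same no matter which encoding the source file used, whereas the bytes differ per encoding.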

The verification with Python 3 is as follows. By the way, Python 2 string literals hold the encoded bytes, so the comparison fails when the source files are saved in different encodings.

Save the following shift_jis.py file in Shift-JIS encoding

`shift_jis.py`


# coding: shift_jis

word = "はじめまして"

Save the following utf8.py file as UTF-8 and execute it.

`utf8.py`


# coding: utf-8

import shift_jis as sh

# Both literals contain the same multibyte text, saved in different encodings
if sh.word == "はじめまして":
    print("true")
else:
    print("false")

When I run the above in python3, true is displayed, but in python2, false is displayed.

Final summary

I look forward to working with you in 2018. m (__) m
