The mystery of C language integer literals [Commentary]

Here is the explanation. If you don't know what this is about, read the previous article first.

Understand the specifications of C language!

Do you have a copy of the standard at hand? What, it's too expensive? Well, I don't have one either. It can't be helped, so look at a draft of the standard instead.

So why did the code in the previous article behave the way it did? All I can say is that it is the result of following the C language specification. Some of the behavior is environment-dependent, though. The results in the previous article came from Clang on Mac OS X (64-bit). That environment is LP64 (`int`: 32 bits, `long`: 64 bits, `void *`: 64 bits), so the explanation below assumes it. Note that the details differ slightly on ILP32 (32-bit environments) and LLP64 (64-bit VC++), although the same phenomenon should appear on an ordinary PC. Let's look at the parts of the specification involved.

`-` is a unary operator, not a literal

The first thing to note is `-`. When you write `-1`, the literal is only `1`, and `-` is a unary operator. In other words, it is interpreted exactly as if it were written `-(1)`. I suspect quite a few people don't know this. In many other programming languages too (Python, Ruby, and so on), `-` is treated as a unary operator and is not part of the literal. [^1]

[^1]: This is about integer literals, not floating-point literals.

Decimal integer literals are promoted automatically

"Integer literals in C are of type `int`, and adding a suffix such as `L`, `LL`, or `U` turns them into another type such as `long` or `unsigned int`." That understanding is not enough. In fact, a literal can be of type `long` even without an `L` suffix.

What this means is that if a decimal integer literal exceeds the range of the `int` type, it is automatically promoted and treated as a larger type.

Promotion order: `int` -> `long` -> `long long`

Integers that are too large to be represented by `long long` are an error (strictly speaking a constraint violation rather than a syntax error). [^2]

[^2]: In Clang, this produces a warning rather than an error and compilation continues, but such a program is still not valid C.

Octal and hexadecimal integer literals are promoted automatically, unsigned types included

Everything so far was about decimal literals. Octal and hexadecimal literals behave slightly differently. They are normally `int`, and they are likewise promoted automatically to a larger type when needed, but the candidate types include the unsigned types. In other words, they are promoted in the following order.

Promotion order: `int` -> `unsigned int` -> `long` -> `unsigned long` -> `long long` -> `unsigned long long`

Let's see the actual processing.

I mentioned that it depends on the environment, so let's look at the environment first. In Clang on Mac OS X, it looks like this:

The size of `char` is 8 bits (that is, 1 byte = 8 bits)
The size of `int` is 4 bytes (= 32 bits)
The size of `long` is 8 bytes (= 64 bits)
Negative integer values are represented in two's complement
INT_MAX = 2³¹-1 = 2147483647
INT_MIN = -2³¹ = -2147483648
UINT_MAX = 2³²-1 = 4294967295

About `sizeof (-2147483648)`

First, let's look at it from here.

Since `-` is a unary operator, `sizeof (-2147483648)` is interpreted as `sizeof (-(2147483648))`. Since 2147483648 exceeds `INT_MAX`, it cannot be represented by the `int` type. The automatic promotion of decimal integer literals therefore kicks in, and the literal becomes a `long`. The unary `-` operator does not change the type of an operand that is `int` or larger, so `-(2147483648)` is also of type `long`. Since `long` is 8 bytes, `sizeof (-(2147483648))` is `8`.

When you think about it, it's a matter of course.

About `(-2147483648 == -0x80000000)`

Well, next is this one.

Similarly, `(-2147483648 == -0x80000000)` is interpreted as `((-(2147483648)) == (-(0x80000000)))`. Start with `-(2147483648)`, which, as mentioned earlier, is of type `long`. So what does it look like in memory? (The machine is actually little-endian, but for clarity the bytes are written big-endian style. Note that the addresses run in descending order.)

[address]           07 06 05 04 03 02 01 00
2147483648    -> 00 00 00 00 80 00 00 00 #long type
-(2147483648) -> FF FF FF FF 80 00 00 00 #long type

Next is `0x80000000`. It is hexadecimal, so if it doesn't fit in `int`, the next step is to check whether it fits in `unsigned int`. Since it is larger than `INT_MAX` and no larger than `UINT_MAX`, it is given the type `unsigned int`. And since the unary `-` operator does not change the type of an operand that is `int` or larger, `-(0x80000000)` is also `unsigned int`. Negating an unsigned value wraps around modulo 2³², so the bit pattern stays the same.

[address]           03 02 01 00
0x80000000    -> 80 00 00 00 #unsigned int type
-(0x80000000) -> 80 00 00 00 #unsigned int type

Next comes the `==` operator, and because the two operand types differ, a conversion takes place (the usual arithmetic conversions). Since `long` is the larger type, `-(0x80000000)` is converted to `long` as well. Because the source type is unsigned, the upper bits are simply filled with 0.

[address]                   07 06 05 04 03 02 01 00
(long)(-(0x80000000)) -> 00 00 00 00 80 00 00 00 #long type

This differs from `-(2147483648)`, so the result of the comparison is 0.

Why do the other cases work?

The other operations work because the values are either truncated to `int`, or converted from one signed type to another signed type, so nothing goes wrong. Also, `INT_MIN` is a macro, and it is defined in a form like `(-2147483647-1)` so that it properly has type `int`.

That's all. Did it make sense? My interpretation may well be off in places. The takeaway: don't mix signed and unsigned values, and use the macros for maximum and minimum values.

I want to see if it's true!

Finally, you can find out what type it actually is by using the following command. If you have Clang, give it a try.

clang -Xclang -ast-dump -fsyntax-only -std=c11 num_literal.c