A system call is the mechanism by which an application uses the functions provided by the OS, and knowing about system calls is important for understanding how an application works.
This is because almost everything important that an application does is realized through system calls. For example, network communication, file input/output, creating new processes, interprocess communication, and creating containers are all done with system calls. Conversely, the only things an application can do without system calls are computation on the CPU and reads and writes to memory.
This article covers the general nature and mechanics of system calls. I start by explaining what a system call is and how it is realized. I then show how to invoke a system call directly without going through a library, and dig into the kernel to look at how system calls are implemented.
The contents are as follows.
- [Purpose and role of system calls](https://qiita.com/sxarp/items/aff43dd83b0da69b92ce#%E3%82%B7%E3%82%B9%E3%83%86%E3%83%A0%E3%82%B3%E3%83%BC%E3%83%AB%E3%81%AE%E7%9B%AE%E7%9A%84%E3%81%A8%E5%BD%B9%E5%89%B2)
- [How the application calls the system call](https://qiita.com/sxarp/items/aff43dd83b0da69b92ce#%E3%82%A2%E3%83%97%E3%83%AA%E3%82%B1%E3%83%BC%E3%82%B7%E3%83%A7%E3%83%B3%E3%81%8C%E3%82%B7%E3%82%B9%E3%83%86%E3%83%A0%E3%82%B3%E3%83%BC%E3%83%AB%E3%82%92%E5%91%BC%E3%81%B3%E5%87%BA%E3%81%99%E4%BB%95%E7%B5%84%E3%81%BF)
- [System call implementation method and internal implementation](https://qiita.com/sxarp/items/aff43dd83b0da69b92ce#%E3%82%B7%E3%82%B9%E3%83%86%E3%83%A0%E3%82%B3%E3%83%BC%E3%83%AB%E3%81%AE%E5%AE%9F%E7%8F%BE%E6%96%B9%E6%B3%95%E3%81%A8%E5%86%85%E9%83%A8%E5%AE%9F%E8%A3%85)
According to Linux Kernel Development and other sources, introducing a system call mechanism has two advantages.
First, system calls provide applications with a simple, abstracted interface for manipulating hardware. This frees application code from having to know the details of the underlying hardware.
For example, the system call `write` is a common interface for writing a byte string to something.
A wide variety of things can be specified as the write target, that is, can be handled as files: pipes used for interprocess communication, sockets used for network communication, output devices such as monitors, and files on many kinds of file systems.
The basic philosophy of Linux and Unix is "Everything is a file", and the system calls used for input and output, including `write`, can be said to be a realization of it.
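As a small illustration of this common interface, the following C sketch sends the same bytes to a pipe and to a regular file through one helper. The names `send_bytes` and `demo_write_targets`, and the scratch path passed in by the caller, are made up for this example; only `write`, `read`, `pipe`, `open`, `close`, and `unlink` are real library calls.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write msg to whatever fd refers to; this code is the same for every target. */
static ssize_t send_bytes(int fd, const char *msg) {
    return write(fd, msg, strlen(msg));
}

/*
 * Writes the same 4 bytes to a pipe and to a regular file at `path`
 * (a hypothetical scratch path chosen by the caller).
 * Returns the number of bytes read back from the pipe, or -1 on error.
 */
ssize_t demo_write_targets(const char *path) {
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;

    int filefd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (filefd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    /* Same call, two very different kinds of target. */
    ssize_t to_pipe = send_bytes(pipefd[1], "Hi!\n");
    ssize_t to_file = send_bytes(filefd, "Hi!\n");

    char buf[16];
    ssize_t from_pipe = -1;
    if (to_pipe == 4 && to_file == 4)
        from_pipe = read(pipefd[0], buf, sizeof buf);

    close(pipefd[0]);
    close(pipefd[1]);
    close(filefd);
    unlink(path); /* remove the scratch file again */
    return from_pipe;
}
```

Calling `demo_write_targets` with any writable scratch path exercises both targets; `send_bytes` never needs to know which kind of object is behind the descriptor.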
Second, system calls mediate between applications and the resources managed by the OS, and prevent applications from using resources incorrectly or in ways that cause security problems. This lets applications use resources safely and securely.
One example of safe use of resources is a process allocating a memory area (using `mmap`). Memory is a hardware resource shared with other processes and the OS, and used incorrectly it could corrupt them. Because a process has to go through this system call when it wants a memory area, the OS can hand out memory to each process safely.
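A minimal sketch of such an allocation, assuming Linux with glibc: the process asks the kernel for anonymous memory with `mmap`, uses it, and gives it back with `munmap`. The function name `demo_mmap` is hypothetical.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS when compiling in strict ISO C mode */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Asks the kernel for `len` (> 0) bytes of private, zero-filled memory,
 * writes to it, and releases it. Returns 0 on success, -1 on failure.
 */
int demo_mmap(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len); /* the area belongs to this process only */
    int ok = ((unsigned char *)p)[len - 1] == 0xAB;

    if (munmap(p, len) == -1)
        return -1;
    return ok ? 0 : -1;
}
```

The process never picks a physical memory location itself; it only states how much it wants and with which protection, and the kernel decides where the area lives.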
An example of secure use of resources is file access control based on permission information.
What is the application doing when it invokes a system call, and how does that differ from a normal function call? This section explains how an application invokes a system call.
An application invokes a system call in three steps:
1. Specify the system call number.
2. Set the arguments.
3. Issue the instruction that invokes the system call.

The rest of this section explains these steps in detail, using code that prints characters to the screen (standard output) as an example.
The `write` system call is used for standard output and for writing to files. Let's use sample code that calls it to see the steps an application takes when invoking a system call.
First, start the gcc image, which is used to run the sample code, as a container.
$ docker container run -it --rm gcc /bin/bash
In the container, create a file `hi.s` containing the following assembly code. The code is explained later.
# cat <<EOF > hi.s
.intel_syntax noprefix
.global main
main:
push rbp
mov rbp, rsp
push 0xA216948
mov rax, 1
mov rdi, 1
mov rsi, rsp
mov rdx, 4
syscall
mov rsp, rbp
pop rbp
ret
EOF
When you compile and execute it, the characters are output to the screen as shown below.
# gcc -o hi.o hi.s; ./hi.o
Hi!
How system calls are invoked differs slightly depending on the CPU architecture. The code above is an example of calling the `write` system call on x86-64, the most popular architecture.
Before explaining the sample code, let's first look at how a system call is invoked on x86-64. Other architectures do not differ much.
Invoking a system call on x86-64 takes three steps:
1. Set the `rax` register to the system call number.
2. Set the arguments in registers.
3. Issue the `syscall` instruction.

In step 1, you store the number of the system call you want to use, called the system call number, in the register `rax`. (A register is a small piece of memory, 16 to 64 bits in size, built into the CPU, which the CPU can access extremely fast.) This number tells the kernel which system call to execute.
The mapping between system call numbers and system calls is fixed, for example:
System call number | System call name | Description |
---|---|---|
0 | read | Read |
1 | write | Write |
2 | open | Open a file |
57 | fork | Create a new process |
A comprehensive table with more than these examples can be found here.
In the sample code, the parts corresponding to this step are:
mov rax, 1
Next, in step 2, you store the arguments to pass to the system call in registers.
You can pass up to six arguments, using the registers `rdi`, `rsi`, `rdx`, `r10`, `r8`, and `r9` in order from the first argument.
For the `write` system call, the arguments and registers are as follows:
Register name | rdi | rsi | rdx |
---|---|---|---|
Argument | File descriptor | Start address of the byte string to write | Size of the byte string to write |
You can find out what arguments other system calls take [here](https://blog.rchapman.org/posts/Linux_System_Call_Table_for_x86_64/).
Incidentally, a file descriptor (set in the `rdi` register) is something like a port number: a per-process number that identifies an input/output channel of the process. By default, three file descriptors, 0 to 2, are available: 0 is standard input, 1 is standard output, and 2 is standard error.
When you open a file or socket, new file descriptors are assigned in sequence (3, 4, 5, and so on), and you can pass them as arguments to system calls such as `write` and `read`.
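The following C sketch shows descriptors being handed out by `open` and then used with `write`. The helper name `open_two` is made up for illustration, and `/dev/null` is used only because it exists on any Linux system.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Opens /dev/null twice and stores the two descriptors in out[0] and out[1].
 * Returns 0 on success, -1 on failure. The caller closes the descriptors.
 */
int open_two(int out[2]) {
    out[0] = open("/dev/null", O_WRONLY);
    if (out[0] == -1)
        return -1;
    out[1] = open("/dev/null", O_WRONLY);
    if (out[1] == -1) {
        close(out[0]);
        return -1;
    }
    return 0;
}
```

In a process that has only descriptors 0 to 2 open, the two values are typically 3 and 4, because the kernel hands out the lowest unused number. Either value can then be passed to `write` as its first argument.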
In the sample code, the parts corresponding to this step are:
mov rdi, 1
mov rsi, rsp
mov rdx, 4
In step 3, you issue the `syscall` instruction.
This makes the CPU stop executing the application's code, switch to a mode in which it executes kernel code, and jump to the code inside the kernel that handles system calls (the system call handler).
When the `syscall` instruction returns, the return value is in the `rax` register and can be used as needed.
This part is explained in detail later, in the section "System call implementation method and internal implementation".
In the sample code, the parts corresponding to this step are:
syscall
Here is the sample code again, with comments explaining what each part does. The important part is the `syscall` line and the lines just before it.
.intel_syntax noprefix # Assembly syntax to use
.global main # Label of the execution start point
main:
# Prologue
push rbp
mov rbp, rsp
# Put the string 'Hi!' and a newline on the stack
push 0xA216948
# Step 1: specify system call number 1 (= write)
mov rax, 1
# Step 2: set the arguments for the write system call in registers
mov rdi, 1 # File descriptor to write to: 1 (standard output)
mov rsi, rsp # Start address of the string: the top of the stack (where 'H' is)
mov rdx, 4 # Size of the string to write, in bytes
# Step 3: invoke the system call
syscall
# Epilogue
mov rsp, rbp
pop rbp
ret
A little more on the part of step 2 that passes the start address of the string:
mov rsi, rsp # Start address of the string: the top of the stack (where 'H' is)
The string is stored on the stack, and `rsp` is the special register that points to the top of the stack, that is, to the most recently pushed value. So copying the value of `rsp` into `rsi` passes the start address of the string to the system call.
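For comparison, the same three steps can be sketched from C with GCC-style inline assembly. This assumes x86-64 Linux and a GCC or Clang compiler; the function name `raw_write` is made up for this example and it is not how the article's sample is built.

```c
#include <stddef.h>

/* x86-64 Linux only: invoke write directly with the syscall instruction. */
long raw_write(int fd, const void *buf, size_t count) {
    long ret;
    __asm__ volatile(
        "syscall"                    /* step 3: enter the kernel */
        : "=a"(ret)                  /* the return value comes back in rax */
        : "a"(1L),                   /* step 1: rax = 1 (write) */
          "D"((long)fd),             /* step 2: rdi = file descriptor */
          "S"(buf),                  /*         rsi = start address */
          "d"(count)                 /*         rdx = size in bytes */
        : "rcx", "r11", "memory");   /* syscall overwrites rcx and r11 */
    return ret;
}
```

`raw_write(1, "Hi!\n", 4)` prints the same text as the assembly sample and returns the number of bytes written. On failure the raw instruction returns a negative error number in `rax` (for example -9 for a bad file descriptor) rather than setting `errno`.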
System calls are usually made available through a C library. When using system calls you can normally use this wrapper library and do not need to write assembly code yourself.
For example, [write(2)](http://man7.org/linux/man-pages/man2/write.2.html) is defined as a C function with the following signature:
ssize_t write(int fd, const void *buf, size_t count);
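As a rough sketch of what using the wrapper library looks like, assuming Linux with glibc: besides per-call wrappers such as `write`, the C library also offers a generic `syscall` function that takes the system call number explicitly. The helper names `write_via_syscall` and `errno_for_bad_fd` are made up for this example.

```c
#define _GNU_SOURCE /* makes unistd.h declare syscall() in strict ISO C mode */
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke write through the generic libc entry point, passing the number explicitly. */
long write_via_syscall(int fd, const void *buf, size_t count) {
    return syscall(SYS_write, fd, buf, count);
}

/* Returns the errno value the write wrapper sets for an invalid descriptor. */
int errno_for_bad_fd(void) {
    errno = 0;
    if (write(-1, "x", 1) == -1)
        return errno;
    return 0;
}
```

The wrappers follow the C convention of returning -1 and storing the error code in `errno`, so `errno_for_bad_fd()` yields `EBADF`. That translation is one of the things the library does on top of the bare `syscall` instruction.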
On the other hand, if you don't want to use the C library, you have to implement the system call invocation yourself in assembly code, as in the example above. The Go language, for example, takes this approach.
golang/sys/unix/asm_linux_amd64.s
TEXT ·SyscallNoError(SB),NOSPLIT,$0-48
CALL runtime·entersyscall(SB)
MOVQ a1+8(FP), DI
MOVQ a2+16(FP), SI
MOVQ a3+24(FP), DX
MOVQ $0, R10
MOVQ $0, R8
MOVQ $0, R9
MOVQ trap+0(FP), AX // syscall entry
SYSCALL
MOVQ AX, r1+32(FP)
MOVQ DX, r2+40(FP)
CALL runtime·exitsyscall(SB)
RET
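To supplement the above code: unfamiliar register names such as `AX` refer to registers such as `rax`. On x86, `AX` is the name of the [low 16 bits](https://en.wikibooks.org/wiki/X86_Assembly/X86_Architecture) of `rax`; Go's assembler uses these short names for the whole register, and the `Q` suffix of `MOVQ` selects the 64-bit operand size.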
When a system call is issued, the CPU **jumps** from the code it is currently executing to code in the kernel. How is this ability to jump elsewhere in the middle of execution realized on the CPU? To understand this, you first need to understand how the CPU executes code (machine language).
The CPU executes machine language by repeating the following steps:
1. Read the instruction at the address held in the program counter register.
2. Execute the instruction.
3. Update the program counter register to the address of the next instruction.

An instruction is a byte string that the CPU interprets as a single operation, and it corresponds one-to-one to a line of assembly code. For example, for the assembly code above, the machine language and the instructions correspond as follows.
# objdump -d -M intel ./hi.o | grep syscall -B 4
66c: 48 c7 c0 01 00 00 00 mov rax,0x1
673: 48 c7 c7 01 00 00 00 mov rdi,0x1
67a: 48 89 e6 mov rsi,rsp
67d: 48 c7 c2 04 00 00 00 mov rdx,0x4
684: 0f 05 syscall
The register that stores the address of the instruction being executed is called the program counter register or instruction pointer register. Each time an instruction is executed, it is advanced to the address of the next instruction, which is why code is executed sequentially.
Jumping to a different place in the code is done by issuing an instruction that rewrites the program counter register directly (for example, `jmp`). This is what makes possible not only jumping into the kernel on a system call but, more generally, conditional branches, loops, function calls, and so on.
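The fetch, execute, and update cycle, and the way a jump is nothing more than a write to the program counter, can be sketched with a toy interpreter in C. Everything here (the three opcodes, `struct instr`, `run`, `run_demo`) is invented for illustration and has nothing to do with real x86 encodings.

```c
#include <stddef.h>

enum opcode { OP_ADD, OP_JMP, OP_HALT };

struct instr {
    enum opcode op;
    long operand; /* value to add, or index to jump to */
};

/* Runs a toy program and returns the accumulator when OP_HALT is reached. */
long run(const struct instr *program, size_t len) {
    size_t pc = 0; /* toy program counter: index of the next instruction */
    long acc = 0;
    while (pc < len) {
        struct instr in = program[pc]; /* fetch */
        pc++;                          /* default: advance to the next instruction */
        switch (in.op) {               /* execute */
        case OP_ADD:
            acc += in.operand;
            break;
        case OP_JMP:
            pc = (size_t)in.operand;   /* a jump simply overwrites the program counter */
            break;
        case OP_HALT:
            return acc;
        }
    }
    return acc;
}

/* Adds 1 and 2 while jumping over the instruction that would add 100. */
long run_demo(void) {
    const struct instr program[] = {
        {OP_ADD, 1},   /* 0 */
        {OP_JMP, 3},   /* 1: skip index 2 */
        {OP_ADD, 100}, /* 2: never executed */
        {OP_ADD, 2},   /* 3 */
        {OP_HALT, 0},  /* 4 */
    };
    return run(program, sizeof program / sizeof program[0]);
}
```

`run_demo()` returns 3: the instruction at index 2 is never fetched because the jump changed `pc` before the loop came around again.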
The general flow of system call processing is as follows:
1. Jump from the user process into the kernel.
2. Process the system call inside the kernel.
3. Return from the kernel to the user process.

Steps 1 and 3, jumping into and returning from the kernel, are done by the dedicated x86-64 instructions `syscall` and `sysretq`.
In other words, this part is implemented in hardware, by logic circuits in the CPU designed to the x86-64 specification.
Step 2 is done by code in the kernel called the system call handler. The handler dispatches to the implementation of each system call based on the system call number.
In a little more detail, system call processing breaks down as follows. Executing the `syscall` instruction reads the value of the register that holds the address of the system call handler into the program counter register. The handler then runs and dispatches to the implementation of the requested system call. Finally, the `sysretq` instruction returns execution to the original code in the user process. The original article illustrates this with a figure.
The elements labeled (1) to (5) are explained below.
When the CPU executes the `syscall` instruction, it roughly does the following:
1. Save the value of the `RFLAGS` register, which represents the current state of the CPU, to the `R11` register.
2. Mask the `RFLAGS` register with the value of the `IA32_FMASK MSR` register and switch to a mode in which the CPU can execute kernel code.
3. Save the value of the program counter register `RIP` to the `RCX` register.
4. Read the value of the `IA32_LSTAR MSR` register into the program counter register `RIP` and jump to the system call handler.

**1.** The `RFLAGS` register represents the current state of the CPU, for example whether interrupts are enabled and the I/O privilege level, that is, which [protection ring](https://ja.wikipedia.org/wiki/%E3%83%AA%E3%83%B3%E3%82%B0%E3%83%97%E3%83%AD%E3%83%86%E3%82%AF%E3%82%B7%E3%83%A7%E3%83%B3) is allowed to execute I/O instructions. Because the original state has to be restored when returning from the system call to the user process, the current value is saved in the `R11` register.
**2.** Masking the `RFLAGS` register with the value of the `IA32_FMASK MSR` register clears the flags that should not stay set while kernel code runs. Strictly speaking, the transition to privilege level 0 (the mode in which kernel code can be executed) happens because `syscall` also loads the kernel code segment selector from the `IA32_STAR MSR` register; the current privilege level is held in the code segment register, not in `RFLAGS`.
Bits 12 and 13 of `RFLAGS` hold the I/O privilege level (IOPL). To set them to `00`, along with the other masked flags, `syscall` performs the following operation:
RFLAGS ← RFLAGS AND NOT(IA32_FMASK);
The value of the `IA32_FMASK` register, which acts as the mask (note that `NOT` is applied to it), is set in the kernel by the following code.
linux/arch/x86/kernel/cpu/common.c
/* Flags to clear on syscall */
wrmsrl(MSR_SYSCALL_MASK,
X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
`X86_EFLAGS_IOPL` above is the part of the mask responsible for clearing bits 12 and 13 to `00`, and you can see that it is indeed defined as a value in which only bits 12 and 13 are 1.
linux/arch/x86/include/uapi/asm/processor-flags.h
#define X86_EFLAGS_IOPL_BIT 12 /* I/O Privilege Level (2 bits) */
#define X86_EFLAGS_IOPL (_AC(3,UL) << X86_EFLAGS_IOPL_BIT)
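The effect of `RFLAGS AND NOT(IA32_FMASK)` can be checked with ordinary integer arithmetic. The sketch below is user-space C, not kernel code: the `FLAG_*` macros and `apply_fmask` are names invented here, with bit positions taken from the x86 flag layout (TF = bit 8, IF = bit 9, DF = bit 10, IOPL = bits 12 and 13, NT = bit 14, AC = bit 18).

```c
#include <stdint.h>

/* Bit positions follow the x86 RFLAGS layout. */
#define FLAG_TF   (UINT64_C(1) << 8)
#define FLAG_IF   (UINT64_C(1) << 9)
#define FLAG_DF   (UINT64_C(1) << 10)
#define FLAG_IOPL (UINT64_C(3) << 12) /* two bits: 12 and 13 */
#define FLAG_NT   (UINT64_C(1) << 14)
#define FLAG_AC   (UINT64_C(1) << 18)

/* The same set of flags the kernel excerpt writes to the mask register. */
#define SYSCALL_MASK \
    (FLAG_TF | FLAG_DF | FLAG_IF | FLAG_IOPL | FLAG_AC | FLAG_NT)

/* Models RFLAGS <- RFLAGS AND NOT(mask): every bit set in the mask is cleared. */
uint64_t apply_fmask(uint64_t rflags, uint64_t fmask) {
    return rflags & ~fmask;
}
```

For instance, a flags value of `0x3202` (IOPL bits set, interrupts enabled, plus the always-set bit 1) becomes `0x2` after masking: bits 12 and 13 are now `00` and the interrupt flag is cleared, which matches the "Interrupts are off on entry" comment in the handler.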
**3.** The value of the program counter register `RIP` is saved to the `RCX` register. Without this, the location of the original instruction (= the program counter value) would be unknown when returning from the system call handler to the user process.
**4.** The address of the system call handler stored in `IA32_LSTAR MSR` is read into the program counter register `RIP`, and the CPU jumps to the system call handler. How the handler address gets into `IA32_LSTAR MSR` is covered in (5).
For the rest of the CPU's behavior when `syscall` is executed, see the reference for the `syscall` instruction.
The system call handler is the code in the kernel that runs after the `syscall` instruction is executed, and it is where the system call is processed.
Here we look at the preprocessing part of the handler: the part that packs the values passed in registers as arguments into a structure and hands it to the subsequent processing.
First, the entry point of the system call handler looks like this:
linux/arch/x86/entry/entry_64.S
ENTRY(entry_SYSCALL_64)
UNWIND_HINT_EMPTY
/*
* Interrupts are off on entry.
Various things are done in `entry_SYSCALL_64`. One of them is constructing, on the stack, a structure whose fields hold the values passed in registers, as shown below.
linux/arch/x86/entry/entry_64.S
/* Construct struct pt_regs on stack */
pushq $__USER_DS /* pt_regs->ss */
pushq PER_CPU_VAR(cpu_tss_rw + TSS_sp2) /* pt_regs->sp */
pushq %r11 /* pt_regs->flags */
pushq $__USER_CS /* pt_regs->cs */
pushq %rcx /* pt_regs->ip */
GLOBAL(entry_SYSCALL_64_after_hwframe)
pushq %rax /* pt_regs->orig_ax */
PUSH_AND_CLEAR_REGS rax=$-ENOSYS
In the above code, the values of the `rcx` and `r11` registers (used to save state when the `syscall` instruction was executed) and of the `rax` register are stored into the structure `pt_regs` on the stack. The values of registers such as `rdi` and `rsi`, which hold the system call arguments, are stored into the structure by the [macro `PUSH_AND_CLEAR_REGS`](https://github.com/torvalds/linux/blob/bfeffd155283772bbe78c6a05dec7c0128ee500c/arch/x86/entry/calling.h#L100-L145) on the last line.
The structure created above and the system call number are then passed to the function `do_syscall_64`, which processes the system call.
linux/arch/x86/entry/entry_64.S#L173-L175
movq %rax, %rdi
movq %rsp, %rsi
call do_syscall_64 /* returns with IRQs disabled */
Arguments are passed to `do_syscall_64` in registers. By convention, function arguments on x86-64 are passed in the `rdi`, `rsi`, `rdx`, `rcx`, `r8`, and `r9` registers, in order from the first argument.
So to pass arguments to the function `do_syscall_64`, which has the following signature,
__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
you set the values you want to pass in the two registers `rdi` and `rsi`.
- `movq %rax, %rdi` copies the system call number held in `rax` into the `rdi` register, which corresponds to the first argument.
- `movq %rsp, %rsi` copies the value of the `rsp` register, which points to the top of the stack (that is, the start address of the structure on the stack), into the `rsi` register, which corresponds to the second argument.
The implementation of each system call is called from within `do_syscall_64`. Specifically, the array `sys_call_table` contains the functions that implement each system call, and indexing it with the system call number dispatches to the corresponding implementation.
The implementation of each system call is defined with a `SYSCALL_DEFINE*` macro, which you can use as a landmark when searching.
Let's take a look at the relevant kernel code.
The `do_syscall_64` function receives the system call number as its first argument `nr` and, as its second argument `regs`, the structure holding the register values at the time the system call was invoked.
__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
struct thread_info *ti;
enter_from_user_mode();
local_irq_enable();
`sys_call_table`, defined elsewhere in the kernel, is an array of the functions that implement each system call. Indexing the array with the system call number `nr` and passing the structure `regs` dispatches to the processing of that system call.
if (likely(nr < NR_syscalls)) {
nr = array_index_nospec(nr, NR_syscalls);
regs->ax = sys_call_table[nr](regs);
}
`regs->ax = sys_call_table[nr](regs);` is the line that dispatches. The return value of the function is stored in the field of the structure that corresponds to the `AX` register (= the `rax` register); after the `syscall` instruction returns, this value is what ends up in the `rax` register as the return value.
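The table-based dispatch can be sketched in ordinary user-space C as an array of function pointers indexed by number. This is a toy model, not the kernel's code: `struct toy_regs`, `toy_read`, `toy_write`, `toy_call_table`, and `toy_dispatch` are all invented names, and the kernel's `array_index_nospec` hardening step is left out.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for struct pt_regs: only the fields this sketch needs. */
struct toy_regs {
    long ax;         /* slot for the return value */
    long di, si, dx; /* first three arguments */
};

typedef long (*toy_syscall_fn)(const struct toy_regs *);

/* Pretend implementations: read returns 0 bytes, write reports all bytes written. */
static long toy_read(const struct toy_regs *regs) { (void)regs; return 0; }
static long toy_write(const struct toy_regs *regs) { return regs->dx; }

/* Index 0 is read and index 1 is write, mirroring the x86-64 numbering. */
static const toy_syscall_fn toy_call_table[] = { toy_read, toy_write };
#define TOY_NR_SYSCALLS (sizeof toy_call_table / sizeof toy_call_table[0])

/* Looks up handler number nr and stores its result in regs->ax. */
long toy_dispatch(unsigned long nr, struct toy_regs *regs) {
    regs->ax = -ENOSYS; /* stays in place when the number is unknown */
    if (nr < TOY_NR_SYSCALLS)
        regs->ax = toy_call_table[nr](regs);
    return regs->ax;
}
```

Dispatching number 1 with `dx` set to 4 returns 4, while an out-of-range number leaves `-ENOSYS` in the `ax` slot, in the same spirit as the `rax=$-ENOSYS` default set up by the handler's entry code.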
Each element of the `sys_call_table` array is a function that actually processes a system call. For example, `write` is implemented as follows.
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
size_t, count)
{
return ksys_write(fd, buf, count);
}
In general, the implementation of each system call is defined with a `SYSCALL_DEFINE*` macro, so you can use it as a landmark to find each implementation.
For more information on `sys_call_table` and the `SYSCALL_DEFINE*` macros, see [this article](https://lwn.net/Articles/604287/).
The return from the system call handler to the original code in the user process is done by the `sysretq` instruction. What `sysretq` does is roughly the reverse of `syscall`:
- It loads the value of the `R11` register (which holds the saved original value of `RFLAGS`) into the `RFLAGS` register, restoring the original flags, and returns the CPU to the mode in which user processes run (privilege level 3).
- It loads the value of the `RCX` register (which holds the saved original value of `RIP`) into the program counter register `RIP`, returning to the point just after the instruction that invoked `syscall`.

Various other things are done as well; see the instruction reference for details.
The code in the kernel that invokes `sysretq`, which is the last code reached in the system call handler, is below:
linux/arch/x86/entry/entry_64.S
popq %rdi
popq %rsp
USERGS_SYSRET64
END(entry_SYSCALL_64)
`sysretq` is invoked by `USERGS_SYSRET64` here.
linux/arch/x86/include/asm/irqflags.h
#define USERGS_SYSRET64 \
swapgs; \
sysretq;
When the `syscall` instruction was executed, the address of the system call handler was read from the `IA32_LSTAR MSR` register into the program counter register `RIP`, and the CPU jumped to the handler. So how does the `IA32_LSTAR MSR` register come to hold the address of the system call handler?
The address is written to the `IA32_LSTAR MSR` register when the CPU is initialized.
Below we look at the kernel code for this initialization.
CPU initialization is done in the following function:
linux/arch/x86/kernel/cpu/common.c
/*
* cpu_init() initializes state that is per-CPU. Some data is already
* initialized (naturally) in the bootstrap process, such as the GDT
* and IDT. We reload them nevertheless, this function acts as a
* 'CPU state barrier', nothing should get across.
*/
#ifdef CONFIG_X86_64
void cpu_init(void)
{
The address is set by `syscall_init();`, which is called from this function.
linux/arch/x86/kernel/cpu/common.c
void syscall_init(void)
{
wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
In the code above,
wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
writes the address of the system call handler to the `IA32_LSTAR MSR` register.
As for the constants `MSR_LSTAR` and `entry_SYSCALL_64`: `MSR_LSTAR` specifies the `IA32_LSTAR MSR` register as the register to write to.
linux/arch/x86/include/asm/msr-index.h
#define MSR_LSTAR 0xc0000082 /* long mode SYSCALL target */
`entry_SYSCALL_64` is the [system call handler entry point](https://github.com/torvalds/linux/blob/cd6c84d8f0cdc911df435bb075ba22ce3c605b07/arch/x86/entry/entry_64.S#L145-L147), for which a prototype is declared elsewhere in the kernel.
When you look into system calls, you often come across the concept of interrupts.
For example, a comment in the kernel code below says `Interrupts are off`. What exactly are interrupts, and what do they have to do with system calls?
linux/arch/x86/entry/entry_64.S
ENTRY(entry_SYSCALL_64)
UNWIND_HINT_EMPTY
/*
* Interrupts are off on entry.
* We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
* it is too small to ever cause noticeable irq latency.
*/
First, on the relationship between the two: a system call can be regarded as a kind of interrupt. When an interrupt occurs, the CPU suspends the code it is currently executing (whether in a user process or in the kernel), saves the execution state so that it can be resumed later, and jumps to specific code in the kernel (the interrupt handler). This is much the same as the system call flow we have seen so far.
Interrupts arise in two ways: those caused by software are called software interrupts, and those caused by hardware are called hardware interrupts.
Software interrupt examples:
- Invoking a system call
- Handling an exception

Hardware interrupt examples:
- Responding to keyboard input
- Handling the arrival of a packet at a network card
Thanks to the interrupt mechanism, it is possible to implement behavior of the form "when something happens, the CPU unconditionally jumps to specific code and handles it", which makes it possible, for example, to respond quickly to input from hardware.
Note that x86-64 has dedicated instructions for system calls (`syscall` and `sysretq`), but on x86-32, the previous generation, and on some other architectures, system calls are realized through the interrupt mechanism.
For example, on x86-32 a system call was made by executing the [`int` instruction](https://www.felixcloutier.com/x86/intn:into:int3:int1) with vector `0x80`, that is, `int 0x80`.
Part of the system call handler runs with interrupts off, that is, `irq off`. `irq` stands for interrupt request; when the CPU receives one, an interrupt occurs.
In general, some or all `irq`s may be disabled while an interrupt handler is handling an interrupt. This prevents an interrupt from occurring in the middle of handling another interrupt. While `irq`s are off, interrupts cannot be accepted, so the system cannot respond to keyboard input, for example. Code that runs with `irq`s off should therefore be short enough that it does not take long to run.
If you would like to know more about interrupts, please refer to this page.
Here are some resources and keywords to help you find out more about system calls and kernels. This section also serves as a list of the literature I used as references when writing this article.
I referred to an online instruction reference when researching instructions. For an introduction to x86-64 assembly code, I also recommend Introduction to C Compiler Creation for Those Who Want to Know Lower Layers.
Linux Kernel Development is highly rated and is probably the best book explaining the kernel's internal implementation. The content is not easy (at least for me), but I recommend it because the writing is at times entertaining and the explanations are very careful. Even if you are not interested in the kernel, Chapter 10, which explains techniques for concurrent programming in the kernel, may be useful in many ways. I also recommend this note, which summarizes the contents of LKD.
Actually, writing this article was the first time I read the kernel code properly.
My impression is that, while of course I can't understand everything in the code, following the general flow of processing is not that difficult. Quite a few parts have careful comments.
To read it, I searched for keywords on GitHub (press the `/` key) or used the jump-to-definition feature of an online source browser.
I'm an amateur when it comes to kernels, so there may well be better ways to read it.
I think processes, virtual memory, and multitasking are the basic OS concepts worth getting a firm grasp of. Recommended keywords:
- Process
  - Program counter
  - Stack pointer
  - Process control block
- Virtual memory
  - Virtual address
  - Physical address
  - Memory management unit
  - Paging
- Multitasking
  - Process state transition
  - Context switch
  - Process scheduling