Linux (kernel) source analysis: calling a system call

Kernel version examined

linux-4.2 (mainline)

Where to start

As one way into the kernel source, this article follows what the kernel does when a system call is invoked. (It continues from the previous article, "I looked at system call entries. (linux source analysis)".)

Where to look first? For the time being I searched the source using the system call entry table (sys_call_table) from the previous article as the key. The hits naturally included assembler sources and were hard to sort through, so I searched the net for something that could serve as a starting point. I decided to work from the site "Assembly Programming Linux (system call)" as my way in.

I also referred to the following site.

- [Linux Kernel Documents > Wiki > 2.3 Hardware interrupt processing](https://osdn.jp/projects/linux-kernel-docs/wiki/2.3%E3%80%80%E3%83%8F%E3%83%BC%E3%83%89%E3%82%A6%E3%82%A7%E3%82%A2%E5%89%B2%E3%82%8A%E8%BE%BC%E3%81%BF%E5%87%A6%E7%90%86)

In the rest of this article, wherever I type "search" to search the source, I am using a shell function defined in my .bashrc. The definition is as follows.

# Grep the C, header, and assembler sources under the current directory for $1.
function search() {
        find . \( -name \*.c -o -name \*.h -o -name \*.S \) -exec grep -n "$1" {} /dev/null \;
}

System call handling function registration process

The first thing I checked was where the system call handler is registered, following the reference site. According to "Assembly Programming Linux (system call)", the source file /usr/src/linux/arch/i386/kernel/traps.c contains the following code:

void __init trap_init(void)
	set_system_gate(SYSCALL_VECTOR,&system_call);

That is what the site describes, so I looked for it in the source I am examining. However, the kernel version covered by that article (linux-2.2.16) is old, and the arch/i386 directory no longer exists. There must still be a place where the handler for "int 0x80" is registered, so I searched for the keyword SYSCALL_VECTOR.

kou77@ubuntu:~/linux-4.2$ search SYSCALL_VECTOR
./arch/x86/include/asm/irq_vectors.h:49:#define IA32_SYSCALL_VECTOR             0x80
./arch/x86/kernel/traps.c:896:  set_system_intr_gate(IA32_SYSCALL_VECTOR, entry_INT80_compat);
./arch/x86/kernel/traps.c:897:  set_bit(IA32_SYSCALL_VECTOR, used_vectors);
./arch/x86/kernel/traps.c:901:  set_system_trap_gate(IA32_SYSCALL_VECTOR, entry_INT80_32);
./arch/x86/kernel/traps.c:902:  set_bit(IA32_SYSCALL_VECTOR, used_vectors);
./arch/x86/kernel/irqinit.c:186:                /* IA32_SYSCALL_VECTOR could be used in trap_init already. */
./arch/x86/lguest/boot.c:93:    .syscall_vec = IA32_SYSCALL_VECTOR,
./arch/x86/lguest/boot.c:869:           if (i != IA32_SYSCALL_VECTOR)
./arch/m32r/include/asm/syscall.h:5:#define SYSCALL_VECTOR          "2"
./arch/m32r/include/asm/syscall.h:6:#define SYSCALL_VECTOR_ADDRESS  "0xa0"
./drivers/lguest/interrupts_and_traps.c:23:static unsigned int syscall_vector = IA32_SYSCALL_VECTOR;
./drivers/lguest/interrupts_and_traps.c:336:    /* Normal Linux IA32_SYSCALL_VECTOR or reserved vector? */
./drivers/lguest/interrupts_and_traps.c:337:    return num == IA32_SYSCALL_VECTOR || num == syscall_vector;
./drivers/lguest/interrupts_and_traps.c:354:    if (syscall_vector != IA32_SYSCALL_VECTOR) {
./drivers/lguest/interrupts_and_traps.c:369:    if (syscall_vector != IA32_SYSCALL_VECTOR)

The name has changed slightly here as well: it is now IA32_SYSCALL_VECTOR. Of the hits above, I looked at ./arch/x86/kernel/traps.c. The trap_init function was easy to find, and the lines that use IA32_SYSCALL_VECTOR are where the system call handler I was looking for is registered.

void __init trap_init(void)
{
        int i;

#ifdef CONFIG_EISA
        void __iomem *p = early_ioremap(0x0FFFD9, 4);

        if (readl(p) == 'E' + ('I'<<8) + ('S'<<16) + ('A'<<24))
                EISA_bus = 1;
        early_iounmap(p, 4);
#endif

        set_intr_gate(X86_TRAP_DE, divide_error);
        set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
        /* int4 can be called from all */
        set_system_intr_gate(X86_TRAP_OF, &overflow);
        set_intr_gate(X86_TRAP_BR, bounds);
        set_intr_gate(X86_TRAP_UD, invalid_op);
        set_intr_gate(X86_TRAP_NM, device_not_available);
#ifdef CONFIG_X86_32
        set_task_gate(X86_TRAP_DF, GDT_ENTRY_DOUBLEFAULT_TSS);
#else
        set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
#endif
        set_intr_gate(X86_TRAP_OLD_MF, coprocessor_segment_overrun);
        set_intr_gate(X86_TRAP_TS, invalid_TSS);
        set_intr_gate(X86_TRAP_NP, segment_not_present);
        set_intr_gate(X86_TRAP_SS, stack_segment);
        set_intr_gate(X86_TRAP_GP, general_protection);
        set_intr_gate(X86_TRAP_SPURIOUS, spurious_interrupt_bug);
        set_intr_gate(X86_TRAP_MF, coprocessor_error);
        set_intr_gate(X86_TRAP_AC, alignment_check);
#ifdef CONFIG_X86_MCE
        set_intr_gate_ist(X86_TRAP_MC, &machine_check, MCE_STACK);
#endif
        set_intr_gate(X86_TRAP_XF, simd_coprocessor_error);

        /* Reserve all the builtin and the syscall vector: */
        for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
                set_bit(i, used_vectors);

#ifdef CONFIG_IA32_EMULATION
        set_system_intr_gate(IA32_SYSCALL_VECTOR, entry_INT80_compat);
        set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif

#ifdef CONFIG_X86_32
        set_system_trap_gate(IA32_SYSCALL_VECTOR, entry_INT80_32);
        set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif
	//The following is omitted ...
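
To picture what these registration calls do, here is a toy user-space model in C. It is not kernel code: it only assumes that the IDT can be thought of as a table of handlers indexed by vector number, with used_vectors as a bitmap beside it. Every toy_ name is hypothetical, and a real gate is a descriptor that also carries a type and a privilege level, which this sketch leaves out.

```c
#include <stddef.h>

#define TOY_NR_VECTORS     256
#define TOY_SYSCALL_VECTOR 0x80   /* same value as IA32_SYSCALL_VECTOR above */

typedef void (*toy_handler_t)(void);

/* Hypothetical stand-ins for the IDT and the used_vectors bitmap. */
toy_handler_t toy_idt[TOY_NR_VECTORS];
unsigned char toy_used_vectors[TOY_NR_VECTORS / 8];

/* Record a handler for a vector and mark the vector as taken. */
int toy_set_gate(unsigned int vector, toy_handler_t handler)
{
    if (vector >= TOY_NR_VECTORS || handler == NULL)
        return -1;
    toy_idt[vector] = handler;
    toy_used_vectors[vector / 8] |= (unsigned char)(1u << (vector % 8));
    return 0;
}

/* Return 1 if the vector has been marked as used, 0 otherwise. */
int toy_vector_used(unsigned int vector)
{
    if (vector >= TOY_NR_VECTORS)
        return 0;
    return (toy_used_vectors[vector / 8] >> (vector % 8)) & 1;
}

/* Stand-in for entry_INT80_32: it only counts how often it was entered. */
int toy_int80_hits;
void toy_entry_int80(void) { toy_int80_hits++; }

/* Mirrors the step in trap_init() that hooks vector 0x80. */
int toy_trap_init(void)
{
    return toy_set_gate(TOY_SYSCALL_VECTOR, toy_entry_int80);
}

/* Simulates "int N": look up the vector and run its handler if one is set. */
int toy_raise(unsigned int vector)
{
    if (vector >= TOY_NR_VECTORS || toy_idt[vector] == NULL)
        return -1;
    toy_idt[vector]();
    return 0;
}
```

In this model, calling toy_raise(0x80) before toy_trap_init() fails because no handler is registered yet, and succeeds afterwards.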

I have not looked into entry_INT80_compat, but I searched for entry_INT80_32 and read the source around the hits.

kou77@ubuntu:~/linux-4.2/arch/x86$ search entry_INT80_32
./include/asm/proto.h:12:void entry_INT80_32(void);
./kernel/traps.c:901:   set_system_trap_gate(IA32_SYSCALL_VECTOR, entry_INT80_32);
./entry/entry_32.S:408:ENTRY(entry_INT80_32)
./entry/entry_32.S:505:ENDPROC(entry_INT80_32)

./entry/entry_32.S contains the following code.

ENDPROC(entry_SYSENTER_32)

        # system call handler stub
ENTRY(entry_INT80_32)
        ASM_CLAC
        pushl   %eax                            # save orig_eax
        SAVE_ALL
        GET_THREAD_INFO(%ebp)
                                                # system call tracing in operation / emulation
        testl   $_TIF_WORK_SYSCALL_ENTRY, TI_flags(%ebp)
        jnz     syscall_trace_entry
        cmpl    $(NR_syscalls), %eax
        jae     syscall_badsys
syscall_call:
        call    *sys_call_table(, %eax, 4)
syscall_after_call:
	#The following is omitted ...
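
As I read it, the instructions around syscall_call amount to a bounds check followed by an indirect call through the table. The following is a minimal C sketch under that reading. The table contents and all toy_ names are made up for illustration, and the real path also deals with tracing and the return to user mode, which are not modelled here.

```c
#include <errno.h>

#define TOY_NR_SYSCALLS 3

typedef long (*toy_syscall_fn)(long, long, long);

/* Three made-up handlers standing in for sys_XXXX functions. */
long toy_sys_zero(long a, long b, long c) { (void)a; (void)b; (void)c; return 0; }
long toy_sys_add(long a, long b, long c)  { (void)c; return a + b; }
long toy_sys_max(long a, long b, long c)  { (void)c; return a > b ? a : b; }

/* Hypothetical counterpart of sys_call_table: one function pointer per number. */
toy_syscall_fn const toy_sys_call_table[TOY_NR_SYSCALLS] = {
    toy_sys_zero,   /* number 0 */
    toy_sys_add,    /* number 1 */
    toy_sys_max,    /* number 2 */
};

/* eax carries the system call number; ebx, ecx, edx carry the arguments. */
long toy_dispatch(unsigned long eax, long ebx, long ecx, long edx)
{
    /* cmpl $(NR_syscalls), %eax ; jae syscall_badsys
     * jae is an unsigned comparison, so "negative" numbers are rejected too. */
    if (eax >= TOY_NR_SYSCALLS)
        return -ENOSYS;

    /* call *sys_call_table(, %eax, 4)
     * The assembler scales the index by 4 (the size of a 32-bit pointer);
     * C array indexing does the equivalent scaling for us. */
    return toy_sys_call_table[eax](ebx, ecx, edx);
}
```

For example, toy_dispatch(1, 2, 3, 0) goes through slot 1 and returns 5, while any number of 3 or more returns -ENOSYS without touching the table.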

Understanding the handler code above should show how the kernel receives the arguments the user specifies when invoking a system call. Since I have little experience reading assembler, I searched the net for interrupt-related information before reading it. The site I found is the following.

- Linux Kernel 2.4 Internals: Process and Interrupt Management

The following is a quote from that site. It also covers an old kernel version, but it is written very clearly.

2.11 How are system calls implemented on i386?

    lcall7/lcall27 call gates
    int 0x80 software interrupt

Binaries from other UNIX-like systems (Solaris, UnixWare 7, etc.) use the lcall7 mechanism, whereas native Linux programs use int 0x80. The name 'lcall7' is historically misleading, because it also covers lcall27 (for example, Solaris/x86), yet the handler function is called lcall7_func.

When the system boots, the function arch/i386/kernel/traps.c:trap_init() is called. It sets up the IDT so that vector 0x80 (type 15, dpl 3) points to the system call entry address in arch/i386/kernel/entry.S.

When a user-space application makes a system call, it puts the arguments in registers and executes the 'int 0x80' instruction. This traps into kernel mode, and the processor jumps to the system call entry point in entry.S, which does the following:

    1. Save the registers.
    2. Set %ds and %es to KERNEL_DS, so that all data (and extra segment) references are made in the kernel address space.
    3. If the value of %eax is greater than NR_syscalls (currently 256), fail with an ENOSYS error.
    4. If the task is being ptraced (tsk->ptrace & PF_TRACESYS), do special processing. This is to support programs such as strace (the counterpart of truss(1) in SVR4) and debuggers.
    5. Call sys_call_table+4*(syscall_number from %eax). This table is initialized in the same file (arch/i386/kernel/entry.S) and points to the individual system call handlers. On Linux the handlers are usually prefixed with sys_, for example sys_open and sys_exit. These C system call handlers find their arguments on the stack, where SAVE_ALL stored them.
    6. Enter the "system call return path". This is a separate label because it is used not only by int 0x80 but also by lcall7 and lcall27. It deals with handling tasklets (including bottom halves), checking whether schedule() is needed (tsk->need_resched != 0), checking whether signals are pending, and handling those signals.

As the article above says, "When a user-space application makes a system call, it puts its arguments in registers and executes the 'int 0x80' instruction."
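
The user side of this can be tried from C without writing assembler. The sketch below uses glibc's syscall() function, which takes a system call number followed by its arguments and loads them into registers before entering the kernel. Note the assumption: which instruction the library actually executes (int 0x80 or a newer entry mechanism) depends on the architecture and the library, so this shows the calling convention in spirit rather than int 0x80 itself. The raw_ function names are my own.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for our PID by number, bypassing the getpid() wrapper. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* write(2) by number: the three arguments travel in registers,
 * just as the quoted article describes for 'int 0x80'. */
long raw_write(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

raw_getpid() should return the same value as getpid(), and raw_write() on an invalid descriptor returns -1 with errno set, the same as write() would.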

The "save the registers" step appears to be done by SAVE_ALL in the entry_INT80_32 assembler code. Further down the page of the reference site "Assembly Programming Linux (system call)", mentioned at the beginning of this article, there is the following description.

/usr/src/linux/arch/i386/kernel/entry.S
#define SAVE_ALL \
        cld; \
        pushl %es; \
        pushl %ds; \
        pushl %eax; \
        pushl %ebp; \
        pushl %edi; \
        pushl %esi; \
        pushl %edx; \
        pushl %ecx; \
        pushl %ebx; \
        movl $(__KERNEL_DS),%edx; \
        movl %dx,%ds; \
        movl %dx,%es;

For readability, the site then shows the contents of SAVE_ALL merged into the system call entry code (system_call, the counterpart of entry_INT80_32 above). The quote follows.

It's hard to understand, so system_call is simplified and shown below. The arguments set in registers by the assembly code are pushed onto the stack, so they become the arguments of the sys_XXXX function, written in C, that is called as the system call.

  ENTRY(system_call)
          pushl %eax                      # save orig_eax
          cld;
          pushl %es; 
          pushl %ds; 
          pushl %eax; 
          pushl %ebp; 
          pushl %edi;                     #5th argument
          pushl %esi;                     #4th argument
          pushl %edx;                     #3rd argument
          pushl %ecx;                     #2nd argument
          pushl %ebx;                     #1st argument
          movl $(__KERNEL_DS),%edx; 
          movl %dx,%ds; 
          movl %dx,%es; 
          movl %esp, %ebx; 
          andl $-8192, %ebx; 
          cmpl $(NR_syscalls),%eax 
          jae badsys 
          testb $0x20,flags(%ebx)         # PF_TRACESYS 
          jne tracesys 
          call *SYMBOL_NAME(sys_call_table)(,%eax,4) 
                                          # call the handler for the system
                                          # call number held in %eax
          movl %eax,EAX(%esp)             # store the return value in the saved-eax slot on the stack
          popl %ebx; 
          popl %ecx; 
          popl %edx; 
          popl %esi; 
          popl %edi; 
          popl %ebp; 
          popl %eax;                      # the return value is now in %eax
          popl %ds; 
          popl %es; 
          addl $4,%esp;                   # discard the %eax pushed first (orig_eax)
          iret;                           # return from the system call
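
One way I find it helpful to read this listing is to write the pushed registers down as a C struct, lowest address first, so that the register pushed last (%ebx) comes first. The sketch below does that for the simplified system_call listing only. The struct and the toy_ names are hypothetical, 32-bit slots are assumed, and the call is modelled as an ordinary C call rather than the real stack-based hand-off.

```c
#include <stddef.h>
#include <stdint.h>

/* Stack contents after the pushes in the listing, lowest address first. */
struct toy_saved_regs {
    int32_t ebx;       /* 1st argument (pushed last)           */
    int32_t ecx;       /* 2nd argument                         */
    int32_t edx;       /* 3rd argument                         */
    int32_t esi;       /* 4th argument                         */
    int32_t edi;       /* 5th argument                         */
    int32_t ebp;
    int32_t eax;       /* call number, later the return value  */
    int32_t ds;
    int32_t es;
    int32_t orig_eax;  /* "pushl %eax  # save orig_eax" (pushed first) */
};

/* Made-up stand-in for a C handler such as sys_XXXX with three arguments. */
long toy_sys_sum3(long a, long b, long c)
{
    return a + b + c;
}

/* Models "call *sys_call_table(,%eax,4)" followed by "movl %eax,EAX(%esp)":
 * the handler's parameters are the lowest stack slots (saved ebx, ecx, edx),
 * and its result is written back into the saved eax slot. */
long toy_call_with_saved_regs(struct toy_saved_regs *regs)
{
    regs->eax = (int32_t)toy_sys_sum3(regs->ebx, regs->ecx, regs->edx);
    return regs->eax;
}
```

With this layout the saved eax slot sits 24 bytes above the saved ebx slot, which is what the EAX(%esp) operand in the listing has to address.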

SAVE_ALL in the source I am examining is as follows (excerpt from arch/x86/entry/entry_32.S).

.macro SAVE_ALL
        cld
        PUSH_GS
        pushl   %fs
        pushl   %es
        pushl   %ds
        pushl   %eax
        pushl   %ebp
        pushl   %edi
        pushl   %esi
        pushl   %edx
        pushl   %ecx
        pushl   %ebx
        movl    $(__USER_DS), %edx
        movl    %edx, %ds
        movl    %edx, %es
        movl    $(__KERNEL_PERCPU), %edx
        movl    %edx, %fs
        SET_KERNEL_GS %edx
.endm

It saves the registers that carry the system call arguments onto the stack in much the same way as described above. Incidentally, as I wrote in the previous article, a system call entry is declared like `asmlinkage long sys_fork(void)`, and asmlinkage is defined at ./arch/x86/include/asm/linkage.h:10 as `#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))`. I had assumed that `regparm(0)` meant the arguments are passed in registers, but judging from the explanation above they are passed on the stack, like an ordinary function call. Does the "0" in `regparm(0)` mean that nothing is passed in registers? (As I wrote in the previous article, regparm needs to be checked separately.)

I think I now understand how a system call traps into the kernel and gets dispatched. Next I plan to look at individual system calls, the scheduler, and memory management (process management and the scheduler first, memory management after).
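
As a partial answer to that question: GCC's documentation describes regparm(N) on x86-32 as passing up to N integer arguments in EAX, EDX, and ECX instead of on the stack, so regparm(0) would mean that no argument goes in a register. As far as I know the 32-bit kernel is otherwise built with -mregparm=3, which would explain why asmlinkage has to say so explicitly. The sketch below only shows the attribute syntax. The toy_ names are hypothetical, and on targets other than x86-32 the macros expand to nothing, because the attribute is specific to x86-32.

```c
/* regparm is an x86-32 attribute; define the macros away elsewhere. */
#if defined(__i386__) && defined(__GNUC__)
#define toy_asmlinkage __attribute__((regparm(0)))  /* all arguments on the stack */
#define toy_fastcall   __attribute__((regparm(3)))  /* up to 3 in EAX, EDX, ECX   */
#else
#define toy_asmlinkage
#define toy_fastcall
#endif

/* Callable from assembler that has pushed a and b onto the stack. */
toy_asmlinkage long toy_sub_stack(long a, long b)
{
    return a - b;
}

/* Same body, but on x86-32 the arguments arrive in registers. */
toy_fastcall long toy_sub_regs(long a, long b)
{
    return a - b;
}
```

From C both functions behave identically. The difference only shows in the generated code on x86-32, for example when comparing the output of gcc -m32 -S for the two.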
