Hack Linux fork system calls

I recently posted an article called "Introduction to Socket API to learn in C language" and wrote programs using TCP and UDP, so it is about time to move on to parallel processing and multiplexing, which are indispensable for network programming.

So this time I'm going to dig into multi-process programming. The key system call for multi-processing is fork. I already knew that fork is the system call that spawns a child process on Linux, and I more or less knew how to use it.

However, I didn't know how it is actually implemented, so I took this opportunity to read the source code properly.

There were times when the source code alone was hard to follow. In those cases I wrote a small program that calls the fork function, built it with gcc as a statically linked executable, and disassembled it with objdump. Then I did the very simple work of comparing the flow of the source code with the machine code that was actually generated.

I don't understand it 100%, and I'm not certain that my interpretation is correct, but I will keep investigating and update this post little by little so that it stays accurate.

If you are familiar with this area, I would be very grateful for any advice. I'm also looking for friends to read open source code together at a cafe.

Referenced source code

Linux kernel: 2.6.11
proc commands: procps-3.2.8
glibc: glibc-2.12.1

The CPU is x86_64.

What is a process?

A process is an address space for executing a program, together with the collection of information needed for that execution.

A program is usually stored in auxiliary storage such as an HDD; it is loaded into memory and executed as a process.

The most important program making up the OS is called the kernel, and processes are managed by the kernel loaded in memory.

On Linux the CPU runs in one of two modes, user mode and kernel mode. In kernel mode, system resources can be accessed without restriction, so processing that involves hardware or the system as a whole can be performed. In user mode, application processing that does not need such access runs in its own address space.

However, by issuing a system call, a process can temporarily switch the CPU to kernel mode and have privileged processing carried out on its behalf. A system call, by the way, is one of a specific set of operations that a process asks the OS to execute in kernel mode.

Now let's use the ps command with options to list and display the processes that currently exist on my Linux.

ps -ef f

root     21449     1  0 21:34 ?        Ss     0:00 php-fpm: master process (/etc/php-fpm.conf)
apache   21450 21449  0 21:34 ?        S      0:00  \_ php-fpm: pool www

A process keeps its information in a task_struct structure called the process descriptor, which has a member of type pid_t, the process ID, that corresponds one-to-one with the process. Here, 21449 and 21450 in the second column are process IDs.

The task_struct structure is defined in `include/linux/sched.h` in the Linux kernel, but since it is nearly 200 lines long I will refrain from quoting it. It's an important structure packed with information about the process, so it's a good idea to read it during the long autumn nights.

The third column shows the process ID of the process that spawned that process. A process is normally spawned by some parent process.

You can see that the parent of the process on the second line is the process on the first line, while the parent process ID on the first line is 1.

This is a process called init, which takes over like a foster parent when a process has no parent, for example because the parent died before the child. init is created by the kernel and is the ancestor of all processes.

Later we will run a program to confirm the situation where init becomes the parent, but first let's simply write and run a program that creates a child process.

c


#include <sys/types.h> /* pid_t */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {

    pid_t pid, ret;

    printf("start!\n");

    if((ret = fork()) == -1){
        perror("fork() failed.");
        exit(EXIT_FAILURE);
    }
    pid = getpid();

    printf("return from fork() is %d\n", ret);
    printf("pid is %d\n", pid);

    printf("end!\n");

    return 0;
}

First it prints start!, then the fork function issues the fork system call (as will be revealed later, internally it is actually clone). fork() returns -1 on failure, in which case the program terminates immediately.

If fork() succeeds, the getpid function obtains the ID of the current process and the program prints it. Note that getpid returns the tgid member, which is the process ID of the thread group leader, rather than the pid member, which is the ID belonging to the individual process descriptor.

The getpid system call invokes the sys_getpid function in the kernel, which is implemented as follows on line 967 of kernel/timer.c.

c:linux-2.6.11.1_kernel/timer.c


asmlinkage long sys_getpid(void)
{
    return current->tgid;
}

The variable current is a pointer to the process descriptor (the task_struct) of the current process, obtained in a processor-dependent way. You can see that the function returns the tgid member of the task_struct, not the pid member.

Keep this difference in the back of your mind, because it matters when you think about multithreaded, rather than multiprocess, programming.

The process ID shown by the ps command is also the tgid. The following is part of the source code of the ps command. To build the process list, it reads the per-process directories under /proc and stores the process ID in the tgid and tid members through p, a pointer to a proc_t structure.

c:procps-3.2.8_proc/readproc.c


static int simple_nextpid(PROCTAB *restrict const PT, proc_t *restrict const p) {
  static struct direct *ent;        /* dirent handle */
  char *restrict const path = PT->path;
  for (;;) {
    ent = readdir(PT->procfs);
    if(unlikely(unlikely(!ent) || unlikely(!ent->d_name))) return 0;
    if(likely( likely(*ent->d_name > '0') && likely(*ent->d_name <= '9') )) break;
  }
  p->tgid = strtoul(ent->d_name, NULL, 10); 
  p->tid = p->tgid;
  memcpy(path, "/proc/", 6);
  strcpy(path+6, ent->d_name);  // trust /proc to not contain evil top-level entries
  return 1;
}

fork returns 0 in the child process and the child's process ID in the parent process. Below are the program's output and the ps output taken while it was running.

Execution result


start!
return from fork() is 21526
pid is 21525
end!
return from fork() is 0
pid is 21526
end!

status of ps command


tajima   21181 21180  0 20:51 pts/0    Ss     0:00  |       \_ -bash
tajima   21525 21181  0 22:24 pts/0    S+     0:00  |           \_ ./a.out
tajima   21526 21525  0 22:24 pts/0    S+     0:00  |               \_ ./a.out

Which of the child and the parent runs first depends on the scheduler; in this case you can see that the parent ran first, then the child.

You will notice that start! is printed only once. This is because the new process starts executing immediately after the fork system call returns. The reason will become clear later, when we follow fork's internal processing.

Now let's change the previous program as follows. Whether we are in the parent or the child is determined by the return value of fork (ret); the parent sleeps for 10 seconds and the child for 30 seconds.

c


    pid = getpid();

    if(ret == 0){ 
        sleep(30);
    } else {
        sleep(10);
    }   

    printf("return from fork() is %d\n", ret);
    printf("pid is %d\n", pid);

If you check with the ps command immediately after starting the program,

tajima   22501 22438  0 21:32 pts/0    S+     0:00  |           \_ ./a.out
tajima   22502 22501  0 21:32 pts/0    S+     0:00  |               \_ ./a.out

The result is the same as before. But if you run the ps command again after the parent process wakes up from sleep and exits,

tajima   22502     1  0 21:32 pts/0    S      0:00 ./a.out

you can see that the parent of the child process is now the process with process ID 1, that is, init.

In some cases you may want the parent and the child to synchronize with each other.

There are several ways to do this, such as having the parent wait for the child to finish, or detecting the child's termination with a signal.

The program below calls the waitpid() system call to block the parent process until the child process terminates.

After the child process terminates, the parent process exits 10 seconds later.

c


#include <sys/wait.h>
#include <sys/types.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {

    pid_t pid, ret;

    printf("start\n");

    if((ret = fork()) == -1){
        perror("fork() failed.");
        exit(EXIT_FAILURE);
    }   
    pid = getpid();

    if(ret == 0){ 
        sleep(30);
    } else {
        if(waitpid(-1, NULL, 0) < 0) {
            perror("waitpid() failed.");
            exit(EXIT_FAILURE);
        }   
        sleep(10);
    }   

    printf("return from fork() is %d\n", ret);
    printf("pid is %d\n", pid);

}

waitpid() can return the child's exit status if you pass a pointer to int as the second argument, and you can change its behavior by OR-ing option flags into the third argument.

Here control returns to the parent when the child's state changes by terminating, but waitpid() can detect not only termination but also state changes such as a stop or a resume.

As usual, when I execute the ps command, the following is displayed for a while,

tajima   22438 22437  0 21:28 pts/0    Ss     0:00  |       \_ -bash
tajima   22609 22438  0 22:33 pts/0    S+     0:00  |           \_ ./a.out
tajima   22610 22609  0 22:33 pts/0    S+     0:00  |               \_ ./a.out

You can see that the following is displayed when the child process ends.

tajima   22438 22437  0 21:28 pts/0    Ss     0:00  |       \_ -bash
tajima   22609 22438  0 22:33 pts/0    S+     0:00  |           \_ ./a.out

As a supplement: a terminated child process remains a zombie process until its parent calls a wait-family system call to clean it up. If the parent is already gone, init becomes the parent of the orphaned zombie and lays its soul to rest by issuing a wait4 system call.

Hack fork

Now let's look at the implementation of fork. The following is the disassembly of a statically linked executable built from a program that calls the fork function.

fork.S


0000000000400494 <main>:
  400494:   55                      push   %rbp   
  400495:   48 89 e5                mov    %rsp,%rbp
  400498:   e8 f3 d1 00 00          callq  40d690 <__libc_fork>
  40049d:   b8 00 00 00 00          mov    $0x0,%eax
  4004a2:   c9                      leaveq 
  4004a3:   c3                      retq   

It calls a function with the symbol __libc_fork. Following __libc_fork in the same disassembly would show exactly what the CPU does, but it is hard to read cold, so let's first check the implementation in the C source code.

c:glibc-2.12.1_nptl/sysdeps/unix/sysv/linux/fork.c


pid_t
__libc_fork (void)
{
  pid_t pid;
  struct used_handler
  {
    struct fork_handler *handler;
    struct used_handler *next;
  } *allp = NULL;

  /* Run all the registered preparation handlers.  In reverse order.
     While doing this we build up a list of all the entries.  */
  struct fork_handler *runp;
  while ((runp = __fork_handlers) != NULL)
    {   
      /* Make sure we read from the current RUNP pointer.  */
      atomic_full_barrier (); 

      unsigned int oldval = runp->refcntr;

      if (oldval == 0)
    /* This means some other thread removed the list just after
       the pointer has been loaded.  Try again.  Either the list
       is empty or we can retry it.  */

/*abridgement*/

#ifdef ARCH_FORK
  pid = ARCH_FORK (); 
#else
# error "ARCH_FORK must be defined so that the CLONE_SETTID flag is used"
  pid = INLINE_SYSCALL (fork, 0); 
#endif
/*abridgement*/

If you follow the ARCH_FORK macro, you come across a macro called INLINE_SYSCALL, so let's follow that further.

c:glibc-2.12.1_nptl/sysdeps/unix/sysv/linux/x86_64/fork.c


#define ARCH_FORK() \
  INLINE_SYSCALL (clone, 4,                           \
          CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, 0,     \
          NULL, &THREAD_SELF->tid)

c:glibc-2.12.1_sysdeps/unix/sysv/linux/x86_64/sysdep.h


# define INLINE_SYSCALL(name, nr, args...) \
  ({                                          \
    unsigned long int resultvar = INTERNAL_SYSCALL (name, , nr, args);        \
    if (__builtin_expect (INTERNAL_SYSCALL_ERROR_P (resultvar, ), 0))         \
      {                                       \
    __set_errno (INTERNAL_SYSCALL_ERRNO (resultvar, ));           \
    resultvar = (unsigned long int) -1;                   \
      }                                       \
    (long int) resultvar; })

# define INTERNAL_SYSCALL(name, err, nr, args...) \
  INTERNAL_SYSCALL_NCS (__NR_##name, err, nr, ##args)

# define INTERNAL_SYSCALL_NCS(name, err, nr, args...) \
  ({                                          \
    unsigned long int resultvar;                          \
    LOAD_ARGS_##nr (args)                             \
    LOAD_REGS_##nr                                \
    asm volatile (                                \
    "syscall\n\t"                                 \
    : "=a" (resultvar)                                \
    : "0" (name) ASM_ARGS_##nr : "memory", "cc", "r11", "cx");            \
    (long int) resultvar; })

The INLINE_SYSCALL macro expands further, via INTERNAL_SYSCALL, into the macro INTERNAL_SYSCALL_NCS.

Inside it, the inline assembly stores the value of __NR_clone in the rax register and executes the syscall instruction, so you can see that the clone system call is being issued.

By the way, the output operand constraint "=a" means that the result is written to the rax register, and the input operand constraint "0" means that the input uses the same register as operand 0, that is, rax.

c:linux-2.6.11.1_include/asm-x86_64/unistd.h


#define __NR_clone                              56

You can see that the clone system call number is 56 on x86_64 processors. Let's confirm this in the disassembly from earlier.

fork.S


  40d736:   b8 38 00 00 00          mov    $0x38,%eax
  40d73b:   0f 05                   syscall

The hexadecimal number 0x38 is stored in the eax register (the lower 32 bits of rax). This is 56 in decimal.

The clone () system call is implemented by the sys_clone function, which internally calls the do_fork function.

c:linux-2.6.11.1_arch/x86_64/kernel/process.c


asmlinkage long sys_clone(unsigned long clone_flags, unsigned long newsp, void __user *parent_tid, void __user *child_tid, struct pt_regs *regs)
{
    if (!newsp)
        newsp = regs->rsp;
    return do_fork(clone_flags, newsp, regs, 0, parent_tid, child_tid);
}

c:linux-2.6.11.1_kernel/fork.c_1124-1134


long do_fork(unsigned long clone_flags,
          unsigned long stack_start,
          struct pt_regs *regs,
          unsigned long stack_size,
          int __user *parent_tidptr,
          int __user *child_tidptr)
{
    struct task_struct *p;
    int trace = 0; 
    long pid = alloc_pidmap();

long pid = alloc_pidmap(); obtains a new PID for the child process from an array called pidmap_array, which tracks which PIDs are in use.

c:linux-2.6.11.1_kernel/pid.c


typedef struct pidmap {                                                
    atomic_t nr_free;                                                  
    void *page;                                                        
} pidmap_t;       

static pidmap_t pidmap_array[PIDMAP_ENTRIES] =                         
     { [ 0 ... PIDMAP_ENTRIES-1 ] = { ATOMIC_INIT(BITS_PER_PAGE), NULL } };

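The elements of pidmap_array are of type struct pidmap, which is typedef'ed as pidmap_t. Each element holds a page that is used as a bitmap, with one bit per PID, and a count of how many bits are still free.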

c:linux-2.6.11.1_kernel/fork.c_1142-1143



    p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr, pid);

The copy_process function returns a pointer to the task_struct structure, which is a copy of the process descriptor of the parent process. Below, I will look at only the main points.

c:linux-2.6.11.1_kernel/fork.c_820


p = dup_task_struct(current); 

current is a pointer to the task_struct of the current process; dup_task_struct copies its contents into a task_struct for the child process and returns a pointer to the copy. After that, various initialization is performed on the child's copied task_struct.

Then the copy_thread function is called with a pointer to a pt_regs structure as an argument, to set up the child's kernel-mode stack. The pt_regs structure holds the register values saved on entry to kernel mode.

include/asm-x86_64/ptrace.h_39-65


struct pt_regs {
    unsigned long r15;
    unsigned long r14;
    unsigned long r13;
    unsigned long r12;
    unsigned long rbp;
    unsigned long rbx;
/* arguments: non interrupts/non tracing syscalls only save upto here*/
    unsigned long r11;
    unsigned long r10;  
    unsigned long r9; 
    unsigned long r8; 
    unsigned long rax;
    unsigned long rcx;
    unsigned long rdx;
    unsigned long rsi;
    unsigned long rdi;
    unsigned long orig_rax;
/* end of arguments */  
/* cpu exception frame or undefined */
    unsigned long rip;
    unsigned long cs; 
    unsigned long eflags; 
    unsigned long rsp; 
    unsigned long ss; 
/* top of stack page */ 
};

c:linux-2.6.11.1_arch/x86_64/kernel/process.c_366-386


int copy_thread(int nr, unsigned long clone_flags, unsigned long rsp, 
        unsigned long unused,
    struct task_struct * p, struct pt_regs * regs)
{
    int err;
    struct pt_regs * childregs;
    struct task_struct *me = current;

    childregs = ((struct pt_regs *) (THREAD_SIZE + (unsigned long) p->thread_info)) - 1;

    *childregs = *regs;

    childregs->rax = 0;
    childregs->rsp = rsp;
    if (rsp == ~0UL) {
        childregs->rsp = (unsigned long)childregs;
    }   

    p->thread.rsp = (unsigned long) childregs;
    p->thread.rsp0 = (unsigned long) (childregs+1);
    p->thread.userrsp = me->thread.userrsp; 

childregs = ((struct pt_regs *) (THREAD_SIZE + (unsigned long) p->thread_info)) - 1; places the pt_regs structure at the very top of the kernel stack in the memory area allocated to the process. Because of C pointer arithmetic, subtracting 1 from a struct pt_regs pointer moves the address down by the size of one pt_regs structure, so the result is the address where pt_regs is stored.

The entire contents of regs are first copied into childregs, and then some of the register values are changed.

The one to pay attention to here is childregs->rax = 0, which puts 0 in the saved rax register. By the calling convention, a function's return value is returned in the rax (or eax) register, so for the child process this becomes, as is, the value returned to user space.

p->thread.rsp0 holds the top address of the kernel stack.

c:linux-2.6.11.1_kernel/fork.c_32-38


        if ((p->ptrace & PT_PTRACED) || (clone_flags & CLONE_STOPPED)) {                                                                      
            /*
             * We'll start up with an immediate SIGSTOP.              
             */ 
            sigaddset(&p->pending.signal, SIGSTOP);                   
            set_tsk_thread_flag(p, TIF_SIGPENDING);                   
        }           

If a debugger is tracing the child process or the CLONE_STOPPED flag is set, a SIGSTOP is made pending so that the child process stops as soon as it starts.

c:linux-2.6.11.1_kernel/fork.c_40-43


        if (!(clone_flags & CLONE_STOPPED))
            wake_up_new_task(p, clone_flags);                         
        else
            p->state = TASK_STOPPED;

If the CLONE_STOPPED flag is not set, the wake_up_new_task function is called to make the child runnable and sort out the scheduling between parent and child. If the flag is set, the state member is set to TASK_STOPPED.

c:linux-2.6.11.1_kernel/fork.c_45-48


        if (unlikely (trace)) {
            current->ptrace_message = pid; 
            ptrace_notify ((trace << 8) | SIGTRAP);
        }    

This part is for debuggers. It notifies the tracer of the new child's PID so that the debugger can properly trace both the parent and the child process.

c:linux-2.6.11.1_kernel/fork.c_50-54


        if (clone_flags & CLONE_VFORK) {
            wait_for_completion(&vfork);
            if (unlikely (current->ptrace & PT_TRACE_VFORK_DONE))
                ptrace_notify ((PTRACE_EVENT_VFORK_DONE << 8) | SIGTRAP);
        }  

This is the processing for the vfork() system call. Since vfork shares the same memory address space between the parent and the child, the parent's execution is suspended until the child terminates or calls execve. As described later, this is one of the devices Linux provides to avoid the wasted copying done by fork.

c:linux-2.6.11.1_arch/x86_64/kernel/process.c


asmlinkage long sys_vfork(struct pt_regs *regs)
{
    return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs->rsp, regs, 0,
            NULL, NULL);
}

c:linux-2.6.11.1_kernel/fork.c_55-59


    } else {
        free_pidmap(pid);
        pid = PTR_ERR(p);
    }
    return pid;

If copy_process fails, the process ID obtained for the child is released by marking it unused in pidmap_array, and the error pointer is converted to an error code and stored in pid. If it does not fail, the child's process ID obtained earlier is returned as is.

Execution then returns to __libc_fork in user mode, which branches on whether pid is 0 to do the appropriate post-processing in the parent and in the child, and then returns the value to the caller.

Isn't a copy of the parent process useless?

Linux creates a new process by copying the parent's resources to the child, but the child usually calls a system call such as execve right away and runs in an address space different from the parent's. Since that is so common, the copying seems inefficient.

To address this, Linux improves efficiency with techniques such as copy-on-write, in which both processes share the same physical pages and a new page is allocated only when one of them writes to it, and sharing of per-process kernel data structures such as open file descriptors.

Next time, when I continue writing "Introduction to Socket API to learn in C language", I will do some programming with multiple processes.
