Hacking Linux file descriptors

You have probably come across the term file descriptor (or socket descriptor) both in Linux programs that process data over the network and in programs that process data in local files.

It also appeared as a socket descriptor in Introduction to Socket API Learned in C Part 1 Server Edition.

Most people understand that it is an integer value identifying a file that a process has made ready for I/O, but surprisingly few know much beyond that.

Of course, as long as you are just writing applications on Linux, you don't need to know any more than that: if you can choose the right CRUD-related APIs and program against them using the file descriptor, you will have no trouble as a programmer. (The one problem you typically hit with file descriptors is a process exceeding the maximum number of descriptors it may open. In that case you either raise the limit or review the application design to see whether it really needs that many open at once.)

In C, it looks like the following.

`c`


#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main (int argc, char* argv[]) {

    int fd; // File Descriptor

    fd = open("tmp.txt", O_RDONLY);
    if (fd < 0) {
        perror("open() failed.");
        exit(EXIT_FAILURE);
    }   

    printf("%d\n", fd); // 3

    close(fd);

    return EXIT_SUCCESS;
}

The integer fd is the file descriptor. On success, open returns a value greater than or equal to 0.

File descriptors are also usually wrapped by programming languages and libraries, which layer more efficient control data structures on top of them for CRUD operations.

In the following example, fopen returns a stream that buffers I/O to reduce the number of system calls, among other measures for efficient I/O, so these are the APIs you normally use. Other programming languages have similar APIs; PHP, which I write often, also has a utility function called fopen.

`c`


#include <stdio.h>
#include <stdlib.h>

int main (int argc, char* argv[]) {

    FILE *fp; // File pointer

    fp = fopen("tmp.txt", "r");
    if (fp == NULL) {
        fprintf(stderr, "fopen() failed.\n");
        exit(EXIT_FAILURE);
    }   

    printf("%d\n", fileno(fp)); // 3 (fileno() is the portable way to get the descriptor)

    fclose(fp);

    return EXIT_SUCCESS;
}

Linux (Unix) is a system that abstracts everything in the system into the concept of a file. The general idea of the Linux file system is not so complicated that it takes days to explain; it is simple enough to describe briefly.

Once you grasp the abstract concept of a file, the system offers an interface through which programmers and ordinary users alike can operate on everything intuitively and uniformly.

The file descriptor is the identifier used to access such an abstract file entity.

I thought that knowing a little more about file descriptors, which form the basis of the system, would let me make better use of Linux, so I investigated them.

However, "file descriptor" by itself is too vague a target to investigate, so I decided to start with the system calls related to it.

Hacking the open system call

I will explore it by following the processing of the open call shown earlier. Since open returns a file descriptor, following this processing should reveal something.

`asm`


 415879:   b8 02 00 00 00          mov    $0x2,%eax
 41587e:   0f 05                   syscall

The listing above is a disassembly of glibc's __libc_open function, which backs the open function. It loads 2, the system call number of open, into eax and executes the syscall instruction.

`include/asm-x86_64/unistd.h`


#define __NR_open                                2
__SYSCALL(__NR_open, sys_open)

You can see that the corresponding function in the sys_call_table table is sys_open.

Now, let's follow the process of sys_open.

`fs/open.c(933-941)`


asmlinkage long sys_open(const char __user * filename, int flags, int mode)
{
    char * tmp; 
    int fd, error;

#if BITS_PER_LONG != 32
    flags |= O_LARGEFILE;
#endif

First, on machines where long is not 32 bits (that is, 64-bit machines), the O_LARGEFILE flag is enabled automatically, which allows files larger than 2GB to be handled. On today's 64-bit machines you don't have to worry about this, but you do need to be careful when your application handles files on a 32-bit machine. Incidentally, I have heard of modules whose development stopped in the 32-bit era being unable to open files larger than 2GB; presumably they never specified this flag.

`fs/open.c(941)`


    tmp = getname(filename);

The getname function copies the file name string from user space into kernel space. The memory for it is allocated with the slab allocator, one of the mechanisms Linux uses for efficient memory allocation. I would like to delve into the slab allocator another time.

`fs/open.c(942-944)`


    fd = PTR_ERR(tmp);
    if (!IS_ERR(tmp)) {
        fd = get_unused_fd();

If there was no error, the get_unused_fd function is called to find a free file descriptor. Judging from its name, this function should tell us what a file descriptor really is.

`fs/open.c(838-845)`


int get_unused_fd(void)
{
    struct files_struct * files = current->files;
    int fd, error;

    error = -EMFILE;
    spin_lock(&files->file_lock);

As I also explained in the article on the fork system call, current is a macro that yields a pointer to the process descriptor (the task_struct structure) of the process currently running on this CPU.

current->files is a pointer to the files_struct structure, which holds information about the files opened by the current process.

The files_struct structure is defined as follows. Keep it in mind, because we will come back to it at the end.

`include/linux/file.h`


/*
 * Open file table structure
 */
struct files_struct {
        atomic_t count;
        spinlock_t file_lock;     /* Protects all the below members.  Nests inside tsk->alloc_lock */
        int max_fds;
        int max_fdset;
        int next_fd;
        struct file ** fd;      /* current fd array */
        fd_set *close_on_exec;
        fd_set *open_fds;
        fd_set close_on_exec_init;
        fd_set open_fds_init;
        struct file * fd_array[NR_OPEN_DEFAULT];
};

files->open_fds initially points to the fd_set structure files->open_fds_init, which represents the currently open file descriptors as a bitmap of 64 x 16 = 1024 bits (on a 64-bit machine). The definitions below are the basis for that calculation.

`include/linux/posix_types.h`


#define __NFDBITS   (8 * sizeof(unsigned long))
#define __FD_SETSIZE    1024
#define __FDSET_LONGS   (__FD_SETSIZE/__NFDBITS)

typedef struct {
    unsigned long fds_bits [__FDSET_LONGS];
} __kernel_fd_set;

Normally 1024 bits suffice, but if they do not, the table is extended by the expand_files function. This ties in with the point made at the beginning: if a process runs out of file descriptors, you can raise the limit.

`fs/open.c(847)`


    fd = find_next_zero_bit(files->open_fds->fds_bits,
                files->max_fdset,
                files->next_fd);

Here the find_next_zero_bit function searches the bitmap for a free (zero) bit, starting at next_fd.

`fs/open.c(872-874)`


    FD_SET(fd, files->open_fds);
    FD_CLR(fd, files->close_on_exec);
    files->next_fd = fd + 1;

This adds the new file descriptor to the set of open file descriptors and removes it from the set of descriptors that are closed on exec().

Because the descriptor is not closed on exec() by default, it stays open in a program started by a child process, which is convenient for communicating with child processes. This is also relevant to pipes, which implement interprocess communication.

Then the next_fd member is updated to the newly assigned file descriptor plus one. It is used as the starting point of the next file descriptor scan.

Well, we have obtained a file descriptor, but we still cannot see what it actually is. So far it feels as if we have merely been handed a free number to use as a key.

We now return to sys_open, where the value is stored in fd as the file descriptor.

`fs/open.c(945-951)`


        if (fd >= 0) { 
            struct file *f = filp_open(tmp, flags, mode);
            error = PTR_ERR(f);
            if (IS_ERR(f))
                goto out_error;
            fd_install(fd, f);
        }

The filp_open function is called to obtain the address of a file structure based on the file path, access mode flags, and permission bits passed as arguments.

The file structure is as follows. All of its members are important, so I would like to devote a separate article to the file structure alone.

`include/linux/fs.h`


struct file {
    struct list_head    f_list;
    struct dentry       *f_dentry;
    struct vfsmount         *f_vfsmnt;
    struct file_operations  *f_op;
    atomic_t        f_count;
    unsigned int        f_flags;
    mode_t          f_mode;
    int         f_error;
    loff_t          f_pos;
    struct fown_struct  f_owner;
    unsigned int        f_uid, f_gid;
    struct file_ra_state    f_ra;

    size_t          f_maxcount;
    unsigned long       f_version;
    void            *f_security;

    /* needed for tty driver, and maybe others */
    void            *private_data;

#ifdef CONFIG_EPOLL
    /* Used by fs/eventpoll.c to link all the hooks to this file */
    struct list_head    f_ep_links;
    spinlock_t      f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
    struct address_space    *f_mapping;
};

Now, let's follow the processing of the filp_open function.

`fs/open.c(753-762)`


struct file *filp_open(const char * filename, int flags, int mode)
{
    int namei_flags, error;
    struct nameidata nd;

    namei_flags = flags;
    if ((namei_flags+1) & O_ACCMODE)
        namei_flags++;
    if (namei_flags & O_TRUNC)
        namei_flags |= 2;

The access mode is set in namei_flags. Notice that the flags are converted into a special format.

In binary, 00 (read only) becomes 01, 01 (write only) becomes 10, and 10 (read and write) becomes 11. In other words, bit 0 set means read and bit 1 set means write. If O_TRUNC is specified, the write bit is also set.

This converted flag will be used for later processing.

`fs/open.c(764)`


    error = open_namei(filename, namei_flags, mode, &nd);

The open_namei function handles the important part of the actual file opening. As arguments, pass the filename, the converted access mode flag, the permission bits, and a pointer to the nameidata structure.

The nameidata structure is defined as below.

`include/linux/namei.h`


struct nameidata {
    struct dentry   *dentry;
    struct vfsmount *mnt;
    struct qstr last;
    unsigned int    flags;
    int     last_type;
    unsigned    depth;
    char *saved_names[MAX_NESTED_LINKS + 1]; 

    /* Intent data */
    union {
        struct open_intent open;
    } intent;
};

This is also an important structure that I would like to dig into some time, but for now the members to pay attention to are dentry, a pointer to the dentry structure, and mnt, a pointer to the vfsmount structure that records the state of a file system mounted on the system.

The task inside the open_namei function is to get the object of the nameidata structure based on the pathname and flags.

The main part of the work is done in another function called path_lookup. The lookup is fine-tuned according to the access mode and starts from the file system information associated with the process, which is stored in the fs member of the current process descriptor.

As a result, the nameidata structure contains the data resulting from the search by pathname.

`fs/open.c(764-768)`


    if (!error)
        return dentry_open(nd.dentry, nd.mnt, flags);

    return ERR_PTR(error);
}

At this point, we have a pointer to the dentry structure for the pathname and a pointer to the vfsmount structure. The dentry_open function creates a file object from these two pointers and the original flags (flags, not the converted namei_flags).

`fs/open.c(945-951)`


        if (fd >= 0) { 
            struct file *f = filp_open(tmp, flags, mode);
            error = PTR_ERR(f);
            if (IS_ERR(f))
                goto out_error;
            fd_install(fd, f);
        }

The variable f holds the pointer to the file object obtained from dentry_open. If there was no error, the file descriptor and the pointer to the file object are passed to the fd_install function, which is the heart of the file descriptor mechanism.

`fs/open.c`


void fastcall fd_install(unsigned int fd, struct file * file)
{
    struct files_struct *files = current->files;
    spin_lock(&files->file_lock);
    if (unlikely(files->fd[fd] != NULL))
        BUG();
    files->fd[fd] = file;
    spin_unlock(&files->file_lock);
}

The pointer to the file object obtained earlier from filp_open is stored in the fd member (struct file **) of the files member (struct files_struct *) of the current process descriptor.

This fd member is a pointer to an array of pointers to file objects, and the integer value of a file descriptor is the index into this array.

Incidentally, fd initially points to the fd_array member of the same structure, but on a 64-bit machine, once the number of file descriptors exceeds 64, a new area is allocated and fd points to that instead.

That is, files->fd[0] points to the file object for file descriptor 0, and files->fd[1] points to the file object for file descriptor 1.

As is well known, 0 usually corresponds to standard input, 1 to standard output, and 2 to standard error, so the descriptor of the first file a user opens should be 3. Indeed, the program in the first demo prints 3.

So what is a file descriptor in Linux? The answer can be stated as:

__The index into the array of file objects that the current process has open__

That is a little hard to grasp in words, so intuitively it is as follows.

`c`


// [fd] is the file descriptor
current->files->fd[fd]

Operations on an opened file are then performed by using the descriptor as an index into the fd member and accessing the file object found there.

The file object has an f_pos member that records the current offset into the file, so processing can continue where it left off by advancing this value by the number of bytes transferred. The operations themselves are carried out by the file operation functions (read, write, and so on) registered in the f_op member of the file object.

By the way, since 2.6.14 a new fdtable structure has been introduced, and the files_struct structure has changed accordingly. The details have kept changing little by little, but the underlying structure remains the same up to Linux 4.9.

`c:Linux2.6.14`


struct fdtable {
    unsigned int max_fds;
    int max_fdset;
    int next_fd;
    struct file ** fd;      /* current fd array */
    fd_set *close_on_exec;
    fd_set *open_fds;
    struct rcu_head rcu;
    struct files_struct *free_files;
    struct fdtable *next;
};

struct files_struct {
    atomic_t count;
    struct fdtable *fdt;
    struct fdtable fdtab;
    fd_set close_on_exec_init;
    fd_set open_fds_init;
    struct file * fd_array[NR_OPEN_DEFAULT];
    spinlock_t file_lock;     /* Protects concurrent writers.  Nests inside tsk->alloc_lock */
};

void fastcall fd_install(unsigned int fd, struct file * file)
{
    struct files_struct *files = current->files;
    struct fdtable *fdt;
    spin_lock(&files->file_lock);
    fdt = files_fdtable(files);
    BUG_ON(fdt->fd[fd] != NULL);
    rcu_assign_pointer(fdt->fd[fd], file);
    spin_unlock(&files->file_lock);
}

Therefore, instead of referencing the fd member directly in the files_struct structure, the code now reaches fd (a pointer to an array of pointers to file objects) through the fdt member.

General-purpose file descriptors

Now, let's take a step further with file descriptors.

At the beginning, I mentioned that both programs that process data over the network in Linux and programs that process data in local files use file descriptors (socket descriptors).

Introduction to Socket API Learned in C # 3 Server / Client Edition # 1 uses socket APIs such as recv and send, but these calls can be replaced with basic I/O APIs such as read or write.

This is because what the socket call returns is a file descriptor, and the entity accessed through it is nothing other than a file object. The following is a partial excerpt of the processing that opens a socket.

`net/socket.c(377)`


struct file *file = get_empty_filp();

Get a pointer to a new file object with the get_empty_filp function. This function also allocates a memory area for the file object with the slab allocator.

`net/socket.c(402-407)`


        sock->file = file;
        file->f_op = SOCK_INODE(sock)->i_fop = &socket_file_ops;
        file->f_mode = FMODE_READ | FMODE_WRITE;
        file->f_flags = O_RDWR;
        file->f_pos = 0; 
        fd_install(fd, file);

The acquired file object is initialized, and finally fd_install is called. This is the same as in the open system call.

This is not limited to sockets; pipes work the same way. The following is a partial excerpt of the processing of the pipe system call.

`fs/pipe.c`


    f1 = get_empty_filp();
    if (!f1)
        goto no_files;

    f2 = get_empty_filp();
    if (!f2)
        goto close_f1;

f1 is the file object for reading and f2 is the file object for writing.

`fs/pipe.c(760-774)`


    /* read file */
    f1->f_pos = f2->f_pos = 0;
    f1->f_flags = O_RDONLY;
    f1->f_op = &read_pipe_fops;
    f1->f_mode = FMODE_READ;
    f1->f_version = 0;

    /* write file */
    f2->f_flags = O_WRONLY;
    f2->f_op = &write_pipe_fops;
    f2->f_mode = FMODE_WRITE;
    f2->f_version = 0;

    fd_install(i, f1);
    fd_install(j, f2);

Each file object is initialized and fd_install is called at the end, so you can see that this is the same procedure as in open.

As you can see, file descriptors play a very important role in Linux, a system that tries to handle everything as a file. Just in case, I also read the code of Linux 4.7, the stable version at the time of writing, and the basic processing has not changed: the idea is still alive.

I would like to take a closer look at the data structures and processing that I did not explain in detail this time on another occasion, when I explore the file system.

Referenced source code

Linux kernel: 2.6.11 glibc:glibc-2.12.1 CPU:x86_64

There is no particular reason for choosing these versions, other than that they seemed neither too old nor too new.