[LINUX] I tried the fsopen(2) system call, but I couldn't beat EINVAL ...

Apologies for being very late: this is the day-20 article of Linux Advent Calendar 2020. Today I'm going to talk about trying fsopen(2), which is available in Linux 5.9.3 (although I never got as far as making the sample work ...).

Background

I have a hobby (?) of visiting The Linux Kernel Archives whenever the mood strikes and, if a new version of the Linux kernel has been released, building it locally. In the 5.9.3 source tree I came across a commit log saying that the system call fsopen(2) had been added. fsopen(2) ...? It was a system call I had never heard of, so today's story is that I played with it for a while.

What is fsopen(2)?

fsopen(2) seems to be a fairly new system call. It looks like a feature for mounting a filesystem dynamically while a process is running, by handing it a block device (I have not worked out the specific use cases yet ...).

fsopen(2) is defined in linux-5.9.3/fs/fsopen.c. It appears to return a file descriptor for operating on the filesystem named by the argument _fs_name.

/*
 * Open a filesystem by name so that it can be configured for mounting.
 *
 * We are allowed to specify a container in which the filesystem will be
 * opened, thereby indicating which namespaces will be used (notably, which
 * network namespace will be used for network filesystems).
 */
SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
{
...
    fs_name = strndup_user(_fs_name, PAGE_SIZE);
    if (IS_ERR(fs_name))
        return PTR_ERR(fs_name);

    fs_type = get_fs_type(fs_name);
    kfree(fs_name);
    if (!fs_type)
        return -ENODEV;

    fc = fs_context_for_mount(fs_type, 0);
...
    return fscontext_create_fd(fc, flags & FSOPEN_CLOEXEC ? O_CLOEXEC : 0);

Examples of the value to pass as _fs_name appear in the commit message of fsopen(2). The following is quoted from that commit message. It seems you simply pass a filesystem name such as ext4 or afs.

For example:

	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
...
	sfd = fsopen("afs", -1);
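
To make the return values concrete, here is a minimal sketch of calling fsopen(2) through syscall() and turning a failure into a negative errno value. The helper names try_fsopen and report_fsopen are made up for illustration, and the fallback number 430 is an assumption that only holds where the libc headers lack SYS_fsopen on x86_64. Going by the kernel listing above, an unknown filesystem name should come back as ENODEV; other errors (for example a permission error for an unprivileged caller, or ENOSYS on a kernel without the call) are possible as well.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: x86_64 number, used only when the libc headers predate fsopen(2). */
#ifndef SYS_fsopen
#define SYS_fsopen 430
#endif

/* Value of FSOPEN_CLOEXEC in include/uapi/linux/mount.h, renamed to avoid header clashes. */
#define SKETCH_FSOPEN_CLOEXEC 0x00000001

/* Returns a filesystem-context fd on success, or a negative errno value on failure. */
static int try_fsopen(const char *fs_name)
{
    long fd = syscall(SYS_fsopen, fs_name, SKETCH_FSOPEN_CLOEXEC);

    if (fd < 0)
        return -errno;
    return (int)fd;
}

/* Prints the outcome for one filesystem name and closes the fd if one was returned. */
static void report_fsopen(const char *fs_name)
{
    int fd = try_fsopen(fs_name);

    if (fd < 0) {
        printf("fsopen(\"%s\") failed: %s\n", fs_name, strerror(-fd));
        return;
    }
    printf("fsopen(\"%s\") returned fd %d\n", fs_name, fd);
    close(fd);
}
```

Calling report_fsopen("ext4") and report_fsopen("no_such_fs") from main() as root is a quick way to see the difference between a registered filesystem type and an unknown one.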

Reading further in the commit message, the necessary options are set by calling fsconfig(2) on the file descriptor obtained from fsopen(2). Finally, calling fsmount(2) and then move_mount(2) appears to mount the filesystem at runtime.

	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
	fsconfig(sfd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
	fsconfig(sfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);
	fsconfig(sfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
	fsconfig(sfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
	fsconfig(sfd, FSCONFIG_SET_STRING, "sb", "1", 0);
	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	fsinfo(sfd, NULL, ...); // query new superblock attributes
	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
	move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

Calling fsopen(2) from userland using the sample

Linux-5.9.3/samples/vfs/test-fsmount.c is a sample that uses fsopen(2). Let's use it to see how fsopen(2) actually behaves. (But, as I'll explain later, I couldn't get the sample to work this time ... (T_T))

Pitfalls when building the sample

Comparing the sample code in Linux-5.9.3/samples/vfs/test-fsmount.c with the commit message of fsopen(2), some of the option names differ slightly. To actually run the sample, you need to check it against the source code and adapt it to the current definitions (macro constants and so on).

Also, familiar (?) system calls such as open(2) come with wrapper definitions for userland, but when you build Linux 5.9.3 and install the kernel yourself, you have to take care of this part on your own.

Linux-5.9.3/samples/vfs/test-fsmount.c contains the following definitions: the system call number is defined as __NR_fsopen, and the system call is invoked directly with syscall(). The fallback values are -1, and as the source comment "Hope -1 isn't a syscall" suggests, you apparently have to fill in the real system call numbers for fsopen(2) and friends yourself ...

/* Hope -1 isn't a syscall */
#ifndef __NR_fsopen
#define __NR_fsopen -1
#endif
#ifndef __NR_fsmount
#define __NR_fsmount -1
#endif
#ifndef __NR_fsconfig
#define __NR_fsconfig -1
#endif
#ifndef __NR_move_mount
#define __NR_move_mount -1
#endif
...
static inline int fsopen(const char *fs_name, unsigned int flags)
{
        return syscall(__NR_fsopen, fs_name, flags);
}
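
To see why the -1 placeholder cannot be left in place, the following minimal sketch invokes system call number -1 and reports the errno it fails with. The helper name placeholder_syscall_errno is hypothetical. On an ordinary x86_64 kernel I would expect ENOSYS, though a sandbox or seccomp filter may report a different error, so the sketch only relies on the call failing.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invokes the placeholder syscall number -1; returns 0 on success, else the errno it failed with. */
static int placeholder_syscall_errno(void)
{
    errno = 0;
    if (syscall(-1L) == -1)
        return errno;
    return 0;
}
```

With the placeholder left in, every wrapper in the sample would fail in this way before the kernel code for fsopen(2) is ever reached.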

Finding the system call numbers

If you search for __NR_fsopen in the kernel source tree, you will find the following macro constants. Apparently these are the values the sample program should use.

$ find arch/x86 -type f | grep \\.h | xargs egrep '__NR_fsopen|__NR_fsmount|__NR_fsconfig|__NR_move_mount' | grep unistd_64
arch/x86/include/generated/uapi/asm/unistd_64.h:#define __NR_move_mount 429
arch/x86/include/generated/uapi/asm/unistd_64.h:#define __NR_fsopen 430
arch/x86/include/generated/uapi/asm/unistd_64.h:#define __NR_fsconfig 431
arch/x86/include/generated/uapi/asm/unistd_64.h:#define __NR_fsmount 432
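
Rather than hard-coding the numbers unconditionally, one option is to prefer the SYS_* constants from <sys/syscall.h> and fall back to the values found above only when the installed libc headers do not know the new calls. This is a sketch under the assumption that the fallback numbers are the x86_64 ones from unistd_64.h. The table name mount_api_syscalls and the print helper are made up for illustration.

```c
#include <stdio.h>
#include <sys/syscall.h>

/* Assumption: x86_64 values from the generated unistd_64.h, used only as a fallback. */
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif
#ifndef SYS_fsopen
#define SYS_fsopen 430
#endif
#ifndef SYS_fsconfig
#define SYS_fsconfig 431
#endif
#ifndef SYS_fsmount
#define SYS_fsmount 432
#endif

struct nr_entry {
    const char *name;
    long nr;
};

/* The four calls used by the sample, in ascending number order. */
static const struct nr_entry mount_api_syscalls[] = {
    { "move_mount", SYS_move_mount },
    { "fsopen",     SYS_fsopen },
    { "fsconfig",   SYS_fsconfig },
    { "fsmount",    SYS_fsmount },
};

/* Prints each name with the number this build will pass to syscall(). */
static void print_mount_api_syscalls(void)
{
    size_t count = sizeof(mount_api_syscalls) / sizeof(mount_api_syscalls[0]);

    for (size_t i = 0; i < count; i++)
        printf("%-10s %ld\n", mount_api_syscalls[i].name, mount_api_syscalls[i].nr);
}
```

Printing the table once at startup makes it easy to compare what the program uses against the grep output above.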

Preparing the necessary header files

The sample also includes header files, and you should use the ones shipped with Linux 5.9.3 rather than the ones installed on your system.

Here I used the following steps to create include/linux and uapi directories next to the sample program and copy the necessary header files into them.

$ cd /usr/src/samples/vfs/
$ mkdir -p include/linux
$ cp -r ../../include/linux/ ./include/
$ cp -r ../../include/uapi/ ./uapi

Building the sample

You can compile the sample program with the following command (although you will get warnings at compile time ...).

$ gcc -o test-fsmount test-fsmount.c -I./uapi -I./include/

Creating an ext4 filesystem for the sample

Create an ext4 filesystem on /dev/sdb for the sample program to use with fsopen(2).

$ dmesg | grep sdb
[    4.375191] sd 3:0:0:0: [sdb] 30714 512-byte logical blocks: (15.7 MB/15.0 MiB)
[    4.375750] sd 3:0:0:0: [sdb] Write Protect is off
[    4.376263] sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    4.376284] sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    4.385157] sd 3:0:0:0: [sdb] Attached SCSI disk
$ sudo mkfs.ext4 /dev/sdb
$ sudo mount /dev/sdb /mnt
$ df -h | egrep 'File|mnt'
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb          14M  252K   13M    2% /mnt
$
$ sudo umount /mnt

Running the sample program

Before running the program, let's take a quick look at the sample code. The flow is: get a file descriptor with fsopen(2), set the filesystem parameters with E_fsconfig (which calls fsconfig(2) internally), then mount the filesystem with fsmount(2) and move_mount(2).

#define __NR_fsopen     430
#define __NR_fsmount    432
#define __NR_fsconfig   431
#define __NR_move_mount 429
...
int main(int argc, char *argv[])
{
        int fsfd, mfd;

        /* Mount a publically available AFS filesystem */
        fsfd = fsopen("ext4", FSOPEN_CLOEXEC);
        if (fsfd == -1) {
                perror("fsopen");
                exit(1);
        }

        E_fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sdb", 0);
        E_fsconfig(fsfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);
        E_fsconfig(fsfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
        E_fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
        E_fsconfig(fsfd, FSCONFIG_SET_STRING, "sb", "1", 0);

        mfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RDONLY|MOUNT_ATTR_NOATIME);
        if (mfd < 0)
                mount_error(fsfd, "fsmount");
        E(close(fsfd));

        if (move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH) < 0) {
                perror("move_mount");
                exit(1);
        }
...

However, when I ran the sample program to check its behavior, fsmount(2) failed with "Invalid argument" ...

$ ./test-fsmount
fsmount: Invalid argument
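
The sample routes this failure through mount_error(fsfd, ...), which suggests the context file descriptor can itself be read for diagnostics. As far as I understand it, each read() on a filesystem-context fd returns one queued message and fails once the queue is empty. The sketch below relies on that assumption, and the helper name drain_fs_messages is hypothetical. Whether ext4 on 5.9.3 queues anything for this particular failure is not something this article has checked.

```c
#include <stdio.h>
#include <unistd.h>

/* Reads queued messages from a filesystem-context fd, one per read(); returns how many were printed. */
static int drain_fs_messages(int fsfd)
{
    char buf[4096];
    int count = 0;

    for (;;) {
        ssize_t n = read(fsfd, buf, sizeof(buf) - 1);

        if (n <= 0)
            break;  /* read failed (queue empty or bad fd) or returned nothing */
        buf[n] = '\0';
        fprintf(stderr, "fs message: %s\n", buf);
        count++;
    }
    return count;
}
```

Calling drain_fs_messages(fsfd) right after a failing fsconfig(2) or fsmount(2), before closing fsfd, would show whatever the kernel recorded.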

Finding where fsmount(2) fails

The error message is "Invalid argument", so the system call must be returning EINVAL. Let's look for the cause in the source code of fsmount(2).

/*
 * Create a kernel mount representation for a new, prepared superblock
 * (specified by fs_fd) and attach to an open_tree-like file descriptor.
 */
SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags,
        unsigned int, attr_flags)
{
...
    if ((flags & ~(FSMOUNT_CLOEXEC)) != 0)
        return -EINVAL;

    if (attr_flags & ~(MOUNT_ATTR_RDONLY |
               MOUNT_ATTR_NOSUID |
               MOUNT_ATTR_NODEV |
               MOUNT_ATTR_NOEXEC |
               MOUNT_ATTR__ATIME |
               MOUNT_ATTR_NODIRATIME))
        return -EINVAL;
...
    switch (attr_flags & MOUNT_ATTR__ATIME) {
    case MOUNT_ATTR_STRICTATIME:
        printk(KERN_WARNING "--> MOUNT_ATTR_STRICTATIME\n");
        break;
    case MOUNT_ATTR_NOATIME:
        printk(KERN_WARNING "--> MOUNT_ATTR_NOATIME\n");
        mnt_flags |= MNT_NOATIME;
        break;
    case MOUNT_ATTR_RELATIME:
        printk(KERN_WARNING "--> MOUNT_ATTR_RELATIME\n");
        mnt_flags |= MNT_RELATIME;
        break;
    default:
        printk(KERN_WARNING "--> default\n");
        return -EINVAL;
    }
...
    ret = -EINVAL;
    if (f.file->f_op != &fscontext_fops)
        goto err_fsfd;
...
    /* There must be a valid superblock or we can't mount it */
    ret = -EINVAL;
    if (!fc->root)
        goto err_unlock;
...
err_path:
    path_put(&newmount);
err_unlock:
    mutex_unlock(&fc->uapi_mutex);
err_fsfd:
    fdput(f);
    return ret;
}

There are several places that can return EINVAL. Going through them in order, the first ones return EINVAL when an argument contains a value outside the allowed set. The sample program only passes valid arguments, so these checks should not be the cause.

Tracing the path with printk() (the printk() calls in the listing above are ones I added), I found that EINVAL comes from the if (!fc->root) check below. fc->root refers to f.file->private_data->root and, judging from the source comment, seems to point to the filesystem's superblock.
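
The argument checks can be reproduced in userland to confirm that the sample's values pass them. The sketch below mirrors only the validation shown in the listing. The constant values are copied from include/uapi/linux/mount.h, and the function name check_fsmount_args is made up for illustration.

```c
#include <errno.h>

/* Values copied from include/uapi/linux/mount.h. */
#define FSMOUNT_CLOEXEC        0x00000001
#define MOUNT_ATTR_RDONLY      0x00000001
#define MOUNT_ATTR_NOSUID      0x00000002
#define MOUNT_ATTR_NODEV       0x00000004
#define MOUNT_ATTR_NOEXEC      0x00000008
#define MOUNT_ATTR__ATIME      0x00000070
#define MOUNT_ATTR_RELATIME    0x00000000
#define MOUNT_ATTR_NOATIME     0x00000010
#define MOUNT_ATTR_STRICTATIME 0x00000020
#define MOUNT_ATTR_NODIRATIME  0x00000080

/* Userland mirror of the argument checks in fsmount(2): 0 if accepted, -EINVAL otherwise. */
static int check_fsmount_args(unsigned int flags, unsigned int attr_flags)
{
    /* Only FSMOUNT_CLOEXEC is allowed in flags. */
    if ((flags & ~FSMOUNT_CLOEXEC) != 0)
        return -EINVAL;

    /* Any attribute bit outside the known set is rejected. */
    if (attr_flags & ~(MOUNT_ATTR_RDONLY |
                       MOUNT_ATTR_NOSUID |
                       MOUNT_ATTR_NODEV |
                       MOUNT_ATTR_NOEXEC |
                       MOUNT_ATTR__ATIME |
                       MOUNT_ATTR_NODIRATIME))
        return -EINVAL;

    /* Exactly one atime mode must be selected. */
    switch (attr_flags & MOUNT_ATTR__ATIME) {
    case MOUNT_ATTR_STRICTATIME:
    case MOUNT_ATTR_NOATIME:
    case MOUNT_ATTR_RELATIME:
        return 0;
    default:
        return -EINVAL;
    }
}
```

With the values the sample passes, check_fsmount_args(FSMOUNT_CLOEXEC, MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOATIME) returns 0, whereas combining two atime modes such as MOUNT_ATTR_NOATIME | MOUNT_ATTR_STRICTATIME falls into the default case.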

    fc = f.file->private_data;
    ...
    /* There must be a valid superblock or we can't mount it */
    ret = -EINVAL;
    if (!fc->root)
        goto err_unlock;

... That is as far as I got. Since the filesystem on /dev/sdb, which is handed to the context obtained from fsopen(2), was created with mkfs.ext4, a superblock should exist. I can't understand why it is treated as missing ...

I couldn't investigate any further this time, so for now this article stops at building the sample program. I will look into the cause of EINVAL later.
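
One candidate for that follow-up comes from comparing the two listings in this article. The sequence quoted from the commit message calls fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) before fsmount(2), while the modified sample goes straight from setting options to fsmount(2). If the context is only "configured for mounting" until that command creates the superblock, the missing step would leave fc->root unset. This is a guess and has not been verified on 5.9.3. The sketch below shows the shape of the call sequence under that assumption. The helper name mount_ext4_sketch and the SK_ prefixed constants are made up, the constant values are taken from include/uapi/linux/mount.h, and the fallback syscall numbers assume x86_64.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>      /* AT_FDCWD */
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: x86_64 numbers, used only when the libc headers lack them. */
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif
#ifndef SYS_fsopen
#define SYS_fsopen 430
#endif
#ifndef SYS_fsconfig
#define SYS_fsconfig 431
#endif
#ifndef SYS_fsmount
#define SYS_fsmount 432
#endif

/* Values from include/uapi/linux/mount.h, renamed so they cannot clash with system headers. */
#define SK_FSOPEN_CLOEXEC          0x1
#define SK_FSCONFIG_SET_STRING     1
#define SK_FSCONFIG_CMD_CREATE     6
#define SK_FSMOUNT_CLOEXEC         0x1
#define SK_MOUNT_ATTR_RDONLY       0x1
#define SK_MOVE_MOUNT_F_EMPTY_PATH 0x4

/* Hypothetical helper: mounts ext4 from dev on target; returns 0 or a negative errno value. */
static int mount_ext4_sketch(const char *dev, const char *target)
{
    int ret = 0;
    int fsfd, mfd;

    fsfd = (int)syscall(SYS_fsopen, "ext4", SK_FSOPEN_CLOEXEC);
    if (fsfd < 0)
        return -errno;

    if (syscall(SYS_fsconfig, fsfd, SK_FSCONFIG_SET_STRING, "source", dev, 0) < 0) {
        ret = -errno;
        goto out;
    }

    /* The step the modified sample lacks: ask the kernel to create the superblock. */
    if (syscall(SYS_fsconfig, fsfd, SK_FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0) {
        ret = -errno;
        goto out;
    }

    mfd = (int)syscall(SYS_fsmount, fsfd, SK_FSMOUNT_CLOEXEC, SK_MOUNT_ATTR_RDONLY);
    if (mfd < 0) {
        ret = -errno;
        goto out;
    }

    /* Attach the detached mount at the target path. */
    if (syscall(SYS_move_mount, mfd, "", AT_FDCWD, target, SK_MOVE_MOUNT_F_EMPTY_PATH) < 0)
        ret = -errno;
    close(mfd);
out:
    close(fsfd);
    return ret;
}
```

Run as root, mount_ext4_sketch("/dev/sdb", "/mnt") would be the call matching the setup in this article. If it still returns -EINVAL, the missing FSCONFIG_CMD_CREATE was not the whole story.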

Summary

I tried the fsopen(2) system call available in Linux 5.9.3. A sample program is provided, but it has to be read and modified to match the current implementation. Also, since the sample program does not work yet, further investigation is needed.
