linux (kernel) source analysis: system call entry function definition

Kernel version examined

linux-4.2(mainline)

System call entry function definition

The title says "system call entry function definition", but what I actually cover here is the macro that defines a system call entry function. (I do not go into the body of any entry function.)

In the previous article ("I looked at system call entries. (linux source analysis)") I quoted the definition of the __SYSCALL_DEFINEx macro. This time I trace how that macro is expanded.

The macro defines a system call entry function, and it turned out to be quite esoteric. Still, it uses a number of techniques that are worth studying if you want to improve your C.

First, let's look at the setgid system call as an example of where the macro is actually used.

kernel/sys.c(386): SYSCALL_DEFINE1(setgid, gid_t, gid)

/*
 * setgid() is implemented like SysV w/ SAVED_IDS
 *
 * SMP: Same implicit races as above.
 */
SYSCALL_DEFINE1(setgid, gid_t, gid)
{
	struct user_namespace *ns = current_user_ns();
	const struct cred *old;
	struct cred *new;
	int retval;
	kgid_t kgid;

	kgid = make_kgid(ns, gid);
	if (!gid_valid(kgid))
		return -EINVAL;

	new = prepare_creds();
	if (!new)
		return -ENOMEM;
	old = current_cred();

	retval = -EPERM;
	if (ns_capable(old->user_ns, CAP_SETGID))
		new->gid = new->egid = new->sgid = new->fsgid = kgid;
	else if (gid_eq(kgid, old->gid) || gid_eq(kgid, old->sgid))
		new->egid = new->fsgid = kgid;
	else
		goto error;

	return commit_creds(new);

error:
	abort_creds(new);
	return retval;
}

Looking at the line SYSCALL_DEFINE1(setgid, gid_t, gid), the SYSCALL_DEFINE1 macro takes three arguments, but setgid itself takes only one argument, gid_t gid. The type and the name of each system call argument are passed to the macro as separate macro arguments. As the macro definitions show, the basic implementation is the same whether there is one argument or several.

Related macro definition

The related macro definitions are as follows. (These macros are defined in include/linux/syscalls.h.)

#define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE4(name, ...) SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, _##name, __VA_ARGS__)

#define SYSCALL_DEFINEx(x, sname, ...)				\
	SYSCALL_METADATA(sname, x, __VA_ARGS__)			\
	__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)

#define __PROTECT(...) asmlinkage_protect(__VA_ARGS__)
#define __SYSCALL_DEFINEx(x, name, ...)					\
	asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))	\
		__attribute__((alias(__stringify(SyS##name))));		\
	static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__));	\
	asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__));	\
	asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__))	\
	{								\
		long ret = SYSC##name(__MAP(x,__SC_CAST,__VA_ARGS__));	\
		__MAP(x,__SC_TEST,__VA_ARGS__);				\
		__PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__));	\
		return ret;						\
	}								\
	static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))

SYSCALL_METADATA is a macro that embeds metadata for the system call entry function. (The code that embeds the metadata is enabled only when CONFIG_FTRACE_SYSCALLS is defined in the kernel configuration.) This article focuses on how the system call entry function itself is expanded, so SYSCALL_METADATA is left out.

Reading __SYSCALL_DEFINEx from the top, we also run into the __MAP, __SC_DECL, and __stringify macros. Their definitions are as follows.

#define __MAP0(m,...)
#define __MAP1(m,t,a) m(t,a)
#define __MAP2(m,t,a,...) m(t,a), __MAP1(m,__VA_ARGS__)
#define __MAP3(m,t,a,...) m(t,a), __MAP2(m,__VA_ARGS__)
#define __MAP4(m,t,a,...) m(t,a), __MAP3(m,__VA_ARGS__)
#define __MAP5(m,t,a,...) m(t,a), __MAP4(m,__VA_ARGS__)
#define __MAP6(m,t,a,...) m(t,a), __MAP5(m,__VA_ARGS__)
#define __MAP(n,...) __MAP##n(__VA_ARGS__)

#define __SC_DECL(t, a)	t a
#define __TYPE_IS_L(t)	(__same_type((t)0, 0L))
#define __TYPE_IS_UL(t)	(__same_type((t)0, 0UL))
#define __TYPE_IS_LL(t) (__same_type((t)0, 0LL) || __same_type((t)0, 0ULL))
#define __SC_LONG(t, a) __typeof(__builtin_choose_expr(__TYPE_IS_LL(t), 0LL, 0L)) a
#define __SC_CAST(t, a)	(t) a
#define __SC_ARGS(t, a)	a
#define __SC_TEST(t, a) (void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(t) && sizeof(t) > sizeof(long))
/* include/linux/stringify.h */
#define __stringify_1(x...)	#x
#define __stringify(x...)	__stringify_1(x)

To make the expansion concrete, let's take setgid, which is written as SYSCALL_DEFINE1(setgid, gid_t, gid), and see how the macro expands.

Expanding the SYSCALL_DEFINE1 macro gives SYSCALL_DEFINEx(1, _setgid, gid_t, gid). Expanding SYSCALL_DEFINEx and leaving out the SYSCALL_METADATA part gives __SYSCALL_DEFINEx(1, _setgid, gid_t, gid). Expanding __SYSCALL_DEFINEx in turn gives the following.

/*①*/
asmlinkage long sys_setgid(__MAP(1,__SC_DECL,gid_t,gid))
	__attribute__((alias(__stringify(SyS_setgid))));
/*②*/
static inline long SYSC_setgid(__MAP(1,__SC_DECL,gid_t,gid));	
/*③*/
asmlinkage long SyS_setgid(__MAP(1,__SC_LONG,gid_t,gid));
/*④*/
asmlinkage long SyS_setgid(__MAP(1,__SC_LONG,gid_t,gid))
{
	/*⑤*/
	long ret = SYSC_setgid(__MAP(1,__SC_CAST,gid_t,gid));
	/*⑥*/
	__MAP(1,__SC_TEST,gid_t,gid);
	/*⑦*/
	__PROTECT(1, ret,__MAP(1,__SC_ARGS,gid_t,gid));
	return ret;
}
static inline long SYSC_setgid(__MAP(1,__SC_DECL,gid_t,gid))	/*⑧*/

Let's go through the above from the top, following the circled numbers in the comments.

① Macro expansion

Expanding just the __MAP(1, __SC_DECL, gid_t, gid) part gives __MAP1(__SC_DECL, gid_t, gid), then __SC_DECL(gid_t, gid), and finally gid_t gid. Applying this to ①, it expands to

asmlinkage long sys_setgid(gid_t gid) __attribute__((alias("SyS_setgid")));

The alias attribute makes sys_setgid another name for the SyS_setgid function defined in ④.

Now consider the __MAP macro. Its first argument is the number of (type, name) pairs, and its second argument is a macro that is applied to each pair. __MAP applies that macro to every pair and joins the results with commas. With setgid there is only one pair, so it is not a very illuminating example, but because the second argument here, __SC_DECL, joins the type and the name with a space, the comma-separated results (when there is more than one pair) form the parameter list of a function. Changing the macro given as the second argument changes what is generated for each pair.

② Macro expansion

Based on what we have seen above, line ② expands to static inline long SYSC_setgid(gid_t gid);.

③ Macro expansion

In ③, the second argument of the __MAP macro is __SC_LONG. The __SC_LONG macro is defined as follows.

__SC_LONG macro definition

#define __SC_LONG(t, a) __typeof(__builtin_choose_expr(__TYPE_IS_LL(t), 0LL, 0L)) a
#define __TYPE_IS_LL(t) (__same_type((t)0, 0LL) || __same_type((t)0, 0ULL))
# define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))

The __same_type macro returns whether the types of its arguments a and b match. The __TYPE_IS_LL macro returns whether 0 cast to the type t has the same type as 0LL, or the same type as 0ULL. LL and ULL are the integer suffixes for long long and unsigned long long, so __TYPE_IS_LL is true if t is either long long or unsigned long long, and false otherwise.

With that, we finally reach the definition of __SC_LONG. __builtin_choose_expr is a gcc built-in that yields its second argument if its first argument is true and its third argument if it is false. It is left untouched by the preprocessor and is resolved by the compiler: the first argument must be a constant expression that the compiler can evaluate, and the expression that is not selected is never evaluated. Unlike the ternary operator, the result also keeps the type of the selected expression, which is what __typeof picks up here.

Putting these together, the __SC_LONG macro declares the argument a with the type of 0LL (long long) if t is long long or unsigned long long, and with the type of 0L (long) otherwise.

Based on the above, in ③ gid_t is unsigned int, so __SC_LONG(gid_t, gid) ends up declaring long gid. At this point we are doing more than just chasing macros.

In the end, ③ expands to asmlinkage long SyS_setgid(long gid);

④ Macro expansion

This expands in the same way as ③, to asmlinkage long SyS_setgid(long gid).

⑤ Macro expansion

In ⑤, the second argument of the __MAP macro is __SC_CAST. The __SC_CAST macro is defined as follows.

#define __SC_CAST(t, a)	(t) a

The __SC_CAST macro expands to an expression that casts the argument a to the type t. Since it is the second argument of __MAP, ⑤ expands as follows.

long ret = SYSC_setgid((gid_t)gid);

The SyS_setgid function receives gid as a long and casts it to gid_t when it calls SYSC_setgid.

⑥ Macro expansion

In ⑥, the second argument of the __MAP macro is __SC_TEST. The __SC_TEST macro is defined as follows.

__SC_TEST macro definition

#define __SC_TEST(t, a) (void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(t) && sizeof(t) > sizeof(long))

__TYPE_IS_LL is as described above: it returns whether the type t is long long or unsigned long long. The BUILD_BUG_ON_ZERO macro is defined as follows.

#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); }))

The BUILD_BUG_ON_ZERO macro is extremely elaborate; I could not have come up with it myself. First, the !!(e) part: the first negation turns any non-zero e into 0 and a zero e into 1, and the second negation flips that again, so !!(e) is 1 if e is non-zero and 0 if e is 0. struct { int:-!!(e); } then declares an unnamed bit-field in a struct. If e is 0, the bit-field width is 0, which is allowed. If e is non-zero, the width is -1, which is a compile error. This is how BUILD_BUG_ON_ZERO checks a condition at compile time. Because the struct is wrapped in sizeof, no object is actually defined, and the whole thing is an expression that can be used where a value is expected.

If we force the expansion of ⑥ for setgid (in simplified form), it becomes (void)(sizeof(struct { int:-!!(0); }));. Writing out __builtin_types_compatible_p in the expansion would make it unreadable, so please excuse the shortcut. In short, __SC_TEST causes a compile error when the type t is neither long long nor unsigned long long and is larger than sizeof(long). It works as a compile-time check on changed or newly added system call entries.

⑦ Macro expansion

In ⑦, __SC_ARGS is given as the second argument of the __MAP macro. The __SC_ARGS macro is defined as follows.

#define __SC_ARGS(t, a)	a

Only the argument name is left. The __PROTECT macro is defined as follows.

#define __PROTECT(...) asmlinkage_protect(__VA_ARGS__)
#define asmlinkage_protect(n, ret, args...) \
	__asmlinkage_protect##n(ret, ##args)
#define __asmlinkage_protect_n(ret, args...) \
	__asm__ __volatile__ ("" : "=r" (ret) : "0" (ret), ##args)
#define __asmlinkage_protect0(ret) \
	__asmlinkage_protect_n(ret)
#define __asmlinkage_protect1(ret, arg1) \
	__asmlinkage_protect_n(ret, "m" (arg1))
#define __asmlinkage_protect2(ret, arg1, arg2) \
	__asmlinkage_protect_n(ret, "m" (arg1), "m" (arg2))
#define __asmlinkage_protect3(ret, arg1, arg2, arg3) \
	__asmlinkage_protect_n(ret, "m" (arg1), "m" (arg2), "m" (arg3))
#define __asmlinkage_protect4(ret, arg1, arg2, arg3, arg4) \
	__asmlinkage_protect_n(ret, "m" (arg1), "m" (arg2), "m" (arg3), \
			      "m" (arg4))
#define __asmlinkage_protect5(ret, arg1, arg2, arg3, arg4, arg5) \
	__asmlinkage_protect_n(ret, "m" (arg1), "m" (arg2), "m" (arg3), \
			      "m" (arg4), "m" (arg5))
#define __asmlinkage_protect6(ret, arg1, arg2, arg3, arg4, arg5, arg6) \
	__asmlinkage_protect_n(ret, "m" (arg1), "m" (arg2), "m" (arg3), \
			      "m" (arg4), "m" (arg5), "m" (arg6))

Carrying out the expansion for setgid step by step:

__PROTECT(1, ret,__MAP(1,__SC_ARGS,gid_t,gid));
__PROTECT(1, ret, gid);
asmlinkage_protect(1, ret, gid);
__asmlinkage_protect1(ret, gid);
__asmlinkage_protect_n(ret, "m" (gid));
__asm__ __volatile__ ("" : "=r" (ret) : "0" (ret), "m" (gid));

I managed to expand it, but I had no idea what it meant. Searching the net, I found a site that explains it.

According to that article, an asm statement written like the one above uses what is called extended assembly syntax. It is described as follows.

__asm__ ( assembly template
        : output operands               /* optional */
        : input operands                /* optional */
        : list of clobbered registers   /* optional */
        );

The assembly template contains the assembler instructions, but in extended assembly syntax the way registers and so on are written differs slightly from basic assembler instructions. Each operand is written as an operand constraint string followed by a C expression in parentheses.

The following is a sample from the same site.

int in1 = 10, in2 = 5, out1;
__asm__ ("movl %1, %%eax;"        /* assembly template */
         "addl %2, %%eax;"
         "movl %%eax, %0;"
        : "=r" (out1)             /* output */
        : "r" (in1), "r" (in2)    /* input */
        : "%eax"                  /* clobbered registers */
        );

With that in mind, look at __asm__ __volatile__ ("" : "=r" (ret) : "0" (ret), "m" (gid));. First, the assembly template is empty. The output is ret, and the inputs are ret and gid. So what are "=r", "0", and "m"? The same site explains these too: they are called constraint characters.

- "r" (on x86): "the register to use is assigned automatically from among eax, ebx, ecx, edx, esi, and edi."
- "0": refers to the constraint of output operand number 0. For example, with the output operands : "=a" (in_out), "=b" (in2), "=a" (in_out) is "0" and "=b" (in2) is "1". ⇒ I see, so here it means the same constraint as the output operand ret.
- "m": a memory constraint, "the operand is accessed directly in memory." ⇒ gid is an argument of SyS_setgid and should be on the stack. Is that why it is a memory constraint?

Even after looking at all this, I was still not sure. Since the assembly template is empty, no assembler instruction is executed, so the point must be in naming the variables as output and input operands. The site also says that writing volatile prevents the asm statement from being removed, moved significantly, or combined with others. It feels like something to do with gcc optimization. Gathering more information online, I found the following site.

- asmlinkage_protect macro - Linux memorandum ... (to table of contents)

It had exactly the information I was looking for, although I still do not fully understand it. For the example we have been following, I think the statement restrains the optimizer around the call to SYSC_setgid so that the contents of the argument gid can still be referenced afterward. I will look into it a little more and add what I find to this article.

I was finally able to follow the whole expansion of the macro, but it has left me with a feeling of indigestion.

Next time I plan to look at the body of fork, but it may take a while.

... After a little more research, I found the following site.

- Linux program system call made from 0, system call 2 system call in kernel

That site says:

Normally, arguments are passed to a function such as vfs_systemcall on the stack, as for any ordinary function, but in a system call they are passed in registers. Because they arrive in registers, things are arranged so that it works as if the arguments had been pushed on the stack as for an ordinary function. In a system call, however, the arguments are not really on the stack as ordinary arguments, so calling an ordinary function can result in undefined behavior. This is called the tail call problem. asmlinkage_protect prevents this problem and ensures that the arguments passed in registers reach vfs_systemcall.

It may also have been written on the earlier site, but this description is what made it click for me. If SYSC_setgid were called directly without going through SyS_setgid, the arguments would not be passed correctly. And indeed, for a system call with no arguments such as fork, asmlinkage_protect is not applied.
