[LINUX] Kernel Self-Protection (1/2)

This text was originally part of the Linux kernel source tree, so it is treated as GPLv2 (as I understand it should be).


Licensing documentation

The following describes the license of the Linux kernel source code (GPLv2), how to properly mark the license of individual files in the source tree, as well as links to the full license text.



Kernel Self-Protection

Kernel self-protection is the design and implementation of systems and structures within the Linux kernel to protect against security flaws in the kernel itself.

Kernel self-protection is a system and structure designed and implemented to protect against security flaws in the Linux Kernel itself.

This covers a wide range of issues, including removing entire classes of bugs, blocking security flaw exploitation methods, and actively detecting attack attempts.

It covers a wide range of issues: eliminating entire classes of bugs, blocking ways to exploit security flaws, actively detecting attack attempts, and more.

Not all topics are explored in this document, but it should serve as a reasonable starting point and answer any frequently asked questions. (Patches welcome, of course!)

This document doesn't cover every topic, but it should be a good starting point and answer some frequently asked questions (patches are welcome, of course!).


In the worst-case scenario, we assume an unprivileged local attacker has arbitrary read and write access to the kernel’s memory.

In the worst-case scenario, it is assumed that an unprivileged local attacker has arbitrary read and write access to kernel memory.

In many cases, bugs being exploited will not provide this level of access, but with systems in place that defend against the worst case we’ll cover the more limited cases as well.

In many cases, the bug being exploited does not provide this level of access. However, a system that defends against the worst case also covers the more limited cases.

A higher bar, and one that should still be kept in mind, is protecting the kernel against a privileged local attacker, since the root user has access to a vastly increased attack surface.

A higher bar, which should still be kept in mind, is protecting the kernel from a "privileged" local attacker, because the root user has access to a vastly larger attack surface.

(Especially when they have the ability to load arbitrary kernel modules.)

(Especially if they can load arbitrary kernel modules.)

The goals for successful self-protection systems would be that they are effective, on by default, require no opt-in by developers, have no performance impact, do not impede kernel debugging, and have tests.

The goals of a successful self-protection system are that it is effective, enabled by default, requires no opt-in by developers, has no performance impact, does not get in the way of kernel debugging, and has tests.

It is uncommon that all these goals can be met, but it is worth explicitly mentioning them, since these aspects need to be explored, dealt with, and/or accepted.

It's not common to meet all of these goals, but they are worth stating explicitly, because each of these aspects has to be explored, dealt with, or accepted.

Attack Surface Reduction

The most fundamental defense against security exploits is to reduce the areas of the kernel that can be used to redirect execution.

The most basic protection against security exploits is to reduce the areas of the kernel that can be used to redirect execution.

This ranges from limiting the exposed APIs available to userspace, making in-kernel APIs hard to use incorrectly, minimizing the areas of writable kernel memory, etc.

This includes limiting the APIs exposed to userspace, making in-kernel APIs hard to use incorrectly, minimizing the areas of writable kernel memory, and more.

Strict kernel memory permissions

When all of kernel memory is writable, it becomes trivial for attacks to redirect execution flow.

If all of kernel memory is writable, it is trivial for an attacker to redirect the execution flow.

To reduce the availability of these targets the kernel needs to protect its memory with a tight set of permissions.

To reduce the number of such targets, the kernel must protect its memory with strict permissions.

Executable code and read-only data must not be writable

Any areas of the kernel with executable memory must not be writable.

Areas of the kernel with executable memory must not be writable.

While this obviously includes the kernel text itself, we must consider all additional places too: kernel modules, JIT memory, etc.

This obviously includes the kernel text itself, but all the additional places have to be considered too: kernel modules, JIT memory, etc.

(There are temporary exceptions to this rule to support things like instruction alternatives, breakpoints, kprobes, etc.

(This rule has temporary exceptions, to support things such as instruction alternatives, breakpoints, and kprobes.

If these must exist in a kernel, they are implemented in a way where the memory is temporarily made writable during the update, and then returned to the original permissions.)

If the kernel needs these, they are implemented so that the memory is made writable only during the update and then returned to its original permissions.)

In support of this are CONFIG_STRICT_KERNEL_RWX and CONFIG_STRICT_MODULE_RWX, which seek to make sure that code is not writable, data is not executable, and read-only data is neither writable nor executable.

To support this, there are CONFIG_STRICT_KERNEL_RWX and CONFIG_STRICT_MODULE_RWX. These make sure that code is not writable, data is not executable, and read-only data is neither writable nor executable.


Most architectures have these options on by default and not user selectable.

On many architectures these options are on by default and are not user-selectable.

For some architectures like arm that wish to have these be selectable, the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable a Kconfig prompt.

Some architectures, such as arm, want these options to be selectable. For those, the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable a Kconfig prompt.

CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT determines the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled.


Function pointers and sensitive variables must not be writable

Vast areas of kernel memory contain function pointers that are looked up by the kernel and used to continue execution

Large areas of kernel memory contain function pointers that the kernel looks up and uses to continue execution.

(e.g. descriptor/vector tables, file/network/etc operation structures, etc).

(For example, descriptor / vector tables, file / network / etc operation structures, etc.)

The number of these variables must be reduced to an absolute minimum.

The number of these variables must be reduced to the absolute minimum.


Many such variables can be made read-only by setting them “const” so that they live in the .rodata section instead of the .data section of the kernel, gaining the protection of the kernel’s strict memory permissions as described above.

Many of these variables can be made read-only by declaring them const. This places them in the .rodata section instead of the kernel's .data section, where they are protected by the kernel's strict memory permissions described above.
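
As a rough userspace illustration of the const technique described above, the sketch below declares a table of function pointers as const so that the toolchain can place it in read-only data. The structure and function names (demo_ops, demo_open, and so on) are hypothetical and only loosely modeled on kernel operation structures.

```c
#include <assert.h>

/* Hypothetical operations table, loosely modeled on kernel structures
 * that hold function pointers. All names here are illustrative. */
struct demo_ops {
    int (*open)(int id);
    int (*close)(int id);
};

static int demo_open(int id)  { return id + 1; }
static int demo_close(int id) { return id - 1; }

/* "const" lets the toolchain place the table in .rodata, so the
 * function pointers cannot be overwritten once that section is
 * mapped read-only. */
static const struct demo_ops demo_fops = {
    .open  = demo_open,
    .close = demo_close,
};

/* Callers only receive a pointer-to-const, so an assignment such as
 * ops->open = other_function is rejected at compile time. */
const struct demo_ops *demo_get_ops(void)
{
    return &demo_fops;
}

int demo_call_open(const struct demo_ops *ops, int id)
{
    assert(ops && ops->open);
    return ops->open(id);
}
```

The compile-time rejection only covers well-behaved code; the protection against an attacker with a write primitive comes from the read-only mapping of the section.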


For variables that are initialized once at __init time, these can be marked with the (new and under development) __ro_after_init attribute.

Variables that are initialized once at __init time can be marked with the (new and still under development) __ro_after_init attribute.
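
The "write once during init, read-only afterwards" idea can be approximated in userspace with mmap and mprotect. This is only an analogy and not how __ro_after_init is implemented in the kernel; the demo_settings structure and both functions are hypothetical names made up for this sketch.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical settings block that is written once during start-up. */
struct demo_settings {
    int max_users;
    int debug_level;
};

static struct demo_settings *settings;

/* Allocate a private page, fill it in, then drop write permission.
 * Returns 0 on success and -1 on failure. Meant to be called once. */
int demo_settings_init(int max_users, int debug_level)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p;

    if (page <= 0)
        return -1;
    assert(sizeof(struct demo_settings) <= (size_t)page);

    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    settings = p;
    settings->max_users = max_users;
    settings->debug_level = debug_level;

    /* From here on, any store to *settings faults. */
    if (mprotect(p, (size_t)page, PROT_READ) != 0) {
        munmap(p, (size_t)page);
        settings = NULL;
        return -1;
    }
    return 0;
}

/* Readers get a pointer-to-const view of the sealed page. */
const struct demo_settings *demo_settings_get(void)
{
    return settings;
}
```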


What remains are variables that are updated rarely (e.g. GDT).

All that remains are variables that are rarely updated (for example, GDT).

These will need another infrastructure (similar to the temporary exceptions made to kernel code mentioned above) that allow them to spend the rest of their lifetime read-only.

These need a separate mechanism (similar to the temporary exceptions for kernel code mentioned above) that lets them stay read-only for the rest of their lifetime.

(For example, when being updated, only the CPU thread performing the update would be given uninterruptible write access to the memory.)

(For example, during an update, only the CPU thread performing the update would be given uninterruptible write access to the memory.)

Segregation of kernel memory from userspace memory

The kernel must never execute userspace memory.

The kernel must never execute userspace memory.

The kernel must also never access userspace memory without explicit expectation to do so.

The kernel must also not access user space memory without explicit expectation.

These rules can be enforced either by support of hardware-based restrictions (x86’s SMEP/SMAP, ARM’s PXN/PAN) or via emulation (ARM’s Memory Domains).

These rules can be enforced either by hardware-based restrictions (SMEP/SMAP on x86, PXN/PAN on ARM) or by emulation (Memory Domains on ARM).

By blocking userspace memory in this way, execution and data parsing cannot be passed to trivially-controlled userspace memory, forcing attacks to operate entirely in kernel memory.

Blocking userspace memory in this way means that execution and data parsing cannot be handed to easily controlled userspace memory, which forces attacks to operate entirely within kernel memory.

Reduced access to syscalls

One trivial way to eliminate many syscalls for 64-bit systems is building without CONFIG_COMPAT. However, this is rarely a feasible scenario.

One easy way to eliminate many syscalls on a 64-bit system is to build without CONFIG_COMPAT. However, this is rarely a feasible option.


The “seccomp” system provides an opt-in feature made available to userspace, which provides a way to reduce the number of kernel entry points available to a running process.

The "seccomp" system provides an opt-in feature that is available to userspace. It gives a running process a way to reduce the number of kernel entry points available to it.

This limits the breadth of kernel code that can be reached, possibly reducing the availability of a given bug to an attack.

This limits the range of kernel code that can be reached, and may make a given bug unavailable to an attack.
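
To make the opt-in nature of seccomp concrete, here is a minimal sketch that uses the original strict mode (SECCOMP_MODE_STRICT) rather than the more flexible filter mode. It forks a child, has the child opt in, and then has the child attempt a syscall outside the allowed set. The function name and the exit codes are hypothetical, and the sketch assumes a Linux system whose sandbox, if any, permits prctl(PR_SET_SECCOMP).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Exit codes used by the child; hypothetical values for this sketch. */
enum { CHILD_NO_SECCOMP = 2, CHILD_NOT_KILLED = 3 };

/*
 * Returns 1 if the child was killed for making a disallowed syscall,
 * 0 if it survived, and -1 if seccomp could not be set up.
 */
int demo_seccomp_strict_blocks_getpid(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(CHILD_NO_SECCOMP);

        /* Strict mode allows only read, write, _exit and sigreturn,
         * so this raw getpid call is expected to be fatal. */
        syscall(SYS_getpid);

        /* Only reached if the restriction did not fire. Use the raw
         * exit syscall because exit_group is not on the allowed list. */
        for (;;)
            syscall(SYS_exit, CHILD_NOT_KILLED);
    }

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
        return 1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == CHILD_NO_SECCOMP)
        return -1;
    return 0;
}
```

Real applications normally use filter mode with a BPF program so that they can allow exactly the syscalls they need; strict mode is used here only because it fits in a few lines.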


An area of improvement would be creating viable ways to keep access to things like compat, user namespaces, BPF creation, and perf limited only to trusted processes.

One area for improvement would be creating viable ways to limit access to things like compat, user namespaces, BPF creation, and perf to trusted processes only.

This would keep the scope of kernel entry points restricted to the more regular set of normally available to unprivileged userspace.

This would keep the scope of kernel entry points restricted to the standard set normally available to unprivileged userspace.

Restricting access to kernel modules

The kernel should never allow an unprivileged user the ability to load specific kernel modules, since that would provide a facility to unexpectedly extend the available attack surface.

The kernel must never allow unprivileged users to load specific kernel modules, because that would give them a way to unexpectedly extend the available attack surface.

(The on-demand loading of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is considered “expected” here, though additional consideration should be given even to these.)

(On-demand loading of modules through their predefined subsystems, e.g. MODULE_ALIAS_*, counts as "expected" here, although even these deserve additional consideration.)

For example, loading a filesystem module via an unprivileged socket API is nonsense: only the root or physically local user should trigger filesystem module loading.

For example, it makes no sense to load a filesystem module through an unprivileged socket API. Only root or a physically local user should be able to trigger filesystem module loading.

(And even this can be up for debate in some scenarios.)

(And even this is open to debate in some scenarios.)


To protect against even privileged users, systems may need to either disable module loading entirely (e.g. monolithic kernel builds or modules_disabled sysctl), or provide signed modules (e.g. CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin), to keep from having root load arbitrary kernel code via the module loader interface.

To protect against privileged users as well, a system may need to disable module loading entirely (e.g. a monolithic kernel build or the modules_disabled sysctl) or use signed modules (e.g. CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin). This keeps root from loading arbitrary kernel code through the module loader interface.

Memory integrity

There are many memory structures in the kernel that are regularly abused to gain execution control during an attack.

The kernel contains many memory structures that are regularly abused to gain execution control during an attack.

By far the most commonly understood is that of the stack buffer overflow in which the return address stored on the stack is overwritten.

By far the best understood is the stack buffer overflow, which overwrites the return address stored on the stack.

Many other examples of this kind of attack exist, and protections exist to defend against them.

There are many other examples of this type of attack, and there are protections to defend against them.

Stack buffer overflow

The classic stack buffer overflow involves writing past the expected end of a variable stored on the stack, ultimately writing a controlled value to the stack frame’s stored return address.

The classic stack buffer overflow writes past the expected end of a variable held on the stack and ultimately writes a controlled value to the return address stored in the stack frame.

The most widely used defense is the presence of a stack canary between the stack variables and the return address (CONFIG_STACKPROTECTOR), which is verified just before the function returns.

The most widely used defense is to place a stack canary between the stack variables and the return address (CONFIG_STACKPROTECTOR). The canary is verified just before the function returns.
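
The canary check can be modeled in plain C without relying on real stack layout. In the sketch below a structure stands in for a stack frame: a buffer, then a canary, then a stand-in for the saved return address. The fixed canary value and all names are hypothetical; a real stack protector uses a random value and compiler-generated checks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed value; real canaries are chosen at random. */
#define DEMO_CANARY 0xC0FFEE11UL

/* A stand-in for a stack frame, laid out lowest address first. */
struct demo_frame {
    char buf[16];
    unsigned long canary;   /* sits between buf and the return address */
    unsigned long ret_addr; /* stand-in for the saved return address */
};

/*
 * Writes len bytes of 'A' starting at the buffer without checking len
 * against the buffer size (the simulated bug), then checks the canary
 * the way a function epilogue would.
 * Returns 1 if the canary is intact and 0 if it was overwritten.
 */
int demo_canary_intact_after_write(size_t len)
{
    struct demo_frame f;

    memset(&f, 0, sizeof(f));
    f.canary = DEMO_CANARY;

    /* Keep the demonstration itself inside the frame object. */
    if (len > sizeof(f))
        len = sizeof(f);

    /* The "overflow": nothing stops len from exceeding sizeof(f.buf). */
    memset((char *)&f, 'A', len);

    return f.canary == DEMO_CANARY;
}
```

A write that stays within the 16-byte buffer leaves the canary untouched, while any longer write has to pass through the canary before it can reach the stand-in return address.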

Other defenses include things like shadow stacks.

Other safeguards include shadow stacks.

Stack depth overflow

A less well understood attack is using a bug that triggers the kernel to consume stack memory with deep function calls or large stack allocations.

A lesser-known attack is to use a bug that triggers the kernel to consume stack memory by deep function calls or huge stack allocation.

With this attack it is possible to write beyond the end of the kernel’s preallocated stack space and into sensitive structures.

This attack makes it possible to write past the end of the kernel's preallocated stack space and into sensitive structures.

Two important changes need to be made for better protections: moving the sensitive thread_info structure elsewhere, and adding a faulting memory hole at the bottom of the stack to catch these overflows.

Two changes are needed for better protection: moving the sensitive thread_info structure elsewhere, and adding a faulting memory hole at the bottom of the stack to catch these overflows.

Heap memory integrity

The structures used to track heap free lists can be sanity-checked during allocation and freeing to make sure they aren’t being used to manipulate other memory areas.

The structures that track heap free lists can be sanity-checked during allocation and freeing to verify that they are not being used to manipulate other memory areas.
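
One common form of such a sanity check is to confirm, before unlinking a list entry, that its neighbours still point back at it. The sketch below shows the idea on a generic doubly linked list; it is not the kernel's allocator code, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list node; names are illustrative. */
struct demo_node {
    struct demo_node *next, *prev;
};

void demo_list_init(struct demo_node *head)
{
    head->next = head->prev = head;
}

void demo_list_add(struct demo_node *n, struct demo_node *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unlink entry only if both neighbours still point back at it.
 * A forged next or prev pointer fails this test, so the unlink never
 * performs its two pointer writes into attacker-chosen memory. */
bool demo_list_del_checked(struct demo_node *entry)
{
    if (entry->next->prev != entry || entry->prev->next != entry)
        return false;
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = entry->prev = entry;
    return true;
}

/* Demonstration: an intact entry is unlinked normally. */
bool demo_unlinks_intact_entry(void)
{
    struct demo_node head, a;

    demo_list_init(&head);
    demo_list_add(&a, &head);
    return demo_list_del_checked(&a) && head.next == &head;
}

/* Demonstration: a corrupted forward pointer is rejected. */
bool demo_detects_forged_link(void)
{
    struct demo_node head, a, victim;

    demo_list_init(&head);
    demo_list_init(&victim);
    demo_list_add(&a, &head);
    a.next = &victim; /* simulated attacker-controlled overwrite */
    return !demo_list_del_checked(&a);
}
```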

Counter integrity

Many places in the kernel use atomic counters to track object references or perform similar lifetime management.

In many parts of the kernel, atomic counters are used to track object references and perform lifetime management.

When these counters can be made to wrap (over or under) this traditionally exposes a use-after-free flaw.

When such a counter can be made to wrap (by overflow or underflow), the result has traditionally been a use-after-free flaw.

By trapping atomic wrapping, this class of bug vanishes.

By trapping atomic wrapping, this kind of bug disappears.
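
A minimal sketch of the trapping idea, assuming C11 atomics: the counter refuses to move once it reaches zero or a saturation value, so it can neither wrap around nor be revived after the last reference is dropped. The demo_ref type and its functions are hypothetical and are not the kernel's own reference counting API.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical saturation value: once reached, the count is pinned. */
#define DEMO_REF_SATURATED UINT_MAX

struct demo_ref {
    atomic_uint count;
};

void demo_ref_init(struct demo_ref *r, unsigned int n)
{
    atomic_init(&r->count, n);
}

unsigned int demo_ref_read(struct demo_ref *r)
{
    return atomic_load(&r->count);
}

/* Take a reference. Fails if the object is already dead (count 0)
 * or if the counter is saturated and one more step would wrap. */
bool demo_ref_get(struct demo_ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0 || old == DEMO_REF_SATURATED)
            return false;
    } while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));
    return true;
}

/* Drop a reference. Returns true only for the caller that dropped
 * the last one. Underflow attempts and saturated counters are left
 * untouched, which leaks the object instead of freeing it early. */
bool demo_ref_put(struct demo_ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0 || old == DEMO_REF_SATURATED)
            return false;
    } while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));
    return old == 1;
}

/* Demonstration: the counter pins at the saturation value. */
bool demo_ref_wrap_is_trapped(void)
{
    struct demo_ref r;

    demo_ref_init(&r, UINT_MAX - 1);
    return demo_ref_get(&r) && !demo_ref_get(&r) &&
           !demo_ref_put(&r) && demo_ref_read(&r) == DEMO_REF_SATURATED;
}

/* Demonstration: a second put cannot take the counter below zero. */
bool demo_ref_underflow_is_trapped(void)
{
    struct demo_ref r;

    demo_ref_init(&r, 1);
    return demo_ref_put(&r) && !demo_ref_put(&r) &&
           !demo_ref_get(&r) && demo_ref_read(&r) == 0;
}
```

Trading a possible memory leak for the use-after-free is the design choice here: a pinned object is never freed, which is wasteful but not exploitable in the same way.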

Size calculation overflow detection

Similar to counter overflow, integer overflows (usually size calculations) need to be detected at runtime to kill this class of bug, which traditionally leads to being able to write past the end of kernel buffers.

As with counter overflow, integer overflows (usually in size calculations) need to be detected at runtime to eliminate this class of bug. Traditionally, such overflows made it possible to write past the end of kernel buffers.
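
Runtime detection of a size calculation overflow can be sketched with the __builtin_mul_overflow builtin provided by GCC and Clang. The helper names below are hypothetical, and returning SIZE_MAX on overflow is a convention chosen for this sketch so that a later allocation of that size fails.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns n * size, or SIZE_MAX if the multiplication would overflow.
 * Relies on a GCC/Clang builtin that reports overflow instead of
 * silently wrapping. */
size_t demo_array_bytes(size_t n, size_t size)
{
    size_t bytes;

    if (__builtin_mul_overflow(n, size, &bytes))
        return SIZE_MAX;
    return bytes;
}

/* Allocates room for n elements of the given size. Returns NULL when
 * the size calculation overflows, instead of allocating a short
 * buffer that later writes would run past. */
void *demo_alloc_array(size_t n, size_t size)
{
    size_t bytes = demo_array_bytes(n, size);

    if (bytes == SIZE_MAX)
        return NULL;
    return malloc(bytes);
}
```

Without the check, a request such as SIZE_MAX / 2 + 1 elements of 2 bytes would wrap to a tiny allocation while the caller still believes it owns the full array.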

Continue to the second half ...

This post was getting long, so the rest is in the second half.
