[LINUX] vlocks for Bare-Metal Mutual Exclusion (2/2)

https://www.kernel.org/doc/html/latest/arm/vlocks.html

vlocks for Bare-Metal Mutual Exclusion

ARM implementation

The current ARM implementation [2] contains some optimisations beyond the basic algorithm:

The current ARM implementation [2] includes some optimizations on top of the basic algorithm:

・ By packing the members of the currently_voting array close together, we can read the whole array in one transaction (providing the number of CPUs potentially contending the lock is small enough). This reduces the number of round-trips required to external memory.

· By packing the members of the currently_voting array close together, the whole array can be read in one transaction (provided the number of CPUs that may contend for the lock is small enough). This reduces the number of round-trips to external memory.

In the ARM implementation, this means that we can use a single load and comparison:

In the ARM implementation, this means a single load and a single comparison are enough:

LDR     Rt, [Rn]      // Load the 32-bit word at address Rn into Rt
CMP     Rt, #0        // Check whether Rt is zero

…in place of code equivalent to:

LDRB    Rt, [Rn]      // Load the byte at address Rn into Rt
CMP     Rt, #0        // Check whether Rt is zero
LDRBEQ  Rt, [Rn, #1]  // If it was zero, load the byte at address Rn+1 into Rt
CMPEQ   Rt, #0        // ...and check whether that byte is zero
LDRBEQ  Rt, [Rn, #2]  // If it was zero, load the byte at address Rn+2 into Rt
CMPEQ   Rt, #0        // ...and check whether that byte is zero
LDRBEQ  Rt, [Rn, #3]  // If it was zero, load the byte at address Rn+3 into Rt
CMPEQ   Rt, #0        // ...and check whether that byte is zero

This cuts down on the fast-path latency, as well as potentially reducing bus contention in contended cases.

This reduces fast-path latency and can also reduce bus contention when the lock is contended.

The optimisation relies on the fact that the ARM memory system guarantees coherency between overlapping memory accesses of different sizes, similarly to many other architectures. Note that we do not care which element of currently_voting appears in which bits of Rt, so there is no need to worry about endianness in this optimisation.

The optimization relies on the fact that the ARM memory system, like many other architectures, guarantees coherency between overlapping memory accesses of different sizes. Because it does not matter which element of currently_voting ends up in which bits of Rt, endianness is not a concern for this optimization.
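As an illustration only, the single-load check can be sketched in C. The union layout and the names voting_flags and any_cpu_voting are assumptions made for this sketch and do not come from the kernel source, which implements the check in assembly.

```c
#include <stdint.h>

/* Hypothetical layout: one voting flag byte per CPU, with four CPUs
 * packed into a single naturally aligned 32-bit word. */
union voting_flags {
    uint8_t  per_cpu[4];  /* each CPU writes only its own byte */
    uint32_t all;         /* the same four bytes, read in one access */
};

/* Returns nonzero if any of the four CPUs is currently voting.
 * The word is nonzero whenever any byte is nonzero, whichever byte
 * lands in whichever bits, so the result is endianness-independent. */
static int any_cpu_voting(const volatile union voting_flags *flags)
{
    return flags->all != 0;
}
```

The sketch only shows the comparison itself. Whether mixing byte writes with a word read is safe depends on the memory-system guarantee described above.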

If there are too many CPUs to read the currently_voting array in one transaction then multiple transactions are still required. The implementation uses a simple loop of word-sized loads for this case. The number of transactions is still fewer than would be required if bytes were loaded individually.

If there are too many CPUs for the currently_voting array to be read in one transaction, multiple transactions are still required. For this case the implementation uses a simple loop of word-sized loads. This still takes fewer transactions than loading the bytes one at a time.
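The multi-word case might look roughly like the following C sketch. The helper name any_cpu_voting_wide is hypothetical, and the sketch assumes the flag array is padded with zero bytes up to a multiple of four so that it can be scanned as whole words.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: `words` points to the packed per-CPU voting
 * flags viewed as 32-bit words, and `nwords` is the number of words. */
static int any_cpu_voting_wide(const volatile uint32_t *words, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        if (words[i] != 0)
            return 1;  /* at least one CPU in this group of four is voting */
    }
    return 0;          /* every flag byte was zero */
}
```

With eight CPUs, for example, this loop performs at most two loads, where a byte-by-byte scan would need up to eight.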

In principle, we could aggregate further by using LDRD or LDM, but to keep the code simple this was not attempted in the initial implementation.

In principle, the loads could be consolidated further by using LDRD or LDM, but the initial implementation did not attempt this in order to keep the code simple.

・ Vlocks are currently only used to coordinate between CPUs which are unable to enable their caches yet. This means that the implementation removes many of the barriers which would be required when executing the algorithm in cached memory.

· Vlocks are currently used only to coordinate between CPUs that cannot yet enable their caches. As a result, the implementation omits many of the barriers that would be needed if the algorithm were run in cached memory.

Packing of the currently_voting array does not work with cached memory unless all CPUs contending the lock are cache-coherent, due to cache writebacks from one CPU clobbering values written by other CPUs. (Though if all the CPUs are cache-coherent, you should probably be using proper spinlocks instead anyway).

Packing the currently_voting array does not work with cached memory unless all the CPUs contending for the lock are cache-coherent, because a cache writeback from one CPU can overwrite values written by other CPUs. (If all the CPUs are cache-coherent, though, you should probably be using proper spinlocks anyway.)

・ The “no votes yet” value used for the last_vote variable is 0 (not -1 as in the pseudocode). This allows statically-allocated vlocks to be implicitly initialised to an unlocked state simply by putting them in .bss.

· The "no votes yet" value used for the last_vote variable is 0 (not -1 as in the pseudocode). This means a statically allocated vlock is implicitly initialized to the unlocked state simply by being placed in .bss.

An offset is added to each CPU’s ID for the purpose of setting this variable, so that no CPU uses the value 0 for its ID.

An offset is added to each CPU's ID when setting this variable, so that no CPU ever uses the value 0 as its ID.
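The zero-as-unlocked convention can be illustrated with a small C sketch. The struct layout, the names vlock_sketch, vote_value, has_no_votes and cpu_won, and the offset value of 1 are all assumptions chosen for this illustration; the text above does not state the offset the kernel actually uses.

```c
#include <stdint.h>

#define VLOCK_NO_VOTE   0u  /* matches the zero fill of .bss */
#define VLOCK_ID_OFFSET 1u  /* hypothetical offset, chosen only for this sketch */

/* Hypothetical lock layout for illustration. */
struct vlock_sketch {
    uint8_t last_vote;            /* 0 means "no votes yet" */
    uint8_t currently_voting[4];  /* one flag byte per CPU */
};

/* A statically allocated lock with no initializer is zero-filled,
 * so it starts out in the unlocked, "no votes yet" state. */
static struct vlock_sketch boot_lock;

/* Value a CPU writes to last_vote; never 0 because of the offset. */
static uint8_t vote_value(unsigned cpu_id)
{
    return (uint8_t)(cpu_id + VLOCK_ID_OFFSET);
}

static int has_no_votes(const struct vlock_sketch *lock)
{
    return lock->last_vote == VLOCK_NO_VOTE;
}

static int cpu_won(const struct vlock_sketch *lock, unsigned cpu_id)
{
    return lock->last_vote == vote_value(cpu_id);
}
```

Without the offset, a vote from CPU 0 would be indistinguishable from "no votes yet", which is the case the offset is there to avoid.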

Colophon

Originally created and documented by Dave Martin for Linaro Limited, for use in ARM-based big.LITTLE platforms, with review and input gratefully received from Nicolas Pitre and Achin Gupta. Thanks to Nicolas for grabbing most of this text out of the relevant mail thread and writing up the pseudocode.

References

[1] Lamport, L. “A New Solution of Dijkstra’s Concurrent Programming Problem”, Communications of the ACM 17, 8 (August 1974), 453-455. https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm

[2] linux/arch/arm/common/vlock.S, www.kernel.org.

This text was originally part of the Linux kernel source tree, so my understanding is that it should be treated as GPLv2.

https://www.kernel.org/doc/html/latest/index.html

Licensing documentation

The following describes the license of the Linux kernel source code (GPLv2), how to properly mark the license of individual files in the source tree, as well as links to the full license text.

https://www.kernel.org/doc/html/latest/process/license-rules.html#kernel-licensing