"What Every Programmer Should Know About Memory", Ulrich Drepper, 2007 http://www.akkadia.org/drepper/cpumemory.pdf
Abstract As CPU cores become both faster and more numerous, the limiting factor for most programs is now, and will be for some time, memory access. Hardware designers have come up with ever more sophisticated memory handling and acceleration techniques –such as CPU caches– but these cannot work optimally without some help from the programmer. Unfortunately, neither the structure nor the cost of using the memory subsystem of a computer or the caches on CPUs is well understood by most programmers. This paper explains the structure of memory subsystems in use on modern commodity hardware, illustrating why CPU caches were developed, how they work, and what programs should do to achieve optimal performance by utilizing them.
(Reference translation)
** Overview ** As CPU cores become both faster and more numerous, memory access is, and will remain for the foreseeable future, the limiting factor for most programs. Hardware designers have devised increasingly sophisticated memory handling and acceleration techniques, such as CPU caches, but these cannot work at their best without help from the programmer. Unfortunately, most programmers do not fully understand the structure of a computer's memory subsystem and CPU caches, or the cost of using them. This paper describes the structure of the memory subsystems used in today's commodity hardware, explains why CPU caches were developed and how they work, and shows what programs should do to get the best performance from them.
** Table of Contents ** (up to Level 2)
1 Introduction
2 Commodity Hardware Today
  2.1 RAM Types
  2.2 DRAM Access Technical Details
  2.3 Other Main Memory Users
3 CPU Caches
  3.1 CPU Caches in the Big Picture
  3.2 Cache Operation at High Level
  3.3 CPU Cache Implementation Details
  3.4 Instruction Cache
  3.5 Cache Miss Factors
4 Virtual Memory
  4.1 Simplest Address Translation
  4.2 Multi-Level Page Tables
  4.3 Optimizing Page Table Access
  4.4 Impact Of Virtualization
5 NUMA Support
  5.1 NUMA Hardware
  5.2 OS Support for NUMA
  5.3 Published Information
  5.4 Remote Access Costs
6 What Programmers Can Do
  6.1 Bypassing the Cache
  6.2 Cache Access
  6.3 Prefetching
  6.4 Multi-Thread Optimizations
  6.5 NUMA Programming
7 Memory Performance Tools
  7.1 Memory Operation Profiling
  7.2 Simulating CPU Caches
  7.3 Measuring Memory Usage
  7.4 Improving Branch Prediction
  7.5 Page Fault Optimization
8 Upcoming Technology
  8.1 The Problem with Atomic Operations
  8.2 Transactional Memory
  8.3 Increasing Latency
  8.4 Vector Operations
A Examples and Benchmark Programs
  A.1 Matrix Multiplication
  A.2 Debug Branch Prediction
  A.3 Measure Cache Line Sharing Overhead
B Some OProfile Tips
  B.1 Oprofile Basics
  B.2 How It Looks Like
  B.3 Starting To Profile
C Memory Types
D libNUMA Introduction
E Index
F Bibliography
G Revision History