"What Every Programmer Should Know About Memory", Ulrich Drepper, 2007 http://www.akkadia.org/drepper/cpumemory.pdf
** Abstract ** As CPU cores become both faster and more numerous, the limiting factor for most programs is now, and will be for some time, memory access. Hardware designers have come up with ever more sophisticated memory handling and acceleration techniques, such as CPU caches, but these cannot work optimally without some help from the programmer. Unfortunately, neither the structure nor the cost of using the memory subsystem of a computer or the caches on CPUs is well understood by most programmers. This paper explains the structure of memory subsystems in use on modern commodity hardware, illustrating why CPU caches were developed, how they work, and what programs should do to achieve optimal performance by utilizing them.
** Table of Contents ** (to level 2)

1 Introduction
2 Commodity Hardware Today
  2.1 RAM Types
  2.2 DRAM Access Technical Details
  2.3 Other Main Memory Users
3 CPU Caches
  3.1 CPU Caches in the Big Picture
  3.2 Cache Operation at High Level
  3.3 CPU Cache Implementation Details
  3.4 Instruction Cache
  3.5 Cache Miss Factors
4 Virtual Memory
  4.1 Simplest Address Translation
  4.2 Multi-Level Page Tables
  4.3 Optimizing Page Table Access
  4.4 Impact Of Virtualization
5 NUMA Support
  5.1 NUMA Hardware
  5.2 OS Support for NUMA
  5.3 Published Information
  5.4 Remote Access Costs
6 What Programmers Can Do
  6.1 Bypassing the Cache
  6.2 Cache Access
  6.3 Prefetching
  6.4 Multi-Thread Optimizations
  6.5 NUMA Programming
7 Memory Performance Tools
  7.1 Memory Operation Profiling
  7.2 Simulating CPU Caches
  7.3 Measuring Memory Usage
  7.4 Improving Branch Prediction
  7.5 Page Fault Optimization
8 Upcoming Technology
  8.1 The Problem with Atomic Operations
  8.2 Transactional Memory
  8.3 Increasing Latency
  8.4 Vector Operations
A Examples and Benchmark Programs
  A.1 Matrix Multiplication
  A.2 Debug Branch Prediction
  A.3 Measure Cache Line Sharing Overhead
B Some OProfile Tips
  B.1 Oprofile Basics
  B.2 How It Looks Like
  B.3 Starting To Profile
C Memory Types
D libNUMA Introduction
E Index
F Bibliography
G Revision History