C++ Go Fast

Based on Ulrich Drepper's What Every Programmer Should Know About Memory

What Programmers Can Do

Bypassing the Cache (p. 47)

  • non-temporal access (NTA) operations -> data will not be reused soon
  • less cache pollution
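
A minimal sketch of this with the SSE2 non-temporal store intrinsics (the function name and loop are illustrative, not from the source):

    #include <emmintrin.h>
    #include <cstddef>

    // Fill a buffer with non-temporal stores: the written data bypasses
    // the cache because it will not be reused soon -> less cache pollution.
    void fill_nta(int* p, int value, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            _mm_stream_si32(p + i, value);
        _mm_sfence(); // make the streamed stores globally visible
    }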

Cache Access

L1d (p. 49)

  • better locality:
    • spatial:
      • use full cache line more effectively -> for loops
      • when the binary is generic -> assume the biggest expected cache line size
      • put often-used elements together + keep data that is not used together apart -> less cache pollution
      • variables used together -> store together -> fewer conflict misses
    • temporal:
      • keep often used data together
      • hoist common (loop-invariant) expressions out of inner loops as far out as possible
    • restrict pointer qualifier -> promises that this pointer doesn't have any aliases (see the sketch after this list)
  • align code and data:
    • struct layout:
      • pahole program -> make structs align to cache lines
      • move struct elements most likely to be the critical word to beginning
      • access elements of struct in order of definition -> arrange elements in most likely accessed order
      • for bigger structs -> apply rules for each cache line-sized block
    • struct alignment:
      • each data type has own alignment requirement
      • structs use largest alignment of elements -> allocated object might not be aligned to cache line
      • can be changed with a variable attribute for objects allocated by the compiler: struct strtype variable __attribute__((aligned(64)));
      • or for the whole type: struct strtype {...} __attribute__((aligned(64)));
      • posix_memalign and alignof for dynamically allocated objects (see the sketch after this list)
      • might lead to fragmentation
    • alignment requirements:
      • variable length arrays (VLAs) -> alignment must be controlled actively at runtime -> slower
      • the requirement can be relaxed for leaf functions (functions that call no other functions) using no multimedia operations, and for functions that only call such functions
      • -mpreferred-stack-boundary=N -> stack aligned to 2^N bytes; default is N=4, minimum is N=2
      • can reduce code size and improve execution speed
  • SIMD
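
A sketch of the restrict and alignment points above, assuming 64-byte cache lines (all names are illustrative):

    #include <cstdlib>

    // __restrict__ promises the compiler that dst and src never alias,
    // so loaded values can stay in registers across the stores.
    void scale(float* __restrict__ dst, const float* __restrict__ src, int n) {
        for (int i = 0; i < n; ++i)
            dst[i] = 2.0f * src[i];
    }

    // Most likely critical words first, rarely used data afterwards; the
    // type attribute aligns compiler-allocated objects to a cache line.
    struct strtype {
        long hot_counter; // accessed on every use -> beginning of struct
        long hot_flags;
        char cold[48];    // rarely used data fills the rest of the line
    } __attribute__((aligned(64)));

    // Dynamically allocated objects need posix_memalign for the same guarantee.
    strtype* alloc_aligned() {
        void* mem = nullptr;
        if (posix_memalign(&mem, alignof(strtype), sizeof(strtype)) != 0)
            return nullptr;
        return static_cast<strtype*>(mem); // release with free()
    }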

L1i (p. 55)

  • reduce code footprint:
    • -Os good when loop unrolling and inlining not effective
    • -finline-limit -> adjust when a function is considered too large for inlining
    • noinline function attribute -> when a function is called often from different locations -> branch prediction may be better because the branch prediction unit has already seen the code
  • linear without bubbles:
    • loop unrolling, inlining -> -O2/-O3
    • always inline function only called once -> always_inline function attribute
  • branch prediction:
    • jumps bad; instructions cached in decoded form
    • jump target might not be static; even if static, memory fetch might miss all caches
    • lopsided conditional execution -> false static branch prediction -> bubbles in pipeline
    • profile-guided optimization
    • #define unlikely(expr) __builtin_expect(!!(expr), 0) and #define likely(expr) __builtin_expect(!!(expr), 1) (see the sketch after this list)
    • -freorder-blocks (enabled by -O2 and -O3 but not by -Os) -> doesn't work with exception handling
    • compact loops have advantages
  • align code when sensible:
    • instruction at beginning of cache line good
    • alignment good when:
      • at beginning of functions
      • blocks only reachable by jumps
      • sometimes at beginning of loops -> requires gap -> filled with no-op instructions or unconditional jump
      • -falign-functions=N, -falign-jumps=N, -falign-loops=N -> align all functions, jumps, loops to the next power-of-two boundary greater than N -> gap of up to N bytes
      • N=1 -> disable alignment or -fno-align-functions
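
An illustration of the likely/unlikely macros and the inlining attributes from this section (function names are made up):

    #include <cstdio>

    #define unlikely(expr) __builtin_expect(!!(expr), 0)
    #define likely(expr)   __builtin_expect(!!(expr), 1)

    // Called from many different locations: keep it out of line so one
    // shared copy stays in L1i and in the branch prediction unit.
    __attribute__((noinline)) static void log_error(const char* msg) {
        std::fprintf(stderr, "error: %s\n", msg);
    }

    // Called exactly once: force inlining to remove the call overhead.
    __attribute__((always_inline)) static inline int setup() { return 42; }

    int process(int value) {
        static const int base = setup();
        if (unlikely(value < 0)) { // error path: statically predicted not taken
            log_error("negative value");
            return -1;
        }
        return base + value;
    }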

L2+

  • match the working set of data used more than once to the cache size -> work on bigger problems in smaller pieces
  • hardcode cache line size
  • higher level cache sizes can vary widely -> code must adjust dynamically
  • instructions and libraries use higher level cache too
  • /sys/devices/system/cpu/cpu*/cache for safe lower limit (p. 59)
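
For a runtime query, glibc also exposes these values through sysconf (a glibc-specific sketch; the sysfs files above are the documented source):

    #include <unistd.h>
    #include <cstdio>

    int main() {
        // glibc-specific _SC_* names; values match the sysfs files above
        long l2_size = sysconf(_SC_LEVEL2_CACHE_SIZE);
        long l1_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        std::printf("L2: %ld bytes, L1d line: %ld bytes\n", l2_size, l1_line);
        return 0;
    }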

TLB

  • page optimization for both page faults and TLB misses
  • widely varying locations in the address space -> more page table directories needed
  • Address Space Layout Randomization (ASLR) therefore costs performance

Prefetching

Hardware Prefetching (p. 60)

  • prefetching starts only after at least two cache misses
  • can't cross page boundaries
  • keep unrelated data out -> less cache pollution
  • sequential access -> hardware prefetching

Software Prefetching (p. 61)

  • _mm_prefetch intrinsic on pointer -> prefetch
  • -fprefetch-loop-arrays -> prefetch if sensible -> bad for small arrays
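
A minimal sketch of manual prefetching with _mm_prefetch (the distance of 16 elements is an assumption that needs tuning per CPU):

    #include <xmmintrin.h>
    #include <cstddef>

    // Sum an array while prefetching 16 elements ahead with the NTA hint.
    long sum(const int* a, std::size_t n) {
        long s = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + 16 < n)
                _mm_prefetch(reinterpret_cast<const char*>(a + i + 16),
                             _MM_HINT_NTA);
            s += a[i];
        }
        return s;
    }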

Speculation (p. 62)

Helper Threads (p. 63)

Direct Cache Access (p. 64)

Multi-Thread

Concurrency (p. 65)

  • goal: fewer RFO messages
  • grouping:
    • group "constant" variables (change very rarely) -> always mark with const if possible
    • group variables used together by the same thread -> int bar __attribute__((section(".data.ro"))) = 2; -> the ".data." prefix guarantees placement together with other data sections; the rest is arbitrary
    • separate read-only, read-write (and read-mostly)
    • group read-write in struct -> only way of ensuring close memory locations
    • read-write variables often written by different threads -> give each its own cache line -> padding (see the sketch after this list)
  • TLS:
    • thread-local storage (TLS): __thread int bar = 2; -> stored separately for each thread -> costs time when a thread is created
    • thread-local variables are addressable by other threads but unless pointer passed, no way to find
    • handing the data over to another thread is more expensive -> a copy is required
    • when only one thread uses the variable -> wasted resources
    • good when multiple threads use the variable independently
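
An illustrative sketch of the padding and TLS points (64-byte cache lines are an assumption):

    #include <atomic>

    // Give each frequently written counter its own (assumed 64-byte)
    // cache line so threads don't fight over it -> fewer RFO messages.
    struct alignas(64) PaddedCounter {
        std::atomic<long> value{0};
    };
    PaddedCounter per_thread_hits[8]; // one slot per thread, no false sharing

    // C++11 equivalent of GCC's __thread: one instance per thread.
    thread_local long tls_counter = 0;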

Atomicity (p. 68)

  • Bit Test
  • Load Lock / Store Conditional (LL/SC)
  • Compare-and-Swap (CAS) -> expensive
  • Atomic Arithmetic -> still expensive (~200 cycles)
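
A short C++ illustration of the last two operations using std::atomic:

    #include <atomic>

    std::atomic<int> counter{0};

    // Atomic arithmetic: a single locked add (still expensive).
    void add_one() { counter.fetch_add(1, std::memory_order_relaxed); }

    // Compare-and-swap: write `desired` only if counter is still 0.
    bool set_if_zero(int desired) {
        int expected = 0;
        return counter.compare_exchange_strong(expected, desired);
    }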

Bandwidth (p. 70)

  • place threads better -> thread affinity: assign a thread to one or more cores
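
A minimal Linux sketch of setting thread affinity (pthread_setaffinity_np is a non-portable GNU extension; g++ defines _GNU_SOURCE by default):

    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to a single core.
    bool pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
    }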

NUMA (p. 72)

Memory Performance Tools

OProfile (p. 78)

  • stochastic, system-wide profiling -> not all events recorded accurately
  • relate multiple counters to each other
  • the divisor is a measure of processing time: the number of clock cycles or the number of instructions
  • CPI (cycles per instruction):
    • significantly below 1.0 -> not limited by memory bandwidth
    • around 3.0 -> L1d not large enough
    • around 20.0 -> L2 not sufficient
    • average over all instructions
  • number of retired instructions used -> for cache miss rate load and store instructions have to be subtracted
  • inclusive caches -> L1d miss implies L2 miss as well
  • check if prefetch instructions are issued too late (late NTA prefetches) or not used (1 - (useful NTA prefetches + late NTA prefetches)) -> wasted decoding of prefetch instructions

Opannotate (p. 80)

  • every event recorded with instruction pointer -> pinpoint hot spots

\time (p. 80)

  • list minor and major page faults
  • #include <sys/resource.h>; int getrusage(int who, struct rusage* usage); who can be RUSAGE_SELF or RUSAGE_CHILDREN -> can be called multiple times -> determine where page faults happen
  • can also be found in /proc/<PID>/stat
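
A small usage sketch printing the fault counters:

    #include <sys/resource.h>
    #include <cstdio>

    int main() {
        struct rusage usage;
        if (getrusage(RUSAGE_SELF, &usage) != 0)
            return 1;
        // minor faults: page present, no I/O; major faults: required I/O
        std::printf("minor: %ld, major: %ld\n",
                    usage.ru_minflt, usage.ru_majflt);
        return 0;
    }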

Valgrind

Cachegrind (p. 81)

  • simulates different caches
  • valgrind --tool=cachegrind command arg -> uses the sizes and associativity of the processor it is running on
  • valgrind --tool=cachegrind --L2=8388608,8,64 command arg -> simulate an 8 MB L2 cache with 8-way set associativity and 64 byte cache lines
  • cachegrind.out.XXXXX with PID can be viewed with cg_annotate:
    • cache line use in each function
    • debug information required
    • when source file given -> annotates each line
  • number of cache misses lower than in reality

Massif (p. 82)

  • valgrind --tool=massif command arg
  • creates massif.XXXXX.txt and massif.XXXXX.ps
  • --stacks=no -> disable stack monitoring -> useful when monitoring the stack is not possible or not needed
  • --alloc-fn=xmalloc -> treat a custom allocator (here xmalloc) as an allocation function

Memusage (p. 83)

  • memusage command arg
  • total memory use for heap and optionally stack
  • -p IMGFILE -> create graph
  • faster than valgrind
  • -n NAME for selecting specific child program
  • dynamic memory allocation -> linear allocation and compactness
  • obstacks can be used when many allocations from same location -> large number of relatively small allocations
  • a gentle slope in the graph over the number of calls -> many small allocations

Profile Guided Optimization (p. 85)

  • test static assumptions
  • with gcc:
    1. build with -fprofile-generate
    2. run a representative set of workloads; the output can be read with gcov
    3. build again with -fprofile-use
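
A minimal sketch of the three steps for a single-file program (file names are made up):

    g++ -O2 -fprofile-generate main.cpp -o app
    ./app typical_input.txt        # writes *.gcda profile data
    g++ -O2 -fprofile-use main.cpp -o app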

Pagein (p. 86)

  • reduce total number of used pages

  • valgrind --tool=pagein command arg

  • valgrind produces some artifacts

  • add MAP_POPULATE to fourth parameter of mmap -> pre-fault all pages in mapped area

  • might clog up the system when only used much later

  • hugetlbfs for huge pages (like 2MB instead of 64KB)
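
A sketch of pre-faulting a file mapping with MAP_POPULATE (a Linux-specific flag; the helper name is made up):

    #include <sys/mman.h>
    #include <cstddef>

    // Map `len` bytes of the file behind `fd` read-only and pre-fault
    // every page so no page faults occur on first access.
    // Returns MAP_FAILED on error.
    void* map_prefaulted(int fd, std::size_t len) {
        return mmap(nullptr, len, PROT_READ,
                    MAP_PRIVATE | MAP_POPULATE, // flags: the 4th parameter
                    fd, 0);
    }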

Other Tools

  • Perf
  • Pahole
  • pfmon
  • oprofile -> system-wide

Chapter 6.5 NUMA Programming is missing.

Requirements

  • Python
    • NumPy
    • pandas
    • Matplotlib
  • C/C++
    • GCC/G++
  • OS
    • Linux

Plotting

make plot_all