William Cohen of Red Hat
Maximize ROI
- Avoid guesses without data
- Measure
  - Representative benchmarks
  - Informative metrics
- Prioritize
  - Address issues with biggest impact
Typical tuning process
Pseudo code:
void optimize(int desired_perf) {
    setup_base_line_system();                      /* establish a measured baseline */
    while (get_measurements() < desired_perf) {
        modify_system();                           /* apply one change, then re-measure */
    }
}
Dragonboard 410c
- OS: Linux, Android, Windows 10
- Quad-core ARM Cortex-A53
- 1GB DRAM
- 8GB eMMC
- Micro SD card slot
- 2 USB 2.0 connectors
- Integrated WiFi and Bluetooth
- HDMI output
Linux perf tool
- User-space tool in Linux kernel sources
- Fedora perf or Debian linux-tools packages
- Available on a wide variety of architectures
  - x86_64, arm, powerpc, mips, aarch64, ...
- Access to performance monitoring events
  - Clock cycles
  - Instructions executed
  - Branches
  - Various types of cache accesses and misses
Perf commands
- Overall counts: perf stat command
  - Display stats for the named command (see the example session after this list)
- Get sampling data: perf record command
  - Writes to the perf.data file
  - perf record --call-graph fp gzip systemtap.log
  - fp is frame pointer (architecture dependent)
- Report on recorded data: perf report
- Top-like data: perf top
- Available events: perf list
  - Select particular events with -e cycles,instructions
  - Raw, hardware-specific events are listed in the hardware manuals; look for the chapter on performance monitoring
  - -e r16,r17,r18
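Putting the commands together, a typical session might look like the following (gzip compressing systemtap.log reuses the example workload above as a stand-in; the exact event names available depend on the CPU):

    perf stat -e cycles,instructions,branches,branch-misses gzip systemtap.log
    perf record --call-graph fp gzip systemtap.log
    perf report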
Mapping data back to source
Need debug info: compile with -g
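For example, building with debug info, and keeping frame pointers so that --call-graph fp works, might look like this (app.c and app are placeholder names):

    gcc -g -O2 -fno-omit-frame-pointer -o app app.c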
Programmer's simple processor model
- Finish each instruction before starting next
- No shortcuts
- Actually get data from memory
Actual processor implementation
- Pipelining
- Branch prediction
lstopo
Display topology of system, cpu cores, caches.
Cache
Accessing main memory takes hundreds of cycles; a cache hit takes only a few.
Issues and possible fixes
- Capacity misses
  - Section loops to work on smaller sections of data (see the blocking sketch after this list)
- Conflict misses
  - Change data layout so data ends up in different sets
- False sharing
  - Lay out data so items are in separate cache lines (see the padding sketch after this list)
- Flushes and invalidations
  - Avoid intermixing data and code; minimize use of mmap
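A minimal C sketch of the loop-sectioning (blocking) fix for capacity misses; N and BLOCK are made-up values and would need to be tuned so one block of data actually fits in the cache:

    #define N     1024
    #define BLOCK 64

    /* Transpose in BLOCK x BLOCK tiles so each tile stays cache-resident
       instead of streaming the whole matrix through the cache for every row. */
    void transpose_blocked(double dst[N][N], const double src[N][N])
    {
        for (int ii = 0; ii < N; ii += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int j = jj; j < jj + BLOCK; j++)
                        dst[j][i] = src[i][j];
    }

And a sketch of the false-sharing fix, keeping each thread's counter on its own cache line; the 64-byte line size is an assumption and should be checked against the actual hardware (e.g. with lstopo):

    struct per_thread_counter {
        _Alignas(64) long count;             /* C11: one counter per assumed 64-byte line */
    };
    struct per_thread_counter counters[4];   /* e.g. one per core, no shared lines */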
Translation Lookaside Buffers (TLB)
Translating a virtual address to a physical address is expensive.
The TLB caches these translations to speed things up.
Each TLB entry maps a virtual page to a physical page.
Fixes for issues
- Group associated functions together on the same page
- "Hot-cold" code section optimization
- Use the Transparent Huge Page (THP) mechanism (see the sketch below)
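One way to use the THP mechanism from application code is the Linux madvise(MADV_HUGEPAGE) hint; a minimal sketch for a large anonymous mapping (the helper name alloc_thp is made up, and THP must be enabled in the kernel for the hint to have any effect):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    /* Allocate len bytes and hint that the range should be backed by huge pages,
       reducing the number of TLB entries needed to cover it. */
    void *alloc_thp(size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf != MAP_FAILED)
            madvise(buf, len, MADV_HUGEPAGE);   /* Linux-specific hint */
        return buf;
    }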
Pipelining
Allows overlapping execution of instructions
Issues
- Some instructions might hold up the pipeline.
- There might be dependencies between instructions.
- Branching might make it difficult to predict the next instruction.
Possible fixes
- Avoid slow instructions (e.g. shift rather than divide)
- Unroll loops (see the sketch after this list)
  - Larger sequence of instructions
  - Compiler can better schedule dependent instructions
  - Fewer branches
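A minimal C sketch of manual 4x unrolling (the factor is arbitrary; at -O2/-O3 the compiler will often unroll simple loops like this on its own):

    #include <stddef.h>

    /* Sum an array four elements per iteration: a longer straight-line sequence,
       independent partial sums the compiler can schedule freely, and fewer branches. */
    long sum_unrolled(const long *a, size_t n)
    {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)       /* leftover elements */
            s0 += a[i];
        return s0 + s1 + s2 + s3;
    }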
Branch prediction
- Predicts which instruction to feed the pipeline
- Minimizes waiting for the result of a conditional branch
Possible fixes for issues
- Unrolling loops to reduce frequency of branches
- Provide hints about the likelihood of a conditional (see the sketch after this list)
- Convert short, unpredictable branches into straight-line code
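Two minimal C sketches of the last two fixes: a GCC/Clang __builtin_expect hint for a condition that is almost never true, and a short branch rewritten as a conditional expression the compiler can turn into a conditional select instead of a hard-to-predict jump (the function names are made up for illustration):

    /* Hint that the error path is unlikely so the hot path stays straight-line. */
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int process_request(int fd)
    {
        if (unlikely(fd < 0))
            return -1;           /* cold error path */
        /* ... hot path ... */
        return 0;
    }

    /* Branchless max: compilers typically emit a conditional select (csel/cmov)
       here rather than a conditional branch. */
    int max_no_branch(int a, int b)
    {
        return a > b ? a : b;
    }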