TriLUG meeting - Performance analysis

Thursday, 14 January 2016
Aaron Schrab

Presentation by William Cohen of Red Hat

Maximize ROI

Avoid guesses without data
Measure
- Representative benchmarks
- Informative metrics
Prioritize
Address issues with biggest impact

Typical tuning process

Pseudo code:


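/* Repeat measure-and-modify until the system meets the performance target. */
void optimize(int desired_perf) {
  setup_base_line_system();  /* establish a baseline to compare against */

  while (get_measurements() < desired_perf)  {
    modify_system();  /* apply a change, then measure again */
  }
}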
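The loop above only works if get_measurements() is repeatable. A minimal sketch of one way to take such a measurement, assuming wall-clock time of a single function is the metric of interest. The names `workload`, `elapsed_sec` and `best_time_sec` are hypothetical and were not part of the talk.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical workload standing in for the code being tuned. */
static uint64_t workload(uint64_t n)
{
    uint64_t sum = 0;
    for (uint64_t i = 0; i < n; i++)
        sum += i * i;
    return sum;
}

/* Seconds elapsed between two CLOCK_MONOTONIC timestamps. */
static double elapsed_sec(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) +
           (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Run the workload `runs` times and return the shortest time in seconds.
 * Taking the minimum is one common way to reduce noise from other
 * activity on the system. The workload's result is stored in *result. */
static double best_time_sec(uint64_t n, int runs, uint64_t *result)
{
    assert(runs > 0 && result != NULL);
    double best = -1.0;
    for (int r = 0; r < runs; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        *result = workload(n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = elapsed_sec(t0, t1);
        if (best < 0.0 || s < best)
            best = s;
    }
    return best;
}
```

Calling `best_time_sec(1000000, 5, &r)` before and after each change gives a number to compare against the target. An optimizing compiler may simplify a toy workload like this one, so a real benchmark should be representative of the actual program.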

Dragonboard 410c

OS: Linux, Android, Windows 10
Quad-core ARM Cortex A53
1GB DRAM
8GB eMMC
Micro SD card slot
2 USB 2.0 connectors
Integrated WiFi and Bluetooth
HDMI output

Linux perf tool

User-space tool in Linux kernel sources
Fedora perf or Debian linux-tools packages
Available on wide variety of architectures
- x86_64, arm, powerpc, mips, aarch64, ...
Access to performance monitoring events
- Clock cycles
- Instructions executed
- Branches
- Various types of cache accesses and misses
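The same counters that perf reads are available to programs through the Linux perf_event_open system call. A rough sketch, closely following the example in the perf_event_open(2) man page, that counts user-space instructions executed by one function. The helper names `count_instructions` and `spin` are hypothetical. The call can fail where counters are unavailable or restricted (for example in some virtual machines or containers), so the sketch reports that case rather than assuming success.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc provides no wrapper for this system call, so call it directly. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical workload: a short loop the compiler cannot remove. */
static void spin(void *arg)
{
    volatile uint64_t sink = 0;
    int n = *(int *)arg;
    for (int i = 0; i < n; i++)
        sink += (uint64_t)i;
}

/* Count user-space instructions executed while fn(arg) runs.
 * Returns 0 and fills *count on success, -1 if the counter
 * could not be opened or read. */
static int count_instructions(void (*fn)(void *), void *arg, uint64_t *count)
{
    assert(fn != NULL && count != NULL);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        /* start stopped, enable explicitly below */
    attr.exclude_kernel = 1;  /* count user space only */
    attr.exclude_hv = 1;

    /* pid 0, cpu -1: measure the calling thread on any CPU. */
    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    int ok = read(fd, count, sizeof(*count)) == (ssize_t)sizeof(*count);
    close(fd);
    return ok ? 0 : -1;
}
```

Changing `attr.config` to `PERF_COUNT_HW_CPU_CYCLES` or `PERF_COUNT_HW_BRANCH_INSTRUCTIONS` selects the other events listed above.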

Perf commands

Overall counts

perf stat command

Display stats for named command

Get sampling data

perf record command

writes to perf.data file

perf record --call-graph fp gzip systemtap.log

fp is frame pointer (architecture dependent)

Report on recorded data

perf report

Top-like data

perf top

Available events

perf list

-e cycles,instructions

Raw, hardware-specific events from hardware manuals. Look for chapter on performance monitoring.

-e r16,r17,r18

Mapping data back to source

Needs debug info (compile with -g)

Programmer's simple processor model

Finish each instruction before starting next
No shortcuts
- Actually get data from memory

Actual processor implementation

Pipelining
Branch prediction

`lstopo`

Display topology of system, cpu cores, caches.

Cache

Accessing main memory takes hundreds of cycles; a cache hit takes only a few.

Issues and possible fixes

Capacity misses
- Section loops to work on smaller sections of data
Conflict misses
- Change data layout so data ends up in different cache sets
False sharing
- Lay out data so items are in separate cache lines
Flushes and invalidations
- Avoid intermixing data and code, minimize use of mmap
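As an illustration of the first fix (sectioning loops, often called blocking or tiling), here is a sketch of a matrix transpose that works on one small square of the matrix at a time so the data being touched is more likely to stay in cache. The tile size of 32 is an assumption, not a figure from the talk. A suitable value depends on the cache being targeted and should be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 32  /* assumed tile edge, tune by measurement */

/* Straightforward transpose of an n x n row-major matrix.
 * For large n the writes to dst stride through memory and
 * keep evicting lines before they are reused. */
static void transpose_naive(const double *src, double *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Same result, but processed in TILE x TILE sections. */
static void transpose_tiled(const double *src, double *dst, size_t n)
{
    assert(src != dst);
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            /* Clamp the tile at the matrix edge when n is not
             * a multiple of TILE. */
            size_t i_end = ii + TILE < n ? ii + TILE : n;
            size_t j_end = jj + TILE < n ? jj + TILE : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

Whether the tiled version is actually faster on a given machine is something to confirm with perf stat and the cache-miss events rather than assume.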

Translation Lookaside Buffers (TLB)

Translating virtual to physical address is expensive

TLB caches this mapping to speed things up.

Each entry in TLB maps virtual page to physical page.
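To make the page mapping concrete, a small sketch of how a virtual address splits into a page number (what the TLB looks up) and an offset within the page, plus a count of how many pages a buffer spans. The names `split_vaddr` and `pages_touched` are hypothetical, and the page size is passed in as a parameter rather than assumed.

```c
#include <assert.h>
#include <stdint.h>

/* A virtual address split into page number and offset within the page. */
struct vaddr_parts {
    uintptr_t page;
    uintptr_t offset;
};

/* page_size must be a power of two (4096 is common on Linux). */
static struct vaddr_parts split_vaddr(uintptr_t addr, uintptr_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    struct vaddr_parts p = { addr / page_size, addr & (page_size - 1) };
    return p;
}

/* Number of distinct pages, and so TLB entries, needed to cover
 * `len` bytes starting at `addr`. */
static uintptr_t pages_touched(uintptr_t addr, uintptr_t len,
                               uintptr_t page_size)
{
    if (len == 0)
        return 0;
    return split_vaddr(addr + len - 1, page_size).page -
           split_vaddr(addr, page_size).page + 1;
}
```

With 4096-byte pages, a 10-byte object starting at address 4090 straddles two pages and needs two TLB entries, while the same object at address 0 needs one. The actual page size can be obtained at run time with `sysconf(_SC_PAGESIZE)`.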

Fixes for issues

Group associated functions together on same page.

"Hot-cold" code section optimization.

Use Transparent Huge Page (THP) mechanism.
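One way a program can ask for THP on a specific region is madvise() with MADV_HUGEPAGE, sketched below. This is only a hint: the call fails if the kernel was built without THP support, and even when it succeeds the kernel decides whether huge pages are actually used. The function name `alloc_with_thp_hint` is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map `len` bytes of anonymous memory and hint that huge pages are
 * wanted for it. Returns NULL if the mapping fails. *thp_hint_ok is
 * set to 1 if the kernel accepted the hint, 0 otherwise. The caller
 * releases the memory with munmap(). */
static void *alloc_with_thp_hint(size_t len, int *thp_hint_ok)
{
    assert(len > 0 && thp_hint_ok != NULL);
    *thp_hint_ok = 0;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

#ifdef MADV_HUGEPAGE
    /* Linux-specific; not fatal if it is rejected. */
    if (madvise(p, len, MADV_HUGEPAGE) == 0)
        *thp_hint_ok = 1;
#endif
    return p;
}
```

The memory is usable whether or not the hint was accepted, so the program behaves the same either way and only the number of TLB entries needed changes.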

Pipelining

Allows overlapping execution of instructions

Issues

Some instructions might hold up pipeline.
Might have dependencies between instructions.
Branching might make it difficult to predict next instruction.

Possible fixes

Avoid slow instructions (e.g. use a shift rather than a divide)
Unroll loops
- Larger sequence of instructions
- Compiler can better schedule dependent instructions
- Fewer branches
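A sketch of both fixes, under the assumption of simple unsigned data. The unrolled sum uses four independent accumulators so consecutive additions do not depend on each other, and it takes one loop branch per four elements. The function names are hypothetical. Compilers often apply these transformations themselves at higher optimization levels, so measure before hand-tuning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline: one add and one branch per element, and every add
 * depends on the result of the previous one. */
static uint64_t sum_simple(const uint32_t *v, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Unrolled by four with separate accumulators. */
static uint64_t sum_unrolled4(const uint32_t *v, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    /* Handle the zero to three leftover elements. */
    for (; i < n; i++)
        s0 += v[i];
    return s0 + s1 + s2 + s3;
}

/* Unsigned division by 2^k expressed as a right shift. */
static uint32_t div_pow2(uint32_t x, unsigned k)
{
    assert(k < 32);
    return x >> k;
}
```

The shift trick as written applies only to unsigned values and power-of-two divisors. Signed division rounds toward zero in C, so a plain shift can give a different result for negative numbers.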