TriLUG meeting - Performance analysis

William Cohen of Red Hat

Maximize ROI

  • Avoid guesses without data
  • Measure
    • Representative benchmarks
    • Informative metrics
  • Prioritize
  • Address issues with biggest impact

Typical tuning process

Pseudo code:

void optimize(int desired_perf) {
  while (get_measurements() < desired_perf) {
    /* ... */
  }
}

DragonBoard 410c

  • OS: Linux, Android, Windows 10
  • Quad-core ARM Cortex-A53
  • 1GB DRAM
  • 8GB eMMC
  • Micro SD card slot
  • 2 USB 2.0 connectors
  • Integrated WiFi and Bluetooth
  • HDMI output

Linux perf tool

  • User-space tool in Linux kernel sources
  • Fedora perf or Debian linux-tools packages
  • Available on wide variety of architectures
    • x86_64, arm, powerpc, mips, aarch64, ...
  • Access to performance monitoring events
    • Clock cycles
    • Instructions executed
    • Branches
    • Various types of cache accesses and misses
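
The counters that perf reads can also be read from inside a program. A minimal sketch, assuming Linux with the perf_event_open(2) system call available and permitted for the current user; count_instructions and spin_work are hypothetical names used only for illustration. On systems where the counter cannot be opened (virtual machines, restrictive perf_event_paranoid settings), the helper returns -1.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical workload: a simple counting loop. */
static void spin_work(void *arg)
{
    volatile unsigned long sink = 0;
    unsigned long iters = *(unsigned long *)arg;
    for (unsigned long i = 0; i < iters; i++)
        sink += i;
}

/* Hypothetical helper: run work(arg) and return the number of user-space
 * instructions counted for the calling thread, or -1 if the counter is
 * unavailable or could not be read. */
static long long count_instructions(void (*work)(void *), void *arg)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        /* start stopped, enable explicitly below */
    attr.exclude_kernel = 1;  /* count user space only */
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: this thread, on whichever CPU it runs. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    ssize_t n = read(fd, &count, sizeof(count));
    close(fd);
    return n == (ssize_t)sizeof(count) ? (long long)count : -1;
}
```

For whole-program counts, perf stat gives the same information without code changes.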

Perf commands

Overall counts

perf stat command

Display stats for named command

Get sampling data

perf record command

Writes samples to a file (perf.data by default)

perf record --call-graph fp gzip systemtap.log

fp is frame pointer (architecture dependent)

Report on recorded data

perf report

Top-like data

perf top

Available events

perf list

-e cycles,instructions

Raw, hardware-specific events are listed in the processor manuals. Look for the chapter on performance monitoring.

-e r16,r17,r18

Mapping data back to source

Needs debug info (compile with -g)

Programmer's simple processor model

  • Finish each instruction before starting next
  • No shortcuts
    • Actually get data from memory

Actual processor implementation

  • Pipelining
  • Branch prediction


Display topology of system: CPU cores, caches.


Accessing main memory takes hundreds of cycles; a cache hit takes only a few.

Issues and possible fixes

  • Capacity misses
    • Split loops into blocks that work on smaller sections of data
  • Conflict misses
    • Change data layout so data ends up in different sets
  • False sharing
    • Lay out data so items are in separate cache lines
  • Flushes and invalidations
    • Avoid intermixing data and code, minimize use of mmap
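
Two of the fixes above can be sketched in C. This is an illustration under stated assumptions, not a tuned implementation: TILE and CACHE_LINE are assumed values that should be checked against the actual hardware, and transpose_tiled and padded_counter are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 32        /* assumed tile edge; tune for the target cache */
#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Capacity misses: transpose an n x n matrix one TILE x TILE block at a
 * time, so the rows and columns being touched stay in cache while they
 * are in use. */
static void transpose_tiled(const double *src, double *dst, size_t n)
{
    assert(src != dst);
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = ii; i < n && i < ii + TILE; i++)
                for (size_t j = jj; j < n && j < jj + TILE; j++)
                    dst[j * n + i] = src[i * n + j];
}

/* False sharing: align each per-thread counter to its own cache line so
 * that writes by one thread do not invalidate a neighbour's line. */
struct padded_counter {
    _Alignas(CACHE_LINE) unsigned long value;
};
```

An array of struct padded_counter places consecutive counters CACHE_LINE bytes apart, at the cost of extra memory per counter.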

Translation Lookaside Buffers (TLB)

Translating virtual to physical address is expensive

TLB caches this mapping to speed things up.

Each entry in TLB maps virtual page to physical page.
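
As a rough illustration of the page mapping, a virtual address splits into a page number (what the TLB looks up) and an offset within the page (passed through unchanged). The sketch below assumes a power-of-two page size supplied by the caller; split_address and page_addr are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct page_addr {
    uintptr_t page;    /* virtual page number: the part a TLB entry maps */
    uintptr_t offset;  /* byte offset within the page */
};

/* Split a virtual address for a given power-of-two page size. */
static struct page_addr split_address(uintptr_t vaddr, uintptr_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    struct page_addr r = { vaddr / page_size, vaddr & (page_size - 1) };
    return r;
}
```

With 4096-byte pages, address 0x12345 falls in page 0x12 at offset 0x345. Every address in the same page shares one TLB entry, so larger pages cover more memory per entry.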

Fixes for issues

Group associated functions together on same page.

"Hot-cold" code section optimization.

Use Transparent Huge Page (THP) mechanism.
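
One way a program can ask for THP on a specific buffer is madvise() with MADV_HUGEPAGE. A minimal sketch, assuming Linux and a 2 MiB huge page size (typical with 4 KiB base pages, but architecture dependent); alloc_with_thp_hint is a hypothetical helper, and the kernel is free to ignore the hint.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_ALIGN (2UL * 1024 * 1024)  /* assumed huge page size */

/* Allocate at least `bytes`, aligned and rounded up to HUGE_ALIGN, and
 * hint that the range should be backed by huge pages. *advised is set to
 * 1 if the kernel accepted the hint, 0 otherwise. Release with free(). */
static void *alloc_with_thp_hint(size_t bytes, int *advised)
{
    void *p = NULL;
    size_t rounded = (bytes + HUGE_ALIGN - 1) / HUGE_ALIGN * HUGE_ALIGN;

    *advised = 0;
    if (posix_memalign(&p, HUGE_ALIGN, rounded) != 0)
        return NULL;
#ifdef MADV_HUGEPAGE
    if (madvise(p, rounded, MADV_HUGEPAGE) == 0)
        *advised = 1;  /* hint accepted; huge pages are still not guaranteed */
#endif
    return p;
}
```

Whether the hint has any effect depends on the system-wide THP setting (always, madvise, or never).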


Pipelining

Allows overlapping execution of instructions


  • Some instructions might hold up pipeline.
  • Might have dependencies between instructions.
  • Branching might make it difficult to predict next instruction.

Possible fixes

  • Avoid slow instructions (e.g. shift rather than divide)
  • Unroll loops
    • Larger sequence of instructions
    • Compiler can better schedule dependent instructions
    • Fewer branches
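
The fixes above might look like the following in C. This is only a sketch: optimizing compilers often apply both transformations on their own, so measure before hand-tuning. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline: one add and one loop branch per element. */
static uint64_t sum_simple(const uint32_t *a, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by four with independent accumulators: fewer branches, and
 * the four adds do not depend on each other, so they can overlap in the
 * pipeline. */
static uint64_t sum_unrolled4(const uint32_t *a, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* leftover elements when n is not a multiple of 4 */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}

/* Shift rather than divide: for unsigned values, x >> 3 equals x / 8. */
static uint32_t div_by_8(uint32_t x)
{
    return x >> 3;
}
```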

Branch prediction

  • Predicts which instruction to feed the pipeline
  • Minimizes waiting for the result of a conditional branch

Possible fixes for issues

  • Unrolling loops to reduce frequency of branches
  • Provide hints for likelihood of conditional
  • Convert short, unpredictable branches into straight-line code
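
A sketch of the last two fixes, assuming GCC or Clang for the __builtin_expect hint (the macros fall back to plain conditions elsewhere). LIKELY, UNLIKELY, and the function names are hypothetical; whether the branchless form is actually faster depends on the data and the compiler, so measure both.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

/* Hint: the error path is rare, so the compiler can lay out the common
 * path as straight-line code. Returns 0 on success, -1 on bad arguments. */
static int checked_sum(const int32_t *a, size_t n, int64_t *out)
{
    if (UNLIKELY(a == NULL || out == NULL))
        return -1;
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    *out = s;
    return 0;
}

/* Branchy version: hard to predict when the data is random. */
static size_t count_ge_branchy(const int32_t *a, size_t n, int32_t t)
{
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] >= t)
            c++;
    return c;
}

/* Straight-line version: the comparison yields 0 or 1 and is added
 * directly, leaving no data-dependent branch in the source. */
static size_t count_ge_branchless(const int32_t *a, size_t n, int32_t t)
{
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        c += (size_t)(a[i] >= t);
    return c;
}
```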