The Performance Addict's Toolbox

Peter Steinbach
(Scionics Computer Innovation GmbH)
steinbach@scionics.de

Nov 10, 2017, MeetingC++ 2017, Berlin

Real-life performance optimisation is never

as simple as this

more like this

Agenda

  1. Motivation
  2. Who-am-I
  3. Performance outside-in
  4. Performance inside-out
  5. Benchmarks and how to create them

Who-am-I and Motivation

Who am I?

Scionics Computer Innovation GmbH

  • software and consulting company
  • founded in 2001 in Dresden, Germany
  • expertise in data analysis, bioinformatics, image analysis, HPC, ...

Our Client

MPI for Molecular Cell Biology and Genetics
MPI for Molecular Cell Biology and Genetics

mpi-cbg.de

  • 500+ staff
  • my role: Scientific Software Engineer
  • support users on our HPC infrastructure
  • software projects related to performance (think multi-threaded, GPUs, ..)

Disclaimer

Before I begin

All of my slides assume, that the code provides correct results!

Nobody wants fast code, that is wrong!

Performance Outside-In

One day as a Performance Engineer

Alan O'Rourke, Too Busy To Improve - Performance Management - Square Wheels, CC
Alan O'Rourke, Too Busy To Improve - Performance Management - Square Wheels, CC

Once in a while

From: doe@theinstitute.de
Subject: Cluster is slow
Date: Fri, 20 Oct 2017 12:03:21 +0200
To: hpcsupport@theinstitute.de

Hi,

what is going on with the cluster? My application is running
slow since yesterday.
Could you have a look at it please?

Thanks,
John

Challenge: Finding the performance regression without looking at the code

High Level Overview

htop, free et al
htop, free et al

Reference Numbers

$ dd if=/dev/zero of=/tmp/just_zeros bs=1G count=2
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 2.94478 s, 729 MB/s

$ dd if=/dev/zero of=/dev/shm/2gb.zeros bs=1G count=2
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 1.14782 s, 1.9 GB/s

 

What can your hardware typically do?

dd, ior, memhog, stream, ...

Profile with perf

$ perf record -g ./my-slow-binary
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.023 MB perf.data (75 samples) ]
$ perf report --stdio
no symbols found in /usr/bin/dd, maybe install a debug package?
# ...
# Total Lost Samples: 0
#
# Samples: 75  of event 'cycles:u'
# Event count (approx.): 1839654
#
# Children      Self  Command  Shared Object      Symbol           
# ........  ........  .......  .................  ................
#
    20.18%    20.18%  dd       [kernel.kallsyms]  [k] page_fault
            |          
             --19.77%--0
                       _int_realloc
                       page_fault
  • lightweight sample based profiling
  • per task, per CPU and per-workload counters
  • sample CPU performance counters, tracepoints or system probes
  • on windows: xperf/UIforETW

perf: what is a callstack?

Your browser does not support SVG

perf: sampling based profiling

Your browser does not support SVG

  • for every sampling event:
    • record call stack
    • query hardware counters, e.g. cpu-cycles
  • sampling must not be accurate

perf Reloaded with FlameGraphs

$ perf record -g ./my-slow-binary
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.023 MB perf.data (75 samples) ]
$ perf script > out.perf
$ ./stackcollapse-perf.pl out.perf > out.folded
$ ./flamegraph.pl out.folded > perf_samples.svg
  • visualisation technique conceived by Brendan Gregg (Netflix)
  • seamless integration into perf, dtrace, systemtap, XCode Instruments, Lightweight Java Profiler, Microsoft Visual Studio profiles, ...
  • based on collected counter samples and the stacktrace they were collected in

Ethereum Mining as FlameGraph

Your browser does not support SVG

  • (x axis) current stack level in alphabetical order
  • (y axis) number of samples in that stacktrace level

HPC user's slow application

Your browser does not support SVG

Bottom Line

Taking a balloon to get an overview of performance bottlenecks is possible.

Performance Inside-Out

High Diversity of Tools!

Textual output, gprof

$ g++ -pg -O2 -std=c++11 vector_unroll_example.cpp
$ ./a.out
$ gprof ./a.out gmon.out > analysis.txt
$ head analysis.txt
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 26.71      1.02     1.02                             void my_axpy<6u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
 26.71      2.05     1.02                             void my_axpy<2u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
 23.83      2.96     0.91                             void my_axpy<8u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
 23.04      3.84     0.88                             void my_axpy<4u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
  0.00      3.84     0.00        1     0.00     0.00  _GLOBAL__sub_I_main

Profile from Peter Gottschling's example on vector unrolling.

Simple Graphical output, perftools

Profile from Peter Gottschling's example on vector unrolling.

valgrind + kcachegrind

Profile from Peter Gottschling's example on vector unrolling.

Using flamegraphs, hotspot

Proprietary tools

Found a hot spot!

Danger zone of mental models

Inspect Assembly?

perf for hardware exploration?

$ perf list

List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  #...
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  #...
  • perf event list depends on kernel version
  • hardware counters are not portable (specification change by vendors)
  • alternative: ocperf

Test hypothesis with likwid

  • profiling through hardware counters (consistent meta markers for portability)
  • exploration through monitoring
  • marker API for C, C++, java and python

use case: Index Lists

#include <vector>
#include "omp.h"

struct item{
    std::vector<float> position, momentum;
    std::vector<int>   nearest_neighbors;}

int main(int argc, char** argv){
    std::vector<item> world = generate(argc*10e6);
    
    for(int& time_step : timelapse){
        update(world);
        
        #pragma omp parallel for
        for(item& it : world){
            for(int& index : it.nearest_neighbors){
                auto distance = calculate(it, world[index]);
                if(distance > threshold)
                    it.nearest_neighbors.remove(index);
            }}}
    //..
}
  • hypotheses:

    • large 'unpredictable' jumps in memory access diminishes cache bandwidth

    • false sharing forces cache line reloads as read-only and writable items may share the same cache line

Let's measure!

use case: Through Likwid

Use Case

# export OMP_NUM_THREADS=1
# path/to/likwid-perfctr -f -c 0 -g FALSE_SHARE \
numactl -m0 -C0 ./my_app
+----------------------------------|--------------+
|              Metric              |    Core 0    |
+----------------------------------|--------------+
//..
|  Local LLC false sharing [MByte] |       0.0008 |
|   Local LLC false sharing rate   | 5.608215e-10 |
//..
+----------------------------------|--------------+

# export OMP_NUM_THREADS=4
# path/to/likwid-perfctr -f -c 0-4 -g FALSE_SHARE \
numactl -m0 -C0-3 ./my_app
+---------------------------------------|--------------|
|                 Metric                |      Sum     |
+---------------------------------------|--------------|
#..
|  Local LLC false sharing [MByte] STAT |    2973.7637 |
|   Local LLC false sharing rate STAT   |       0.0081 |
#..
+---------------------------------------|--------------|

Stream Benchmark as Reference

# export OMP_NUM_THREADS=1
# path/to/likwid-perfctr -f -c 0 -g FALSE_SHARE \
numactl -m0 -C0 ./stream
+----------------------------------|--------------+
|              Metric              |    Core 0    |
+----------------------------------|--------------+
#..
|  Local LLC false sharing [MByte] |       0.0006 |
|   Local LLC false sharing rate   | 6.057282e-10 |
#..
+----------------------------------|--------------+

# export OMP_NUM_THREADS=4
# path/to/likwid-perfctr -f -c 0-4 -g FALSE_SHARE \
numactl -m0 -C0-3 ./stream
+---------------------------------------|--------------|
|                 Metric                |      Sum     |
+---------------------------------------|--------------|
#..
|  Local LLC false sharing [MByte] STAT |       0.1067 |
|   Local LLC false sharing rate STAT   | 4.080027e-07 |
#..
+---------------------------------------|--------------|

Bottom Line

  • excellent tools available to find hot spots
  • once "found", talk to someone
    (rubber duck or colleaque(s))
  • create falsifiable hypotheses
  • MEASURE!

Benchmarks and how to create them

faster code?

Chandler Carruth, Understanding Compiler Optimization, MeetingCPP 2015

Klaus Iglberger: Guys that do know a lot about performance, do a lot of manual unrolling (manual vectorization). Apparently they don't trust the compiler too much. What is your take on this?

Chandler: How do you define "poeple who know a lot about performance"? Serious question. So I work with Google's optimisation team who is responsible for making our C++ code run fast. And I have never seen them manually unroll a loop.

chrono is your friend

#include <chrono>
#include <iostream>
#include "production_code.hpp"
#include "new_ideas.hpp"

int main(int argc, char** argv){

    auto start = std::chrono::high_resolution_clock::now();
    auto result = production_code::algorithm();
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_p = (end - start);
    
    start = std::chrono::high_resolution_clock::now();
    auto new_result = new_ideas::algorithm();
    end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_i = (end - start);

    std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count() 
              << std::endl;
              
    return 0;
}

... check for correctness

#include <chrono>
#include <iostream>
#include "production_code.hpp"
#include "new_ideas.hpp"

int main(int argc, char** argv){

    auto start = std::chrono::high_resolution_clock::now();
    auto result = production_code::algorithm();
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_p = (end - start);
    
    start = std::chrono::high_resolution_clock::now();
    auto new_result = new_ideas::algorithm();
    end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_i = (end - start);

    if(result == new_result)
        std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count() 
                  << std::endl;
    else
        std::cout << "Never mind!" << std::endl;
}

noisy lab under your fingers

#include ...

int main(int argc, char** argv){

    auto result = 0;
    auto new_result = 0;
    
    auto start = std::chrono::high_resolution_clock::now();

    for(int i = 0;i<n_repetitions;++i)
        result = production_code::algorithm();
   
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_p = (end - start);
    
    start = std::chrono::high_resolution_clock::now();
    
    for(int i = 0;i<n_repetitions;++i)
        new_result = new_ideas::algorithm();
        
    end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_i = (end - start);

    if(result == new_result)
        std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count() 
                  << std::endl;
    else
        std::cout << "Never mind!" << std::endl;
}

Please take notes

#include ...

using duration_t = std::chrono::duration<double>;

int main(int argc, char** argv){
    //..
    auto start = std::chrono::high_resolution_clock::now();
    auto end = start;
    std::vector<duration_t> my_timings(n_repetitions);

    for(int i = 0;i<n_repetitions;++i){
        start = std::chrono::high_resolution_clock::now();
        result = production_code::algorithm();
        my_timings[i] = std::chrono::high_resolution_clock::now() - start;
    }
   
    //same with new_result = new_ideas::algorithm()

    if(result == new_result){
        std::ofstream ofile("results.csv");ofile.open();
        for(int i = 0;i<n_repetitions;++i){
            ofile << i << ",production," << prod_timings[i].count() << ",seconds" << std::endl;
        }
        //same with new_idea
        ofile.close()
    }
    else
        std::cout << "Never mind!" << std::endl;
}

Why?

... It's a simple Python interface around a blazing fast C++ library ...

from github.com/vincentlaucsb/csvmorph

 

... However C++ code used to be significantly faster for a long time, and also today still is in many cases.

from SO "How much faster is C++ than C#?"

see more for youself!

Life as a reviewer

Let's take a toy example

... what if

Ensemble variances tell a story!
Ensemble variances tell a story!

Standardized, easy-to-parse output!

google/benchmark

 

  • written in C++11/C++03
  • support of multi-threaded applications
  • powerful CLI
  • easy setup of (templated) test cases
  • flexible argument control
  • custom counters/timers

benchmark: in action

Question:

Are range-based for loops faster than integer based ones depending on the data type used?

benchmark: simple approach

#include <benchmark/benchmark.h>
#include <vector>

template <typename T>
double sum(const T* _data, std::size_t _len){

    double value = 0;
    for(std::size_t i = 0;i<_len;++i)
        value += _data[i];

    return value;
}

template <typename container_type>
double sum(const container_type& _data){

    typedef typename container_type::value_type value_t;

    double value = 0;
    for(const value_t& el : _data)
        value += el;

    return value;
}
static void BM_integer_index(benchmark::State& state) {

    const std::size_t len = 1 << 20;
    std::vector<int> values(len,0.f);
    double result = 0;

    for (auto _ : state){
        benchmark::DoNotOptimize(result = sum(values.data(), len));
    }
}
// Register the function as a benchmark
BENCHMARK(BM_integer_index);

static void BM_range_based(benchmark::State& state) {

    const std::size_t len = 1 << 20;
    std::vector<int> values(len,0.f);
    double result = 0;

    for (auto _ : state){
        benchmark::DoNotOptimize(result = sum(values));
    }

}
BENCHMARK(BM_range_based);

BENCHMARK_MAIN();

benchmark: simple approach output

Run on (4 X 3600 MHz CPU s)
2017-11-08 10:24:43
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------
Benchmark                 Time           CPU Iterations
--------------------------------------------------------
BM_integer_index     922920 ns     915531 ns        764
BM_range_based       937344 ns     929681 ns        768

benchmark: advanced

template <typename T>
static void BM_integer_index(benchmark::State& state) {

    const std::size_t len = state.range(0);
    std::vector<T> values(len,0.f);
    double result = 0;

    for (auto _ : state){
        benchmark::DoNotOptimize(result = sum(values.data(), len));
    }
}

BENCHMARK_TEMPLATE(BM_integer_index,int)
->Arg(64)
->Arg(512)
->Arg(1 << 10)
->Arg(128<<10)
->Arg(1<<20)
->Arg(128<<20);
BENCHMARK_TEMPLATE(BM_integer_index,float)
->Arg(64)
->Arg(512)
->Arg(1 << 10)
->Arg(128<<10)
->Arg(1<<20)
->Arg(128<<20);

BENCHMARK_MAIN();
  • multiple arguments are also supported

    BENCHMARK_TEMPLATE(BM_integer_index,int)
    //42 is the initial value of the reduced sum
    ->Arg({64, 42})
    //..
    ;
  • templated benchmark cases are supported
  • workflow:

  1. build benchmark
    (different working set sizes, types)
  2. compile with varying flags
  3. run & inspect
  4. render report with rmarkdown

benchmark: advanced output

Run on (4 X 3600 MHz CPU s)
2017-11-08 10:25:27
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
BM_integer_index<int>/64                  50 ns         50 ns   10000000
BM_integer_index<int>/512                424 ns        423 ns    1634359
BM_integer_index<int>/1024               864 ns        861 ns     772101
BM_integer_index<int>/131072          112125 ns     111777 ns       6344
BM_integer_index<int>/1048576         924382 ns     916717 ns        761
BM_integer_index<int>/134217728    123700290 ns  122614766 ns          6
BM_integer_index<float>/64                52 ns         51 ns   13374011
BM_integer_index<float>/512              427 ns        425 ns    1641151
BM_integer_index<float>/1024             863 ns        860 ns     772552
BM_integer_index<float>/131072        111322 ns     111160 ns       6109
BM_integer_index<float>/1048576       914593 ns     909174 ns        763
BM_integer_index<float>/134217728  122954355 ns  122219776 ns          6
BM_range_based<int>/64                    50 ns         50 ns   13580124
BM_range_based<int>/512                  429 ns        429 ns    1570356
BM_range_based<int>/1024                 866 ns        865 ns     815042
BM_range_based<int>/131072            111289 ns     111139 ns       6316
BM_range_based<int>/1048576           912475 ns     907277 ns        761
BM_range_based<int>/134217728      122509880 ns  121832332 ns          6
BM_range_based<float>/64                  48 ns         48 ns   13944707
BM_range_based<float>/512                426 ns        426 ns    1637659
BM_range_based<float>/1024               863 ns        862 ns     810743
BM_range_based<float>/131072          110915 ns     110775 ns       6343
BM_range_based<float>/1048576         917501 ns     912365 ns        735
BM_range_based<float>/134217728    122908318 ns  122219268 ns          6

benchmark: reproduce this!

gcc 6.4.1, libbbenchmark 1.3,  code available in this repo
gcc 6.4.1, libbbenchmark 1.3,
code available in this repo
  • file an issue if you reproduced this!
$ cd src/libbenchmark
$ //install libbenchmark & tidyverse R package
$ CXXFLAGS=-O2 make report

There is more

nickbruun/hayai 229 - based on googletest
- random order
- fixture support
- no csv output
- online docs? commit activity?
- no donotoptimize
- max/min/means reported by default
DigitalInBlue/celero 249 - no dependencies
- baseline
- fixture support
- no csv output
- means reported by default
nonius.io 49-194 - header-only
- depends on boost
- super statistics summary
- confusing repo structure
- buggy example(s)
- confidence intervals fixed to
normal distribution
- no donotoptimize
google/benchmark 1985 - no dependencies
- feature rich
- templated versus fixture based setup
- means reported by default

Where to stop?

  • clearify upfront
  • where is your limit?
    • hardware
    • APIs and libraries
    • dependencies
    • compiler
    • OS

roofline

  • acknowledge boundaries of algorithm
  • differentiate "work" versus "traffic"
  • simplistic: bottleneck is either work or traffic
  • cache effects problematic

roofline in action

  • clear indication where optimisations go
  • can be used to give estimates on performance improvements (say on new hardware)
  • typically used for floating-point heavy applications
  • interesting mental model

roofline for real

complex vs. real FFT transforms
complex vs. real FFT transforms

gearshifft FFT benchmark

  • co-authored with TU Dresden
  • note the variances!
  • published at ISC'17

Bottom Line

  • your requirements are your guiding light in the dungeon
  • reproducible ensemble based benchmarks are key

Summary

Take aways

Take a balloon:

Use Tools to check the lay of the land.

Falsify the rubber duck:

Profile and check your hypothesis.

Survive the dungeon

With automated ensemble based benchmarks.

Final Words

C++ is a language considered "fast".

We, the community, need to live up to this standard and have robust and reproducible performance numbers!

 

Thank you for your attention!

Backup