The Performance Addict's Toolbox

Peter Steinbach
(Scionics Computer Innovation GmbH)
steinbach@scionics.de

Nov 10, 2017, MeetingC++ 2017, Berlin

Real-life performance optimisation is never

as simple as this

more like this

Agenda

Motivation
Who-am-I
Performance outside-in
Performance inside-out
Benchmarks and how to create them

Who-am-I and Motivation

Who am I?

Scionics Computer Innovation GmbH

software and consulting company
founded in 2001 in Dresden, Germany
expertise in data analysis, bioinformatics, image analysis, HPC, ...

Our Client

MPI for Molecular Cell Biology and Genetics

mpi-cbg.de

500+ staff
my role: Scientific Software Engineer
support users on our HPC infrastructure
software projects related to performance (think multi-threaded, GPUs, ..)

Disclaimer

github.com/psteinb/meetingcpp2017

Before I begin

All of my slides assume, that the code provides correct results!

Nobody wants fast code, that is wrong!

Performance Outside-In

One day as a Performance Engineer

Alan O'Rourke, Too Busy To Improve - Performance Management - Square Wheels, CC

Once in a while

From: doe@theinstitute.de
Subject: Cluster is slow
Date: Fri, 20 Oct 2017 12:03:21 +0200
To: hpcsupport@theinstitute.de

Hi,

what is going on with the cluster? My application is running
slow since yesterday.
Could you have a look at it please?

Thanks,
John

Challenge: Finding the performance regression without looking at the code

High Level Overview

Reference Numbers

$ dd if=/dev/zero of=/tmp/just_zeros bs=1G count=2
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 2.94478 s, 729 MB/s

$ dd if=/dev/zero of=/dev/shm/2gb.zeros bs=1G count=2
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 1.14782 s, 1.9 GB/s

What can your hardware typically do?

dd, ior, memhog, stream, ...

Profile with perf

$ perf record -g ./my-slow-binary
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.023 MB perf.data (75 samples) ]
$ perf report --stdio
no symbols found in /usr/bin/dd, maybe install a debug package?
# ...
# Total Lost Samples: 0
#
# Samples: 75  of event 'cycles:u'
# Event count (approx.): 1839654
#
# Children      Self  Command  Shared Object      Symbol           
# ........  ........  .......  .................  ................
#
    20.18%    20.18%  dd       [kernel.kallsyms]  [k] page_fault
            |          
             --19.77%--0
                       _int_realloc
                       page_fault

lightweight sample based profiling
per task, per CPU and per-workload counters
sample CPU performance counters, tracepoints or system probes
on windows: xperf/UIforETW

perf: what is a callstack?

perf: sampling based profiling

for every sampling event:
- record call stack
- query hardware counters, e.g. cpu-cycles
sampling must not be accurate

perf Reloaded with FlameGraphs

$ perf record -g ./my-slow-binary
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.023 MB perf.data (75 samples) ]
$ perf script > out.perf
$ ./stackcollapse-perf.pl out.perf > out.folded
$ ./flamegraph.pl out.folded > perf_samples.svg

visualisation technique conceived by Brendan Gregg (Netflix)
seamless integration into perf, dtrace, systemtap, XCode Instruments, Lightweight Java Profiler, Microsoft Visual Studio profiles, ...
based on collected counter samples and the stacktrace they were collected in

Ethereum Mining as FlameGraph

(x axis) current stack level in alphabetical order

(y axis) number of samples in that stacktrace level

HPC user's slow application

Bottom Line

Taking a balloon to get an overview of performance bottlenecks is possible.

Performance Inside-Out

High Diversity of Tools!

Textual output, gprof

$ g++ -pg -O2 -std=c++11 vector_unroll_example.cpp
$ ./a.out
$ gprof ./a.out gmon.out > analysis.txt
$ head analysis.txt
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 26.71      1.02     1.02                             void my_axpy<6u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
 26.71      2.05     1.02                             void my_axpy<2u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
 23.83      2.96     0.91                             void my_axpy<8u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
 23.04      3.84     0.88                             void my_axpy<4u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
  0.00      3.84     0.00        1     0.00     0.00  _GLOBAL__sub_I_main

Profile from Peter Gottschling's example on vector unrolling.

Simple Graphical output, perftools

Profile from Peter Gottschling's example on vector unrolling.

valgrind + kcachegrind

Profile from Peter Gottschling's example on vector unrolling.

Using flamegraphs, hotspot

Proprietary tools

Found a hot spot!

Danger zone of mental models

Inspect Assembly?

perf for hardware exploration?

$ perf list

List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  #...
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  #...

perf event list depends on kernel version
hardware counters are not portable (specification change by vendors)
alternative: ocperf

Test hypothesis with likwid

github.com/RRZE-HPC/likwid
open source Performance monitoring and benchmarking suite
Linux only

profiling through hardware counters (consistent meta markers for portability)
exploration through monitoring
marker API for C, C++, java and python

use case: Index Lists

#include <vector>
#include "omp.h"

struct item{
    std::vector<float> position, momentum;
    std::vector<int>   nearest_neighbors;}

int main(int argc, char** argv){
    std::vector<item> world = generate(argc*10e6);
    
    for(int& time_step : timelapse){
        update(world);
        
        #pragma omp parallel for
        for(item& it : world){
            for(int& index : it.nearest_neighbors){
                auto distance = calculate(it, world[index]);
                if(distance > threshold)
                    it.nearest_neighbors.remove(index);
            }}}
    //..
}

hypotheses:
- large 'unpredictable' jumps in memory access diminishes cache bandwidth
- false sharing forces cache line reloads as read-only and writable items may share the same cache line

Let's measure!

use case: Through Likwid

Use Case

# export OMP_NUM_THREADS=1
# path/to/likwid-perfctr -f -c 0 -g FALSE_SHARE \
numactl -m0 -C0 ./my_app
+----------------------------------|--------------+
|              Metric              |    Core 0    |
+----------------------------------|--------------+
//..
|  Local LLC false sharing [MByte] |       0.0008 |
|   Local LLC false sharing rate   | 5.608215e-10 |
//..
+----------------------------------|--------------+

# export OMP_NUM_THREADS=4
# path/to/likwid-perfctr -f -c 0-4 -g FALSE_SHARE \
numactl -m0 -C0-3 ./my_app
+---------------------------------------|--------------|
|                 Metric                |      Sum     |
+---------------------------------------|--------------|
#..
|  Local LLC false sharing [MByte] STAT |    2973.7637 |
|   Local LLC false sharing rate STAT   |       0.0081 |
#..
+---------------------------------------|--------------|

Stream Benchmark as Reference

# export OMP_NUM_THREADS=1
# path/to/likwid-perfctr -f -c 0 -g FALSE_SHARE \
numactl -m0 -C0 ./stream
+----------------------------------|--------------+
|              Metric              |    Core 0    |
+----------------------------------|--------------+
#..
|  Local LLC false sharing [MByte] |       0.0006 |
|   Local LLC false sharing rate   | 6.057282e-10 |
#..
+----------------------------------|--------------+

# export OMP_NUM_THREADS=4
# path/to/likwid-perfctr -f -c 0-4 -g FALSE_SHARE \
numactl -m0 -C0-3 ./stream
+---------------------------------------|--------------|
|                 Metric                |      Sum     |
+---------------------------------------|--------------|
#..
|  Local LLC false sharing [MByte] STAT |       0.1067 |
|   Local LLC false sharing rate STAT   | 4.080027e-07 |
#..
+---------------------------------------|--------------|

Bottom Line

excellent tools available to find hot spots
once "found", talk to someone
(rubber duck or colleaque(s))
create falsifiable hypotheses
MEASURE!

"Rubber Duckie, You're the One", by Daniel Rothamel, CC-BY 2.0

Benchmarks and how to create them

faster code?

Chandler Carruth, Understanding Compiler Optimization, MeetingCPP 2015

Klaus Iglberger: Guys that do know a lot about performance, do a lot of manual unrolling (manual vectorization). Apparently they don't trust the compiler too much. What is your take on this?

Chandler: How do you define "poeple who know a lot about performance"? Serious question. So I work with Google's optimisation team who is responsible for making our C++ code run fast. And I have never seen them manually unroll a loop.

chrono is your friend

#include <chrono>
#include <iostream>
#include "production_code.hpp"
#include "new_ideas.hpp"

int main(int argc, char** argv){

    auto start = std::chrono::high_resolution_clock::now();
    auto result = production_code::algorithm();
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_p = (end - start);
    
    start = std::chrono::high_resolution_clock::now();
    auto new_result = new_ideas::algorithm();
    end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_i = (end - start);

    std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count() 
              << std::endl;
              
    return 0;
}

... check for correctness

#include <chrono>
#include <iostream>
#include "production_code.hpp"
#include "new_ideas.hpp"

int main(int argc, char** argv){

    auto start = std::chrono::high_resolution_clock::now();
    auto result = production_code::algorithm();
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_p = (end - start);
    
    start = std::chrono::high_resolution_clock::now();
    auto new_result = new_ideas::algorithm();
    end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_i = (end - start);

    if(result == new_result)
        std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count() 
                  << std::endl;
    else
        std::cout << "Never mind!" << std::endl;
}

noisy lab under your fingers

#include ...

int main(int argc, char** argv){

    auto result = 0;
    auto new_result = 0;
    
    auto start = std::chrono::high_resolution_clock::now();

    for(int i = 0;i<n_repetitions;++i)
        result = production_code::algorithm();
   
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_p = (end - start);
    
    start = std::chrono::high_resolution_clock::now();
    
    for(int i = 0;i<n_repetitions;++i)
        new_result = new_ideas::algorithm();
        
    end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_i = (end - start);

    if(result == new_result)
        std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count() 
                  << std::endl;
    else
        std::cout << "Never mind!" << std::endl;
}

Please take notes

#include ...

using duration_t = std::chrono::duration<double>;

int main(int argc, char** argv){
    //..
    auto start = std::chrono::high_resolution_clock::now();
    auto end = start;
    std::vector<duration_t> my_timings(n_repetitions);

    for(int i = 0;i<n_repetitions;++i){
        start = std::chrono::high_resolution_clock::now();
        result = production_code::algorithm();
        my_timings[i] = std::chrono::high_resolution_clock::now() - start;
    }
   
    //same with new_result = new_ideas::algorithm()

    if(result == new_result){
        std::ofstream ofile("results.csv");ofile.open();
        for(int i = 0;i<n_repetitions;++i){
            ofile << i << ",production," << prod_timings[i].count() << ",seconds" << std::endl;
        }
        //same with new_idea
        ofile.close()
    }
    else
        std::cout << "Never mind!" << std::endl;
}

Why?

... It's a simple Python interface around a blazing fast C++ library ...

from github.com/vincentlaucsb/csvmorph

... However C++ code used to be significantly faster for a long time, and also today still is in many cases.

from SO "How much faster is C++ than C#?"

see more for youself!

Life as a reviewer

T. Hoefler et al, "Scientific Benchmarking of Parallel Computing Systems - Twelve ways to tell the masses when reporting performance results",
SC '15 Proceedings, 2015

Let's take a toy example

... what if

Standardized, easy-to-parse output!

T. Hoefler et al, "Scientific Benchmarking of Parallel Computing Systems - Twelve ways to tell the masses when reporting performance results",
SC '15 Proceedings, 2015

Can't this be automated?

google/benchmark

written in C++11/C++03
support of multi-threaded applications
powerful CLI

easy setup of (templated) test cases
flexible argument control
custom counters/timers

benchmark: in action

Question:

Are range-based for loops faster than integer based ones depending on the data type used?

benchmark: simple approach

#include <benchmark/benchmark.h>
#include <vector>

template <typename T>
double sum(const T* _data, std::size_t _len){

    double value = 0;
    for(std::size_t i = 0;i<_len;++i)
        value += _data[i];

    return value;
}

template <typename container_type>
double sum(const container_type& _data){

    typedef typename container_type::value_type value_t;

    double value = 0;
    for(const value_t& el : _data)
        value += el;

    return value;
}

static void BM_integer_index(benchmark::State& state) {

    const std::size_t len = 1 << 20;
    std::vector<int> values(len,0.f);
    double result = 0;

    for (auto _ : state){
        benchmark::DoNotOptimize(result = sum(values.data(), len));
    }
}
// Register the function as a benchmark
BENCHMARK(BM_integer_index);

static void BM_range_based(benchmark::State& state) {

    const std::size_t len = 1 << 20;
    std::vector<int> values(len,0.f);
    double result = 0;

    for (auto _ : state){
        benchmark::DoNotOptimize(result = sum(values));
    }

}
BENCHMARK(BM_range_based);

BENCHMARK_MAIN();

benchmark: simple approach output

Run on (4 X 3600 MHz CPU s)
2017-11-08 10:24:43
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------
Benchmark                 Time           CPU Iterations
--------------------------------------------------------
BM_integer_index     922920 ns     915531 ns        764
BM_range_based       937344 ns     929681 ns        768

benchmark: advanced

template <typename T>
static void BM_integer_index(benchmark::State& state) {

    const std::size_t len = state.range(0);
    std::vector<T> values(len,0.f);
    double result = 0;

    for (auto _ : state){
        benchmark::DoNotOptimize(result = sum(values.data(), len));
    }
}

BENCHMARK_TEMPLATE(BM_integer_index,int)
->Arg(64)
->Arg(512)
->Arg(1 << 10)
->Arg(128<<10)
->Arg(1<<20)
->Arg(128<<20);
BENCHMARK_TEMPLATE(BM_integer_index,float)
->Arg(64)
->Arg(512)
->Arg(1 << 10)
->Arg(128<<10)
->Arg(1<<20)
->Arg(128<<20);

BENCHMARK_MAIN();

multiple arguments are also supported

BENCHMARK_TEMPLATE(BM_integer_index,int)
//42 is the initial value of the reduced sum
->Arg({64, 42})
//..
;

templated benchmark cases are supported
workflow:

build benchmark
(different working set sizes, types)
compile with varying flags
run & inspect
render report with rmarkdown

benchmark: advanced output

Run on (4 X 3600 MHz CPU s)
2017-11-08 10:25:27
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
BM_integer_index<int>/64                  50 ns         50 ns   10000000
BM_integer_index<int>/512                424 ns        423 ns    1634359
BM_integer_index<int>/1024               864 ns        861 ns     772101
BM_integer_index<int>/131072          112125 ns     111777 ns       6344
BM_integer_index<int>/1048576         924382 ns     916717 ns        761
BM_integer_index<int>/134217728    123700290 ns  122614766 ns          6
BM_integer_index<float>/64                52 ns         51 ns   13374011
BM_integer_index<float>/512              427 ns        425 ns    1641151
BM_integer_index<float>/1024             863 ns        860 ns     772552
BM_integer_index<float>/131072        111322 ns     111160 ns       6109
BM_integer_index<float>/1048576       914593 ns     909174 ns        763
BM_integer_index<float>/134217728  122954355 ns  122219776 ns          6
BM_range_based<int>/64                    50 ns         50 ns   13580124
BM_range_based<int>/512                  429 ns        429 ns    1570356
BM_range_based<int>/1024                 866 ns        865 ns     815042
BM_range_based<int>/131072            111289 ns     111139 ns       6316
BM_range_based<int>/1048576           912475 ns     907277 ns        761
BM_range_based<int>/134217728      122509880 ns  121832332 ns          6
BM_range_based<float>/64                  48 ns         48 ns   13944707
BM_range_based<float>/512                426 ns        426 ns    1637659
BM_range_based<float>/1024               863 ns        862 ns     810743
BM_range_based<float>/131072          110915 ns     110775 ns       6343
BM_range_based<float>/1048576         917501 ns     912365 ns        735
BM_range_based<float>/134217728    122908318 ns  122219268 ns          6

benchmark: reproduce this!

gcc 6.4.1, libbbenchmark 1.3, code available in this repo — gcc 6.4.1, libbbenchmark 1.3,
code available in this repo

file an issue if you reproduced this!

$ cd src/libbenchmark
$ //install libbenchmark & tidyverse R package
$ CXXFLAGS=-O2 make report

There is more

nickbruun/hayai	229	- based on googletest - random order - fixture support	- no csv output - online docs? commit activity? - no donotoptimize - max/min/means reported by default
DigitalInBlue/celero	249	- no dependencies - baseline - fixture support	- no csv output - means reported by default
nonius.io	49-194	- header-only - depends on boost - super statistics summary	- confusing repo structure - buggy example(s) - confidence intervals fixed to normal distribution - no donotoptimize
google/benchmark	1985	- no dependencies - feature rich	- templated versus fixture based setup - means reported by default