Thomas Röhl
Thomas.Roehl@fau.de
Peter Steinbach
steinbach@scionics.de
HPC group at Regional Computing Center Erlangen
Prof. Dr. Gerhard Wellein
Friedrich-Alexander University Erlangen-Nuremberg
Original talk by Peter Steinbach (Scientific Software Engineer, Scionics Computer Innovation GmbH, Dresden)
Scionics provides data-driven consulting.
All of my slides assume that the code produces correct results!
Nobody wants fast code that is wrong!
From: doe@theinstitute.de
Subject: Cluster is slow
Date: Fri, 20 Oct 2017 12:03:21 +0200
To: hpcsupport@theinstitute.de
Hi,
what is going on with the cluster? My application is running
slow since yesterday.
Could you have a look at it please?
Thanks,
John
Add hardware counter data: PCM (Intel Performance Counter Monitor)
$ dd if=/dev/zero of=/tmp/just_zeros bs=1G count=2
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 2.94478 s, 729 MB/s
$ dd if=/dev/zero of=/dev/shm/2gb.zeros bs=1G count=2
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 1.14782 s, 1.9 GB/s
What can your hardware typically do?
dd, ior, memhog, stream, LIKWID, ...
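If no benchmark suite is at hand, even a hand-rolled triad gives a rough baseline. A minimal sketch (single shot, no warm-up, array size chosen arbitrarily; the real stream or likwid-bench are the proper tools; compile with -O2):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main(){
    const std::size_t n = 1u << 25; // 32 Mi doubles = 256 MiB per array
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0), d(n, 3.0);
    auto start = std::chrono::high_resolution_clock::now();
    for(std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i] * d[i];
    std::chrono::duration<double> t = std::chrono::high_resolution_clock::now() - start;
    // the triad touches 4 streams x 8 bytes = 32 bytes per iteration
    std::cout << (32.0 * n / t.count()) / 1e9 << " GB/s (" << a[0] << ")\n";
    return 0;
}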
$ perf record -g ./my-slow-binary
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.023 MB perf.data (75 samples) ]
$ perf report --stdio
no symbols found in ./my-slow-binary, maybe install a debug package?
# ...
# Total Lost Samples: 0
#
# Samples: 75 of event 'cycles:u'
# Event count (approx.): 1839654
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... ................. ................
#
20.18% 20.18% dd [kernel.kallsyms] [k] page_fault
|
--19.77%--0
_int_realloc
page_fault
$ perf record -g ./my-slow-binary
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.023 MB perf.data (75 samples) ]
$ perf script > out.perf
$ ./stackcollapse-perf.pl out.perf > out.folded
$ ./flamegraph.pl out.folded > perf_samples.svg
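stackcollapse-perf.pl and flamegraph.pl are part of Brendan Gregg's FlameGraph scripts (https://github.com/brendangregg/FlameGraph).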
Taking a balloon to get an overview of performance bottlenecks is possible.
Profile from Peter Gottschling's example on vector unrolling.
$ CXX=clang++ make   # assumes the Makefile adds clang's -fxray-instrument
$ export XRAY_OPTIONS="patch_premain=true xray_mode=xray-basic verbosity=1"
$ ./vector_unroll_example
==31936==XRay: Log file in 'xray-log.vector_unroll_example.ju4PNk'
Compute time native loop is 0.159 micros.
u[0] is 15
#...
$ llvm-xray account xray-log.vector_unroll_example.ju4PNk -instr_map=./vector_unroll_example
Functions with latencies: 5
funcid count [ min, med, 90p, ...] sum function
1 1 [ 7.338530, 7.338530, 7.338530, ...] 7.338530 <invalid>:0:0: main
2 1275 [ 0.000005, 0.000011, 0.000012, ...] 0.013064 <invalid>:0:0: void my_axpy<2u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
Integration into other tools has started: Vampir
$ perf list
List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
#...
L1-dcache-load-misses [Hardware cache event]
L1-dcache-loads [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
L1-dcache-store-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
#...
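Any of these events can be counted directly, e.g. with perf stat -e cache-misses,instructions ./my-slow-binary.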
#include <vector>
#include "omp.h"
struct item{
    std::vector<float> position, momentum;
    std::vector<int> nearest_neighbors;
};
// generate, update, calculate, timelapse and threshold are elided for brevity
int main(int argc, char** argv){
    std::vector<item> world = generate(argc*10e6);
    for(int& time_step : timelapse){
        update(world);
        #pragma omp parallel for
        for(item& it : world){
            for(int& index : it.nearest_neighbors){
                auto distance = calculate(it, world[index]);
                if(distance > threshold)
                    it.nearest_neighbors.remove(index); // sketch: erase-while-iterating elided
            }}}
    //..
}
hypotheses:
large 'unpredictable' jumps in memory access diminish cache bandwidth
false sharing forces cache line reloads as read-only and writable items may share the same cache line
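A quick toy to make the second hypothesis falsifiable (hypothetical example, not the simulation code above; 64-byte cache lines and OpenMP assumed, compile with -fopenmp):

#include <omp.h>
#include <iostream>

// both counters live in one cache line: every increment
// invalidates the line in the other core's cache
struct packed { long a; long b; };
// padding (one cache line assumed to be 64 bytes) separates them
struct padded { long a; char pad[64]; long b; };

template <typename S>
void bump(S& s, long n){
    #pragma omp parallel sections
    {
        #pragma omp section
        for(long i = 0; i < n; ++i) s.a++;
        #pragma omp section
        for(long i = 0; i < n; ++i) s.b++;
    }
}

int main(){
    packed p{}; padded q{};
    double t0 = omp_get_wtime(); bump(p, 1L << 28);
    double t1 = omp_get_wtime(); bump(q, 1L << 28);
    double t2 = omp_get_wtime();
    std::cout << "shared line: " << t1 - t0 << " s, padded: " << t2 - t1 << " s\n";
    return (int)((p.a + q.b) % 2);
}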
Let's measure!
Use Case
$ export OMP_NUM_THREADS=1
$ path/to/likwid-perfctr -C 0 -g FALSE_SHARE ./my_app
+----------------------------------+--------------+
|              Metric              |    Core 0    |
+----------------------------------+--------------+
| Local LLC false sharing [MByte]  |       0.0008 |
| Local LLC false sharing rate     | 5.608215e-10 |
+----------------------------------+--------------+
$ export OMP_NUM_THREADS=4
$ path/to/likwid-perfctr -C 0-4 -g FALSE_SHARE ./my_app
+---------------------------------------+--------------+
|                Metric                 |     Sum      |
+---------------------------------------+--------------+
| Local LLC false sharing [MByte] STAT  |    2973.7637 |
| Local LLC false sharing rate STAT     |       0.0081 |
+---------------------------------------+--------------+
stream benchmark as reference
$ export OMP_NUM_THREADS=1
$ path/to/likwid-perfctr -C 0 -g FALSE_SHARE ./stream
+----------------------------------+--------------+
|              Metric              |    Core 0    |
+----------------------------------+--------------+
| Local LLC false sharing [MByte]  |       0.0006 |
| Local LLC false sharing rate     | 6.057282e-10 |
+----------------------------------+--------------+
$ export OMP_NUM_THREADS=4
$ path/to/likwid-perfctr -C 0-4 -g FALSE_SHARE ./stream
+---------------------------------------+--------------+
|                Metric                 |     Sum      |
+---------------------------------------+--------------+
| Local LLC false sharing [MByte] STAT  |       0.1067 |
| Local LLC false sharing rate STAT     | 4.080027e-07 |
+---------------------------------------+--------------+
LIKWID Marker API (alternative: PAPI)
Instrumented version of stream's triad
#include "omp.h"
#include "likwid.h"
#define ITER 100
#define SIZE 40000000
int main(int argc, char* argv[])
{
LIKWID_MARKER_INIT;
//allocate and initialize a, b, c and d
#pragma omp parallel
{
for (int k = 0; k < ITER; k++)
{
LIKWID_MARKER_START("triad");
#pragma simd
#pragma omp for
for (int j = 0; j < SIZE; j++)
{
a[j] = b[j] + c[j] * d[j];
}
LIKWID_MARKER_STOP("triad");
}
}
// check result in a and free a, b, c and d
LIKWID_MARKER_CLOSE;
return 0;
}
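Note: the markers compile to empty statements unless the code is built with -DLIKWID_PERFMON and linked against liblikwid.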
$ likwid-perfctr -C 0 -g L3 -m ./a.out
[...]
Region triad, Group 1: L3
+-------------------+----------+
| Region Info | Core 0 |
+-------------------+----------+
| RDTSC Runtime [s] | 0.846960 |
| call count | 100 |
+-------------------+----------+
[...raw counter values...]
+-------------------------------+------------+
| Metric | Core 0 |
+-------------------------------+------------+
| Runtime (RDTSC) [s] | 0.8470 |
| Runtime unhalted [s] | 0.8451 |
| L3 load bandwidth [MBytes/s] | 15121.3486 |
| L3 evict bandwidth [MBytes/s] | 3298.9553 |
| L3 bandwidth [MBytes/s] | 18420.3040 |
+-------------------------------+------------+
[...more results...]
#include <chrono>
#include <iostream>
#include "production_code.hpp"
#include "new_ideas.hpp"
int main(int argc, char** argv){
auto start = std::chrono::high_resolution_clock::now();
auto result = production_code::algorithm();
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_p = (end - start);
start = std::chrono::high_resolution_clock::now();
auto new_result = new_ideas::algorithm();
end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_i = (end - start);
std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count()
<< std::endl;
return 0;
}
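One caveat on the clock itself: std::chrono::high_resolution_clock is not guaranteed to be steady; for interval measurements std::chrono::steady_clock is the safer choice.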
#include <chrono>
#include <iostream>
#include "production_code.hpp"
#include "new_ideas.hpp"
int main(int argc, char** argv){
auto start = std::chrono::high_resolution_clock::now();
auto result = production_code::algorithm();
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_p = (end - start);
start = std::chrono::high_resolution_clock::now();
auto new_result = new_ideas::algorithm();
end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_i = (end - start);
if(result == new_result)
std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count()
<< std::endl;
else
std::cout << "Never mind!" << std::endl;
}
#include ...
int main(int argc, char** argv){
auto result = 0;
auto new_result = 0;
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < n_repetitions; ++i)
result = production_code::algorithm();
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_p = (end - start);
start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < n_repetitions; ++i)
new_result = new_ideas::algorithm();
end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_i = (end - start);
if(result == new_result)
std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count()
<< std::endl;
else
std::cout << "Never mind!" << std::endl;
}
#include ...
using duration_t = std::chrono::duration<double>;
int main(int argc, char** argv){
//..
auto start = std::chrono::high_resolution_clock::now();
auto end = start;
std::vector<duration_t> prod_timings(n_repetitions), idea_timings(n_repetitions);
for(int i = 0; i < n_repetitions; ++i) {
start = std::chrono::high_resolution_clock::now();
result = production_code::algorithm();
prod_timings[i] = std::chrono::high_resolution_clock::now() - start;
}
// similar with new_ideas::algorithm() and idea_timings
if(result == new_result) {
std::ofstream ofile("results.csv"); // the constructor already opens the file
ofile << "#iteration,production,new_idea" << std::endl;
for(int i = 0; i < n_repetitions; ++i) {
ofile << i << "," << prod_timings[i].count() << "," << idea_timings[i].count() << std::endl;
}
ofile.close();
}
else
std::cout << "Never mind!" << std::endl;
}
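With per-repetition samples at hand, report the distribution instead of a single number. A minimal summary sketch (assumes the duration_t alias and the timing vectors from the listing above):

#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

// pass by value so sorting does not disturb the caller's data;
// the minimum is the least-disturbed run, the median is robust to outliers
void summarize(std::vector<duration_t> timings, const char* label){
    std::sort(timings.begin(), timings.end());
    std::cout << label << ": min " << timings.front().count()
              << " s, median " << timings[timings.size()/2].count()
              << " s" << std::endl;
}
// usage: summarize(prod_timings, "production"); summarize(idea_timings, "new idea");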
#include <benchmark/benchmark.h>
#include <vector>
template <typename T>
double sum(const T* _data, std::size_t _len){
double value = 0;
for(std::size_t i = 0; i < _len; ++i)
value += _data[i];
return value;
}
template <typename container_type>
double sum(const container_type& _data){
typedef typename container_type::value_type value_t;
double value = 0;
for(const value_t& el : _data)
value += el;
return value;
}
static void BM_integer_index(benchmark::State& state) {
const std::size_t len = 1 << 20;
std::vector<int> values(len, 0);
double result = 0;
for (auto _ : state){
benchmark::DoNotOptimize(result = sum(values.data(), len));
}
}
// Register the function as a benchmark
BENCHMARK(BM_integer_index);
static void BM_range_based(benchmark::State& state) {
const std::size_t len = 1 << 20;
std::vector<int> values(len, 0);
double result = 0;
for (auto _ : state){
benchmark::DoNotOptimize(result = sum(values));
}
}
BENCHMARK(BM_range_based);
BENCHMARK_MAIN();
Analysis by Matt Godbolt at CppCon 2017
Run on (4 X 3600 MHz CPU s)
2017-11-08 10:24:43
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------
BM_integer_index 922920 ns 915531 ns 764
BM_range_based 937344 ns 929681 ns 768
template <typename T>
static void BM_integer_index(benchmark::State& state) {
const std::size_t len = state.range(0);
std::vector<T> values(len, 0.f);
double result = 0;
for (auto _ : state){
benchmark::DoNotOptimize(result = sum(values.data(), len));
}
}
BENCHMARK_TEMPLATE(BM_integer_index,int)
->Arg(1<<20)
->Arg(128<<20);
BENCHMARK_TEMPLATE(BM_integer_index,float)
->Arg(1<<20)
->Arg(128<<20);
BENCHMARK_MAIN();
Multiple arguments are also supported:
BENCHMARK_TEMPLATE(BM_integer_index,int)
//42 is the initial value of the reduced sum
->Args({64, 42})
//..
;
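Inside the benchmark body the extra argument arrives via state.range(1); a sketch along the lines of the templated benchmark above:

template <typename T>
static void BM_integer_index(benchmark::State& state) {
  const std::size_t len = state.range(0); // first argument: vector length
  const double init = state.range(1);     // second argument: initial sum value
  std::vector<T> values(len, 0);
  double result = 0;
  for (auto _ : state){
    benchmark::DoNotOptimize(result = init + sum(values.data(), len));
  }
}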
workflow:
Run on (4 X 3600 MHz CPU s)
2017-11-08 10:25:27
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------
BM_integer_index<int>/1048576 924382 ns 916717 ns 761
BM_integer_index<int>/134217728 123700290 ns 122614766 ns 6
BM_integer_index<float>/1048576 914593 ns 909174 ns 763
BM_integer_index<float>/134217728 122954355 ns 122219776 ns 6
BM_range_based<int>/1048576 912475 ns 907277 ns 761
BM_range_based<int>/134217728 122509880 ns 121832332 ns 6
BM_range_based<float>/1048576 917501 ns 912365 ns 735
BM_range_based<float>/134217728 122908318 ns 122219268 ns 6
ECM model: { T_OL || T_nOL | T_L1L2 | T_L2L3 | T_L3Mem }
Model: { 57 || 54 | 40 | 24 | 50 } cy/CL
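Reading the model: the in-memory prediction is max(T_OL, T_nOL + T_L1L2 + T_L2L3 + T_L3Mem) = max(57, 54 + 40 + 24 + 50) = 168 cy/CL, and since only 50 of those cycles are memory transfers, the memory bandwidth saturates at roughly 168/50 ≈ 3.3 cores — matching the kerncraft output below.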
$ cat /tmp/3d.c
double U[M][N][N];
double V[M][N][N];
double ROC[M][N][N];
double c0, c1, c2, c3, c4, lap;
for(int k=4; k < M-4; k++) {
  for(int j = 4; j < N-4; j++) {
    for(int i = 4; i < N-4; i++) {
lap = c0 * V[k][j][i]
+ c1 * ( V[ k ][ j ][i+1] + V[ k ][ j ][i-1])
[...10 lines...]
+ c4 * ( V[k+4][ j ][ i ] + V[k-4][ j ][ i ]);
U[k][j][i] = 2.f * V[k][j][i] - U[k][j][i]
+ ROC[k][j][i] * lap; }}}
$ kerncraft -p Roofline -m SandyBridgeEP_E5-2680.yml \
/tmp/3d.c -D N 1000 -D M 1000
====================== kerncraft =======================
/tmp/3d.c -m SandyBridgeEP_E5-2680.yml
-D N 1000 -D M 1000
------------------------------ Roofline ----------------
Cache or mem bound with 1 core(s)
7.43 GFLOP/s due to MEM transfer bottleneck
Arithmetic Intensity: 0.43 FLOP/B
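Cross-check: roofline predicts P = min(P_peak, AI * b_S); with AI = 0.43 FLOP/B, the 7.43 GFLOP/s ceiling corresponds to a sustained memory bandwidth of about 7.43/0.43 ≈ 17 GB/s.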
$ kerncraft -p ECM -m SandyBridgeEP_E5-2680.yml \
/tmp/3d.c -D N 1000 -D M 1000
=========================== kerncraft ==================
/tmp/3d.c -m SandyBridgeEP_E5-2680.yml
-D N 1000 -D M 1000
------------------------------------- ECM --------------
{ 57.0 || 54.0 | 40.0 | 24.0 | 50.3 } cy/CL
{ 56.95 \ 94.0 \ 118.0 \ 168.3 } cy/CL
saturating at 3.3 cores
Take a balloon:
Use tools to check the lay of the land.
Falsify the rubber duck:
Profile and check your hypothesis.
Survive the dungeon:
With automated, ensemble-based benchmarks.
$ g++ -pg -O2 -std=c++11 vector_unroll_example.cpp
$ ./a.out
$ gprof ./a.out gmon.out > analysis.txt
$ head analysis.txt
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
26.71 1.02 1.02 void my_axpy<6u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
26.71 2.05 1.02 void my_axpy<2u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
23.83 2.96 0.91 void my_axpy<8u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
23.04 3.84 0.88 void my_axpy<4u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
0.00 3.84 0.00 1 0.00 0.00 _GLOBAL__sub_I_main
Profile from Peter Gottschling's example on vector unrolling.
DAXPY optimized for SSE (simplified)
STREAMS 2 // named STR0 and STR1
TYPE DOUBLE
FLOPS 2 // 2 flops per iter
BYTES 24 // use 24 bytes per iter
DESC y := alpha * x + y
INSTR_LOOP 4 // 4 instructions per iter
UOPS 5 // 5 micro-ops per iter
movaps FPR7, [rip+SCALAR] // random value for register
LOOP 16 // inc GPR1 by 16
movaps FPR1, [STR0 + GPR1*8]
mulpd FPR1, FPR7
addpd FPR1, [STR1 + GPR1*8]
movaps [STR1 + GPR1*8], FPR1
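Consistency check on the descriptor: daxpy does 2 flops (mulpd + addpd) and moves 24 bytes (load x, load y, store y) per scalar update, so 14579.71 MByte/s / 24 B * 2 flops ≈ 1215 MFlops/s — exactly the ratio likwid-bench reports below.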
$ likwid-bench -t daxpy_sse -w N:100MB:1
---------------------------------------------------
LIKWID MICRO BENCHMARK
Test: daxpy_sse
---------------------------------------------------
Using 1 work groups
Using 1 threads
---------------------------------------------------
Cycles: 5278051829
Time: 1.316899e+00 sec
[...]
MFlops/s: 1214.98
MByte/s: 14579.71
[...]
Instructions: 1900000017
UOPs: 2600000000
-----------------------------------------------------
$ likwid-perfctr -C 0 -g FLOPS_DP -m \
likwid-bench -t daxpy_sse -w N:100MB:1
---------------------------------------------------
bench:
MFlops/s: 1213.88
Instructions: 1900000017
---------------------------------------------------
perfctr:
DP MFLOP/s 1213.8804
INSTR_RETIRED_ANY 1900008000
---------------------------------------------------
T. Hoefler et al, "Scientific Benchmarking of Parallel Computing Systems - Twelve ways to tell the masses when reporting performance results", SC '15 Proceedings, 2015
Problem also in other fields: AI
Can't this be automated?