Thomas Röhl
Thomas.Roehl@fau.de
Peter Steinbach
steinbach@scionics.de
HPC group at Regional Computing Center Erlangen
Prof. Dr. Gerhard Wellein
Friedrich-Alexander University Erlangen-Nuremberg
Original talk by Peter Steinbach (Scientific Software Engineer, Scionics Computer Innovation GmbH, Dresden)
Scionics provides data-driven consulting.
All of my slides assume that the code produces correct results!
Nobody wants fast code that is wrong!
From: doe@theinstitute.de
Subject: Cluster is slow
Date: Fri, 20 Oct 2017 12:03:21 +0200
To: hpcsupport@theinstitute.de
Hi,
what is going on with the cluster? My application is running
slow since yesterday.
Could you have a look at it please?
Thanks,
John
Add hardware counter data: PCM (Intel Performance Counter Monitor)
$ dd if=/dev/zero of=/tmp/just_zeros bs=1G count=2
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 2.94478 s, 729 MB/s
$ dd if=/dev/zero of=/dev/shm/2gb.zeros bs=1G count=2
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 1.14782 s, 1.9 GB/s
What can your hardware typically do?
dd, ior, memhog, stream, LIKWID, ...
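If no benchmark suite is at hand, even a hand-rolled triad gives a rough baseline. A minimal sketch (single shot, no warm-up, array size chosen arbitrarily; the real stream or likwid-bench are the proper tools; compile with -O2):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main(){
    const std::size_t n = 1u << 25; // 32 Mi doubles = 256 MiB per array
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0), d(n, 3.0);
    auto start = std::chrono::high_resolution_clock::now();
    for(std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i] * d[i];
    std::chrono::duration<double> t = std::chrono::high_resolution_clock::now() - start;
    // the triad touches 4 streams x 8 bytes = 32 bytes per iteration
    std::cout << (32.0 * n / t.count()) / 1e9 << " GB/s (" << a[0] << ")\n";
    return 0;
}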
$ perf record -g ./my-slow-binary
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.023 MB perf.data (75 samples) ]
$ perf report --stdio
no symbols found in ./my-slow-binary, maybe install a debug package?
# ...
# Total Lost Samples: 0
#
# Samples: 75 of event 'cycles:u'
# Event count (approx.): 1839654
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... ................. ................
#
20.18% 20.18% dd [kernel.kallsyms] [k] page_fault
|
--19.77%--0
_int_realloc
page_fault
$ perf record -g ./my-slow-binary
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.023 MB perf.data (75 samples) ]
$ perf script > out.perf
$ ./stackcollapse-perf.pl out.perf > out.folded
$ ./flamegraph.pl out.folded > perf_samples.svg
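stackcollapse-perf.pl and flamegraph.pl are part of Brendan Gregg's FlameGraph scripts (https://github.com/brendangregg/FlameGraph).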
Taking a balloon to get an overview of performance bottlenecks is possible.
Profile from Peter Gottschling's example on vector unrolling.
$ CXX=clang++ make   # assumes the Makefile adds clang's -fxray-instrument
$ export XRAY_OPTIONS="patch_premain=true xray_mode=xray-basic verbosity=1"
$ ./vector_unroll_example
==31936==XRay: Log file in 'xray-log.vector_unroll_example.ju4PNk'
Compute time native loop is 0.159 micros.
u[0] is 15
#...
$ llvm-xray account xray-log.vector_unroll_example.ju4PNk -instr_map=./vector_unroll_example
Functions with latencies: 5
funcid count [ min, med, 90p, ...] sum function
1 1 [ 7.338530, 7.338530, 7.338530, ...] 7.338530 <invalid>:0:0: main
2 1275 [ 0.000005, 0.000011, 0.000012, ...] 0.013064 <invalid>:0:0: void my_axpy<2u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
Integration into other tools has started: Vampir
$ perf list
List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
#...
L1-dcache-load-misses [Hardware cache event]
L1-dcache-loads [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
L1-dcache-store-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
#...
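Any of these events can be counted directly, e.g. with perf stat -e cache-misses,instructions ./my-slow-binary.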
#include <vector>
#include "omp.h"
struct item{
    std::vector<float> position, momentum;
    std::vector<int> nearest_neighbors;
};
// generate, update, calculate, timelapse and threshold are elided for brevity
int main(int argc, char** argv){
    std::vector<item> world = generate(argc*10e6);
    for(int& time_step : timelapse){
        update(world);
        #pragma omp parallel for
        for(item& it : world){
            for(int& index : it.nearest_neighbors){
                auto distance = calculate(it, world[index]);
                if(distance > threshold)
                    it.nearest_neighbors.remove(index); // sketch: erase-while-iterating elided
            }}}
    //..
}
hypotheses:
large 'unpredictable' jumps in memory access diminish cache bandwidth
false sharing forces cache line reloads as read-only and writable items may share the same cache line
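A quick toy to make the second hypothesis falsifiable (hypothetical example, not the simulation code above; 64-byte cache lines and OpenMP assumed, compile with -fopenmp):

#include <omp.h>
#include <iostream>

// both counters live in one cache line: every increment
// invalidates the line in the other core's cache
struct packed { long a; long b; };
// padding (one cache line assumed to be 64 bytes) separates them
struct padded { long a; char pad[64]; long b; };

template <typename S>
void bump(S& s, long n){
    #pragma omp parallel sections
    {
        #pragma omp section
        for(long i = 0; i < n; ++i) s.a++;
        #pragma omp section
        for(long i = 0; i < n; ++i) s.b++;
    }
}

int main(){
    packed p{}; padded q{};
    double t0 = omp_get_wtime(); bump(p, 1L << 28);
    double t1 = omp_get_wtime(); bump(q, 1L << 28);
    double t2 = omp_get_wtime();
    std::cout << "shared line: " << t1 - t0 << " s, padded: " << t2 - t1 << " s\n";
    return (int)((p.a + q.b) % 2);
}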
Let's measure!
Use Case
$ export OMP_NUM_THREADS=1
$ path/to/likwid-perfctr -C 0 -g FALSE_SHARE ./my_app
+----------------------------------+--------------+
|              Metric              |    Core 0    |
+----------------------------------+--------------+
| Local LLC false sharing [MByte]  |       0.0008 |
| Local LLC false sharing rate     | 5.608215e-10 |
+----------------------------------+--------------+
$ export OMP_NUM_THREADS=4
$ path/to/likwid-perfctr -C 0-4 -g FALSE_SHARE ./my_app
+---------------------------------------+--------------+
|                Metric                 |     Sum      |
+---------------------------------------+--------------+
| Local LLC false sharing [MByte] STAT  |    2973.7637 |
| Local LLC false sharing rate STAT     |       0.0081 |
+---------------------------------------+--------------+
stream benchmark as reference
$ export OMP_NUM_THREADS=1
$ path/to/likwid-perfctr -C 0 -g FALSE_SHARE ./stream
+----------------------------------+--------------+
|              Metric              |    Core 0    |
+----------------------------------+--------------+
| Local LLC false sharing [MByte]  |       0.0006 |
| Local LLC false sharing rate     | 6.057282e-10 |
+----------------------------------+--------------+
$ export OMP_NUM_THREADS=4
$ path/to/likwid-perfctr -C 0-4 -g FALSE_SHARE ./stream
+---------------------------------------+--------------+
|                Metric                 |     Sum      |
+---------------------------------------+--------------+
| Local LLC false sharing [MByte] STAT  |       0.1067 |
| Local LLC false sharing rate STAT     | 4.080027e-07 |
+---------------------------------------+--------------+
LIKWID Marker API (alternative: PAPI)
Instrumented version of stream's triad
#include "omp.h"
#include "likwid.h"
#define ITER 100
#define SIZE 40000000
int main(int argc, char* argv[])
{
LIKWID_MARKER_INIT;
//allocate and initialize a, b, c and d
#pragma omp parallel
{
for (int k = 0; k < ITER; k++)
{
LIKWID_MARKER_START("triad");
#pragma simd
#pragma omp for
for (int j = 0; j < SIZE; j++)
{
a[j] = b[j] + c[j] * d[j];
}
LIKWID_MARKER_STOP("triad");
}
}
// check result in a and free a, b, c and d
LIKWID_MARKER_CLOSE;
return 0;
}
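Note: the markers compile to empty statements unless the code is built with -DLIKWID_PERFMON and linked against liblikwid.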
$ likwid-perfctr -C 0 -g L3 -m ./a.out
[...]
Region triad, Group 1: L3
+-------------------+----------+
| Region Info | Core 0 |
+-------------------+----------+
| RDTSC Runtime [s] | 0.846960 |
| call count | 100 |
+-------------------+----------+
[...raw counter values...]
+-------------------------------+------------+
| Metric | Core 0 |
+-------------------------------+------------+
| Runtime (RDTSC) [s] | 0.8470 |
| Runtime unhalted [s] | 0.8451 |
| L3 load bandwidth [MBytes/s] | 15121.3486 |
| L3 evict bandwidth [MBytes/s] | 3298.9553 |
| L3 bandwidth [MBytes/s] | 18420.3040 |
+-------------------------------+------------+
[...more results...]
#include <chrono>
#include <iostream>
#include "production_code.hpp"
#include "new_ideas.hpp"
int main(int argc, char** argv){
auto start = std::chrono::high_resolution_clock::now();
auto result = production_code::algorithm();
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_p = (end - start);
start = std::chrono::high_resolution_clock::now();
auto new_result = new_ideas::algorithm();
end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_i = (end - start);
std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count()
<< std::endl;
return 0;
}
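One caveat on the clock itself: std::chrono::high_resolution_clock is not guaranteed to be steady; for interval measurements std::chrono::steady_clock is the safer choice.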
#include <chrono>
#include <iostream>
#include "production_code.hpp"
#include "new_ideas.hpp"
int main(int argc, char** argv){
auto start = std::chrono::high_resolution_clock::now();
auto result = production_code::algorithm();
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_p = (end - start);
start = std::chrono::high_resolution_clock::now();
auto new_result = new_ideas::algorithm();
end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_i = (end - start);
if(result == new_result)
std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count()
<< std::endl;
else
std::cout << "Never mind!" << std::endl;
}
#include ...
int main(int argc, char** argv){
auto result = 0;
auto new_result = 0;
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < n_repetitions; ++i)
result = production_code::algorithm();
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_p = (end - start);
start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < n_repetitions; ++i)
new_result = new_ideas::algorithm();
end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_i = (end - start);
if(result == new_result)
std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count()
<< std::endl;
else
std::cout << "Never mind!" << std::endl;
}
#include ...
using duration_t = std::chrono::duration<double>;
int main(int argc, char** argv){
//..
auto start = std::chrono::high_resolution_clock::now();
auto end = start;
std::vector<duration_t> prod_timings(n_repetitions), idea_timings(n_repetitions);
for(int i = 0; i < n_repetitions; ++i) {
start = std::chrono::high_resolution_clock::now();
result = production_code::algorithm();
prod_timings[i] = std::chrono::high_resolution_clock::now() - start;
}
// similar with new_ideas::algorithm() and idea_timings
if(result == new_result) {
std::ofstream ofile("results.csv"); // the constructor already opens the file
ofile << "#iteration,production,new_idea" << std::endl;
for(int i = 0; i < n_repetitions; ++i) {
ofile << i << "," << prod_timings[i].count() << "," << idea_timings[i].count() << std::endl;
}
ofile.close();
}
else
std::cout << "Never mind!" << std::endl;
}
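With per-repetition samples at hand, report the distribution instead of a single number. A minimal summary sketch (assumes the duration_t alias and the timing vectors from the listing above):

#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

// pass by value so sorting does not disturb the caller's data;
// the minimum is the least-disturbed run, the median is robust to outliers
void summarize(std::vector<duration_t> timings, const char* label){
    std::sort(timings.begin(), timings.end());
    std::cout << label << ": min " << timings.front().count()
              << " s, median " << timings[timings.size()/2].count()
              << " s" << std::endl;
}
// usage: summarize(prod_timings, "production"); summarize(idea_timings, "new idea");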
#include <benchmark/benchmark.h>
#include <vector>
template <typename T>
double sum(const T* _data, std::size_t _len){
double value = 0;
for(std::size_t i = 0; i < _len; ++i)
value += _data[i];
return value;
}
template <typename container_type>
double sum(const container_type& _data){
typedef typename container_type::value_type value_t;
double value = 0;
for(const value_t& el : _data)
value += el;
return value;
}
static void BM_integer_index(benchmark::State& state) {
const std::size_t len = 1 << 20;
std::vector<int> values(len, 0);
double result = 0;
for (auto _ : state){
benchmark::DoNotOptimize(result = sum(values.data(), len));
}
}
// Register the function as a benchmark
BENCHMARK(BM_integer_index);
static void BM_range_based(benchmark::State& state) {
const std::size_t len = 1 << 20;
std::vector<int> values(len, 0);
double result = 0;
for (auto _ : state){
benchmark::DoNotOptimize(result = sum(values));
}
}
BENCHMARK(BM_range_based);
BENCHMARK_MAIN();
Analysis by Matt Godbolt at CppCon 2017
Run on (4 X 3600 MHz CPU s)
2017-11-08 10:24:43
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------
BM_integer_index 922920 ns 915531 ns 764
BM_range_based 937344 ns 929681 ns 768
template <typename T>
static void BM_integer_index(benchmark::State& state) {
const std::size_t len = state.range(0);
std::vector<T> values(len, 0.f);
double result = 0;
for (auto _ : state){
benchmark::DoNotOptimize(result = sum(values.data(), len));
}
}
BENCHMARK_TEMPLATE(BM_integer_index,int)
->Arg(1<<20)
->Arg(128<<20);
BENCHMARK_TEMPLATE(BM_integer_index,float)
->Arg(1<<20)
->Arg(128<<20);
BENCHMARK_MAIN();
Multiple arguments are also supported:
BENCHMARK_TEMPLATE(BM_integer_index,int)
//42 is the initial value of the reduced sum
->Args({64, 42})
//..
;
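Inside the benchmark body the extra argument arrives via state.range(1); a sketch along the lines of the templated benchmark above:

template <typename T>
static void BM_integer_index(benchmark::State& state) {
  const std::size_t len = state.range(0); // first argument: vector length
  const double init = state.range(1);     // second argument: initial sum value
  std::vector<T> values(len, 0);
  double result = 0;
  for (auto _ : state){
    benchmark::DoNotOptimize(result = init + sum(values.data(), len));
  }
}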
workflow:
Run on (4 X 3600 MHz CPU s)
2017-11-08 10:25:27
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------
BM_integer_index<int>/1048576 924382 ns 916717 ns 761
BM_integer_index<int>/134217728 123700290 ns 122614766 ns 6
BM_integer_index<float>/1048576 914593 ns 909174 ns 763
BM_integer_index<float>/134217728 122954355 ns 122219776 ns 6
BM_range_based<int>/1048576 912475 ns 907277 ns 761
BM_range_based<int>/134217728 122509880 ns 121832332 ns 6
BM_range_based<float>/1048576 917501 ns 912365 ns 735
BM_range_based<float>/134217728 122908318 ns 122219268 ns 6
ECM model: { T_OL || T_nOL | T_L1L2 | T_L2L3 | T_L3Mem }
Model: { 57 || 54 | 40 | 24 | 50 } cy/CL
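Reading the model: the in-memory prediction is max(T_OL, T_nOL + T_L1L2 + T_L2L3 + T_L3Mem) = max(57, 54 + 40 + 24 + 50) = 168 cy/CL, and since only 50 of those cycles are memory transfers, the memory bandwidth saturates at roughly 168/50 ≈ 3.3 cores — matching the kerncraft output below.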
$ cat /tmp/3d.c
double U[M][N][N];
double V[M][N][N];
double ROC[M][N][N];
double c0, c1, c2, c3, c4, lap;
for(int k=4; k < M-4; k++) {
  for(int j = 4; j < N-4; j++) {
    for(int i = 4; i < N-4; i++) {
lap = c0 * V[k][j][i]
+ c1 * ( V[ k ][ j ][i+1] + V[ k ][ j ][i-1])
[...10 lines...]
+ c4 * ( V[k+4][ j ][ i ] + V[k-4][ j ][ i ]);
U[k][j][i] = 2.f * V[k][j][i] - U[k][j][i]
+ ROC[k][j][i] * lap; }}}
$ kerncraft -p Roofline -m SandyBridgeEP_E5-2680.yml \
/tmp/3d.c -D N 1000 -D M 1000
====================== kerncraft =======================
/tmp/3d.c -m SandyBridgeEP_E5-2680.yml
-D N 1000 -D M 1000
------------------------------ Roofline ----------------
Cache or mem bound with 1 core(s)
7.43 GFLOP/s due to MEM transfer bottleneck
Arithmetic Intensity: 0.43 FLOP/B
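Cross-check: roofline predicts P = min(P_peak, AI * b_S); with AI = 0.43 FLOP/B, the 7.43 GFLOP/s ceiling corresponds to a sustained memory bandwidth of about 7.43/0.43 ≈ 17 GB/s.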
$ kerncraft -p ECM -m SandyBridgeEP_E5-2680.yml \
/tmp/3d.c -D N 1000 -D M 1000
=========================== kerncraft ==================
/tmp/3d.c -m SandyBridgeEP_E5-2680.yml
-D N 1000 -D M 1000
------------------------------------- ECM --------------
{ 57.0 || 54.0 | 40.0 | 24.0 | 50.3 } cy/CL
{ 56.95 \ 94.0 \ 118.0 \ 168.3 } cy/CL
saturating at 3.3 cores
Take a balloon:
Use tools to check the lay of the land.
Falsify the rubber duck:
Profile and check your hypothesis.
Survive the dungeon:
With automated, ensemble-based benchmarks.
$ g++ -pg -O2 -std=c++11 vector_unroll_example.cpp
$ ./a.out
$ gprof ./a.out gmon.out > analysis.txt
$ head analysis.txt
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
26.71 1.02 1.02 void my_axpy<6u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
26.71 2.05 1.02 void my_axpy<2u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
23.83 2.96 0.91 void my_axpy<8u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
23.04 3.84 0.88 void my_axpy<4u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
0.00 3.84 0.00 1 0.00 0.00 _GLOBAL__sub_I_main
Profile from Peter Gottschling's example on vector unrolling.
DAXPY optimized for SSE (simplified)
STREAMS 2 // named STR0 and STR1
TYPE DOUBLE
FLOPS 2 // 2 flops per iter
BYTES 24 // use 24 bytes per iter
DESC y := alpha * x + y
INSTR_LOOP 4 // 4 instructions per iter
UOPS 5 // 5 micro-ops per iter
movaps FPR7, [rip+SCALAR] // random value for register
LOOP 16 // inc GPR1 by 16
movaps FPR1, [STR0 + GPR1*8]
mulpd FPR1, FPR7
addpd FPR1, [STR1 + GPR1*8]
movaps [STR1 + GPR1*8], FPR1
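Consistency check on the descriptor: daxpy does 2 flops (mulpd + addpd) and moves 24 bytes (load x, load y, store y) per scalar update, so 14579.71 MByte/s / 24 B * 2 flops ≈ 1215 MFlops/s — exactly the ratio likwid-bench reports below.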
$ likwid-bench -t daxpy_sse -w N:100MB:1
---------------------------------------------------
LIKWID MICRO BENCHMARK
Test: daxpy_sse
---------------------------------------------------
Using 1 work groups
Using 1 threads
---------------------------------------------------
Cycles: 5278051829
Time: 1.316899e+00 sec
[...]
MFlops/s: 1214.98
MByte/s: 14579.71
[...]
Instructions: 1900000017
UOPs: 2600000000
-----------------------------------------------------
$ likwid-perfctr -C 0 -g FLOPS_DP -m \
likwid-bench -t daxpy_sse -w N:100MB:1
---------------------------------------------------
bench:
MFlops/s: 1213.88
Instructions: 1900000017
---------------------------------------------------
perfctr:
DP MFLOP/s 1213.8804
INSTR_RETIRED_ANY 1900008000
---------------------------------------------------
T. Hoefler et al, "Scientific Benchmarking of Parallel Computing Systems - Twelve ways to tell the masses when reporting performance results", SC '15 Proceedings, 2015
Problem also in other fields: AI
Can't this be automated?