Peter Steinbach
steinbach@scionics.de
Scionics Computer Innovation GmbH
All of my slides assume that the code produces correct results!
Nobody wants fast code that is wrong!
From: doe@theinstitute.de
Subject: Cluster is slow
Date: Fri, 20 Oct 2017 12:03:21 +0200
To: hpcsupport@theinstitute.de
Hi,
what is going on with the cluster? My application has been
running slowly since yesterday.
Could you have a look at it please?
Thanks,
John
$ dd if=/dev/zero of=/tmp/just_zeros bs=1G count=2
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 2.94478 s, 729 MB/s
$ dd if=/dev/zero of=/dev/shm/2gb.zeros bs=1G count=2
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 1.14782 s, 1.9 GB/s
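The same tool gives read bandwidth: reading the file back with iflag=direct (a standard GNU dd flag) bypasses the page cache, so the device rather than RAM is measured. A sketch:
$ dd if=/tmp/just_zeros of=/dev/null bs=1G count=2 iflag=direct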
What can your hardware typically do?
$ perf record -g ./my-slow-binary
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.023 MB perf.data (75 samples) ]
$ perf report --stdio
no symbols found in /usr/bin/dd, maybe install a debug package?
# ...
# Total Lost Samples: 0
#
# Samples: 75 of event 'cycles:u'
# Event count (approx.): 1839654
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... ................. ................
#
20.18% 20.18% dd [kernel.kallsyms] [k] page_fault
|
--19.77%--0
_int_realloc
page_fault
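Once a hot symbol is known, perf can go one level deeper: perf annotate shows the samples per instruction (a sketch; annotating kernel symbols like the one above additionally requires a vmlinux with debug info):
$ perf annotate page_fault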
$ perf record -g ./my-slow-binary
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.023 MB perf.data (75 samples) ]
$ perf script > out.perf
$ ./stackcollapse-perf.pl out.perf > out.folded
$ ./flamegraph.pl out.folded > perf_samples.svg
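The two Perl scripts are not part of perf itself; they come from Brendan Gregg's FlameGraph repository, and the resulting perf_samples.svg can be opened in any browser:
$ git clone https://github.com/brendangregg/FlameGraph.git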
Taking a balloon to get an overview of performance bottlenecks is possible: that is exactly what a flame graph provides.
$ g++ -pg -O2 -std=c++11 vector_unroll_example.cpp
$ ./a.out
$ gprof ./a.out gmon.out > analysis.txt
$ head analysis.txt
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 26.71      1.02     1.02                             void my_axpy<6u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
 26.71      2.05     1.02                             void my_axpy<2u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
 23.83      2.96     0.91                             void my_axpy<8u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
 23.04      3.84     0.88                             void my_axpy<4u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
  0.00      3.84     0.00        1     0.00     0.00  _GLOBAL__sub_I_main
Profile from Peter Gottschling's example on vector unrolling.
$ perf list
List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
#...
L1-dcache-load-misses [Hardware cache event]
L1-dcache-loads [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
L1-dcache-store-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
#...
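Any event from this list can be handed to perf stat as a comma-separated -e list (standard perf usage; ./my-slow-binary again stands in for the program under study):
$ perf stat -e cycles,instructions,cache-references,cache-misses ./my-slow-binary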
#include <algorithm>
#include <vector>
#include <omp.h>

struct item{
  std::vector<float> position, momentum;
  std::vector<int> nearest_neighbors;
};

int main(int argc, char** argv){
  // generate, timelapse, update, calculate, threshold: defined elsewhere
  std::vector<item> world = generate(argc*10e6);
  for(int& time_step : timelapse){
    update(world);
    #pragma omp parallel for
    for(item& it : world){
      // erase-remove instead of deleting from the vector while iterating it
      auto too_far = [&](int index){
        return calculate(it, world[index]) > threshold;
      };
      it.nearest_neighbors.erase(
        std::remove_if(it.nearest_neighbors.begin(),
                       it.nearest_neighbors.end(), too_far),
        it.nearest_neighbors.end());
    }
  }
  //..
}
hypotheses:
large 'unpredictable' jumps in memory access diminish the effective cache bandwidth
false sharing forces cache-line reloads, as read-only and writable items may share the same cache line (see the sketch below)
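To make the second hypothesis concrete, a minimal, hypothetical sketch of false sharing (not part of the original use case): two threads update neighboring ints that live on the same 64-byte cache line, so every store invalidates that line in the other core's cache.
#include <thread>

struct counters {
  int a = 0; // a and b share one cache line: false sharing
  int b = 0; // padded variant: alignas(64) int a; alignas(64) int b;
};

int main(){
  counters c;
  // each thread writes only its own counter, yet both fight over the same line
  std::thread t1([&c]{ for(int i = 0; i < 100000000; ++i) ++c.a; });
  std::thread t2([&c]{ for(int i = 0; i < 100000000; ++i) ++c.b; });
  t1.join();
  t2.join();
  return c.a == c.b ? 0 : 1;
}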
Let's measure!
Use Case
# export OMP_NUM_THREADS=1
# path/to/likwid-perfctr -f -c 0 -g FALSE_SHARE \
numactl -m0 -C0 ./my_app
+----------------------------------+--------------+
| Metric                           | Core 0       |
+----------------------------------+--------------+
#..
| Local LLC false sharing [MByte]  | 0.0008       |
| Local LLC false sharing rate     | 5.608215e-10 |
#..
+----------------------------------+--------------+
# export OMP_NUM_THREADS=4
# path/to/likwid-perfctr -f -c 0-3 -g FALSE_SHARE \
numactl -m0 -C0-3 ./my_app
+---------------------------------------+--------------+
| Metric                                | Sum          |
+---------------------------------------+--------------+
#..
| Local LLC false sharing [MByte] STAT  | 2973.7637    |
| Local LLC false sharing rate STAT     | 0.0081       |
#..
+---------------------------------------+--------------+
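Which performance groups a given machine supports (FALSE_SHARE among them) can be listed beforehand; -a is a standard likwid-perfctr flag:
# path/to/likwid-perfctr -a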
Stream Benchmark as Reference
# export OMP_NUM_THREADS=1
# path/to/likwid-perfctr -f -c 0 -g FALSE_SHARE \
numactl -m0 -C0 ./stream
+----------------------------------+--------------+
| Metric                           | Core 0       |
+----------------------------------+--------------+
#..
| Local LLC false sharing [MByte]  | 0.0006       |
| Local LLC false sharing rate     | 6.057282e-10 |
#..
+----------------------------------+--------------+
# export OMP_NUM_THREADS=4
# path/to/likwid-perfctr -f -c 0-3 -g FALSE_SHARE \
numactl -m0 -C0-3 ./stream
+---------------------------------------+--------------+
| Metric                                | Sum          |
+---------------------------------------+--------------+
#..
| Local LLC false sharing [MByte] STAT  | 0.1067       |
| Local LLC false sharing rate STAT     | 4.080027e-07 |
#..
+---------------------------------------+--------------+
With 4 threads, STREAM shows ~0.1 MByte of false-sharing traffic where the application produced ~3 GByte: the false-sharing hypothesis holds.
Chandler Carruth, "Understanding Compiler Optimization", Meeting C++ 2015
Klaus Iglberger: Guys that do know a lot about performance do a lot of manual unrolling (manual vectorization). Apparently they don't trust the compiler too much. What is your take on this?
Chandler: How do you define "people who know a lot about performance"? Serious question. So I work with Google's optimisation team who is responsible for making our C++ code run fast. And I have never seen them manually unroll a loop.
#include <chrono>
#include <iostream>
#include "production_code.hpp"
#include "new_ideas.hpp"
int main(int argc, char** argv){
auto start = std::chrono::high_resolution_clock::now();
auto result = production_code::algorithm();
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_p = (end - start);
start = std::chrono::high_resolution_clock::now();
auto new_result = new_ideas::algorithm();
end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_i = (end - start);
std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count()
<< std::endl;
return 0;
}
Caveat: result and new_result are never used, so an optimizing compiler may elide both algorithm() calls entirely; benchmark::DoNotOptimize below addresses exactly this.
#include <chrono>
#include <iostream>
#include "production_code.hpp"
#include "new_ideas.hpp"
int main(int argc, char** argv){
auto start = std::chrono::high_resolution_clock::now();
auto result = production_code::algorithm();
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_p = (end - start);
start = std::chrono::high_resolution_clock::now();
auto new_result = new_ideas::algorithm();
end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_i = (end - start);
if(result == new_result)
std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count()
<< std::endl;
else
std::cout << "Never mind!" << std::endl;
}
#include ...
int main(int argc, char** argv){
auto result = 0;
auto new_result = 0;
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0;i<n_repetitions;++i)
result = production_code::algorithm();
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_p = (end - start);
start = std::chrono::high_resolution_clock::now();
for(int i = 0;i<n_repetitions;++i)
new_result = new_ideas::algorithm();
end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> t_i = (end - start);
if(result == new_result)
std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count()
<< std::endl;
else
std::cout << "Never mind!" << std::endl;
}
#include ...
using duration_t = std::chrono::duration<double>;
int main(int argc, char** argv){
//..
auto start = std::chrono::high_resolution_clock::now();
std::vector<duration_t> prod_timings(n_repetitions);
for(int i = 0;i<n_repetitions;++i){
start = std::chrono::high_resolution_clock::now();
result = production_code::algorithm();
prod_timings[i] = std::chrono::high_resolution_clock::now() - start;
}
//same with new_result = new_ideas::algorithm()
if(result == new_result){
std::ofstream ofile("results.csv"); // the constructor already opens the file
for(int i = 0;i<n_repetitions;++i){
ofile << i << ",production," << prod_timings[i].count() << ",seconds" << std::endl;
}
//same with new_idea
ofile.close();
}
else
std::cout << "Never mind!" << std::endl;
}
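With the per-repetition samples written out, robust statistics (minimum, median, quartiles) can be reported instead of a single mean, which is what Hoefler et al. argue for below; a minimal sketch of a median helper over the collected seconds:
#include <algorithm>
#include <vector>

// median of the timing samples in seconds; assumes at least one sample,
// takes a copy so the caller's ordering is preserved
double median(std::vector<double> samples){
  std::sort(samples.begin(), samples.end());
  const std::size_t n = samples.size();
  return (n % 2) ? samples[n/2] : 0.5*(samples[n/2 - 1] + samples[n/2]);
}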
... It's a simple Python interface around a blazing fast C++ library ...
from github.com/vincentlaucsb/csvmorph
... However C++ code used to be significantly faster for a long time, and also today still is in many cases.
T. Hoefler et al., "Scientific Benchmarking of Parallel Computing Systems: Twelve ways to tell the masses when reporting performance results", Proceedings of SC '15, 2015
Can't this be automated?
Question:
Are range-based for loops faster than integer-indexed ones, depending on the data type used?
#include <benchmark/benchmark.h>
#include <vector>
template <typename T>
double sum(const T* _data, std::size_t _len){
double value = 0;
for(std::size_t i = 0;i<_len;++i)
value += _data[i];
return value;
}
template <typename container_type>
double sum(const container_type& _data){
typedef typename container_type::value_type value_t;
double value = 0;
for(const value_t& el : _data)
value += el;
return value;
}
static void BM_integer_index(benchmark::State& state) {
const std::size_t len = 1 << 20;
std::vector<int> values(len, 0);
double result = 0;
for (auto _ : state){
benchmark::DoNotOptimize(result = sum(values.data(), len));
}
}
// Register the function as a benchmark
BENCHMARK(BM_integer_index);
static void BM_range_based(benchmark::State& state) {
const std::size_t len = 1 << 20;
std::vector<int> values(len, 0);
double result = 0;
for (auto _ : state){
benchmark::DoNotOptimize(result = sum(values));
}
}
BENCHMARK(BM_range_based);
BENCHMARK_MAIN();
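Building and running the benchmark (a sketch; bench.cpp is a placeholder file name and libbenchmark must be installed):
$ g++ -std=c++11 -O2 bench.cpp -lbenchmark -lpthread -o bench
$ ./bench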
Run on (4 X 3600 MHz CPU s)
2017-11-08 10:24:43
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------
Benchmark               Time             CPU  Iterations
--------------------------------------------------------
BM_integer_index   922920 ns       915531 ns         764
BM_range_based     937344 ns       929681 ns         768
template <typename T>
static void BM_integer_index(benchmark::State& state) {
const std::size_t len = state.range(0);
std::vector<T> values(len, T(0));
double result = 0;
for (auto _ : state){
benchmark::DoNotOptimize(result = sum(values.data(), len));
}
}
BENCHMARK_TEMPLATE(BM_integer_index,int)
->Arg(64)
->Arg(512)
->Arg(1 << 10)
->Arg(128<<10)
->Arg(1<<20)
->Arg(128<<20);
BENCHMARK_TEMPLATE(BM_integer_index,float)
->Arg(64)
->Arg(512)
->Arg(1 << 10)
->Arg(128<<10)
->Arg(1<<20)
->Arg(128<<20);
BENCHMARK_MAIN();
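Listing every size by hand is optional; google/benchmark can generate the argument sequence itself via its RangeMultiplier/Range API:
BENCHMARK_TEMPLATE(BM_integer_index, int)
  ->RangeMultiplier(8)
  ->Range(64, 128 << 20); // 64, 512, 4096, ... up to 128<<20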
multiple arguments are also supported
BENCHMARK_TEMPLATE(BM_integer_index,int)
//42 is the initial value of the reduced sum
->Args({64, 42})
//..
;
workflow:
Run on (4 X 3600 MHz CPU s)
2017-11-08 10:25:27
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                                 Time            CPU  Iterations
-------------------------------------------------------------------------
BM_integer_index<int>/64                 50 ns          50 ns    10000000
BM_integer_index<int>/512               424 ns         423 ns     1634359
BM_integer_index<int>/1024              864 ns         861 ns      772101
BM_integer_index<int>/131072         112125 ns      111777 ns        6344
BM_integer_index<int>/1048576        924382 ns      916717 ns         761
BM_integer_index<int>/134217728   123700290 ns   122614766 ns           6
BM_integer_index<float>/64               52 ns          51 ns    13374011
BM_integer_index<float>/512             427 ns         425 ns     1641151
BM_integer_index<float>/1024            863 ns         860 ns      772552
BM_integer_index<float>/131072       111322 ns      111160 ns        6109
BM_integer_index<float>/1048576      914593 ns      909174 ns         763
BM_integer_index<float>/134217728 122954355 ns   122219776 ns           6
BM_range_based<int>/64                   50 ns          50 ns    13580124
BM_range_based<int>/512                 429 ns         429 ns     1570356
BM_range_based<int>/1024                866 ns         865 ns      815042
BM_range_based<int>/131072           111289 ns      111139 ns        6316
BM_range_based<int>/1048576          912475 ns      907277 ns         761
BM_range_based<int>/134217728     122509880 ns   121832332 ns           6
BM_range_based<float>/64                 48 ns          48 ns    13944707
BM_range_based<float>/512               426 ns         426 ns     1637659
BM_range_based<float>/1024              863 ns         862 ns      810743
BM_range_based<float>/131072         110915 ns      110775 ns        6343
BM_range_based<float>/1048576        917501 ns      912365 ns         735
BM_range_based<float>/134217728   122908318 ns   122219268 ns           6
Answer: across both element types and all sizes tested, integer-indexed and range-based loops perform identically within noise.
$ cd src/libbenchmark
$ # install libbenchmark & the tidyverse R package first
$ CXXFLAGS=-O2 make report
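google/benchmark can also emit the CSV itself through its --benchmark_out and --benchmark_out_format command line flags, which makes the hand-rolled ofstream from earlier unnecessary:
$ ./bench --benchmark_out=results.csv --benchmark_out_format=csv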
nickbruun/hayai (229)
  pro: based on googletest, random order, fixture support
  con: no csv output, online docs? commit activity?, no donotoptimize, max/min/means reported by default

DigitalInBlue/celero (249)
  pro: no dependencies, baseline support, fixture support
  con: no csv output, means reported by default

nonius.io (49-194)
  pro: header-only, depends on boost, super statistics summary
  con: confusing repo structure, buggy example(s), confidence intervals fixed to normal distribution, no donotoptimize

google/benchmark (1985)
  pro: no dependencies, feature rich
  con: templated versus fixture based setup, means reported by default
gearshifft FFT benchmark
Take a balloon:
Use tools to check the lay of the land.
Falsify the rubber duck:
Profile and check your hypotheses.
Survive the dungeon:
With automated, ensemble-based benchmarks.
C++ is a language considered "fast".
We, the community, need to live up to this standard and have robust and reproducible performance numbers!
Thank you for your attention!