Peter Steinbach
steinbach@scionics.de, @psteinb_
Need Help with Data Analysis, Machine Learning, Parallelisation, or Performance Engineering?
Get in Touch!
Max Planck Institute of Molecular Cell Biology and Genetics (MPI CBG), 500 staff
Center for Systems Biology Dresden (CSBD), 150 staff
HPC
Performance
from openSPIM
Commercial : Zeiss Lightsheet Z1
Custom : Xscope by Nicola Maghelli et al (MPI CBG)
120-240 MB/s, 24/7
500-1024 MB/s, 24/7
by John, CC BY-SA 2.0
TOBEORNOTTOBEORTOBEORNOT#
'#' : 0x0 {0}
'A' : 0x1 {1}
'B' : 0x2 {2}
...
'Z' : 0x1a {26}
TOBEORNOTTOBEORTOBEORNOT#
^
written:
{20}
dictionary:
'#' : 0x0 {0}
'A' : 0x1 {1}
...
'T' : 0x14 {20}
...
'Z' : 0x1a {26}
'TO' : 0x1b {27}
TOBEORNOTTOBEORTOBEORNOT#
^
written:
{20}{15}
dictionary:
'#' : 0x0 {0}
'A' : 0x1 {1}
...
'T' : 0x14 {20}
...
'TO' : 0x1b {27}
'OB' : 0x1c {28}
TOBEORNOTTOBEORTOBEORNOT#
^
written:
{20}{15}{2}
dictionary:
'#' : 0x0 {0}
'A' : 0x1 {1}
...
'B' : 0x2 {2}
...
'OB' : 0x1c {28}
'BE' : 0x1d {29}
TOBEORNOTTOBEORTOBEORNOT#
^
written:
{20}{15}{2}{5}
dictionary:
'#' : 0x0 {0}
'A' : 0x1 {1}
...
'E' : 0x5 {5}
...
'BE' : 0x1d {29}
'EO' : 0x1e {30}
TOBEORNOTTOBEORTOBEORNOT#
^
written:
{20}{15}{2}{5}{15}{18}{14}{15}{20}{27}
dictionary:
'#' : 0x0 {0}
'A' : 0x1 {1}
...
'TO' : 0x1b {27}
...
'TT' : 0x23 {35}
'TOB': 0x24 {36}
TOBEORNOTTOBEORTOBEORNOT#
{20}{15}{2}{5}{15}{18}{14}{15}{20}{27}{29}{31}{36}{30}{32}{34}{0}
'#' : 0x0 {0}
'A' : 0x1 {1}
...
'EOR': 0x28 {40}
'RNO': 0x29 {41}
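The walkthrough above can be condensed into a few lines of code. A minimal sketch of the LZW encoder, assuming the slide's initial dictionary ('#' as code 0, 'A'..'Z' as codes 1..26):

```cpp
#include <map>
#include <string>
#include <vector>

// Minimal LZW encoder for the TOBEORNOT example above.
// Initial dictionary: '#' -> 0, 'A' -> 1 ... 'Z' -> 26, as on the slide.
std::vector<int> lzw_encode(const std::string& input) {
    std::map<std::string, int> dict;
    dict["#"] = 0;
    for (char c = 'A'; c <= 'Z'; ++c)
        dict[std::string(1, c)] = c - 'A' + 1;

    int next_code = 27;           // first free code after the alphabet
    std::vector<int> out;
    std::string w;                // longest prefix already in the dictionary
    for (char c : input) {
        std::string wc = w + c;
        if (dict.count(wc)) {     // keep growing the current match
            w = wc;
        } else {                  // emit the match, learn the extension
            out.push_back(dict[w]);
            dict[wc] = next_code++;
            w = std::string(1, c);
        }
    }
    if (!w.empty())
        out.push_back(dict[w]);
    return out;
}
```

Feeding it "TOBEORNOTTOBEORTOBEORNOT#" reproduces the written code sequence shown above.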
lz4
/dev/shm $ time lz4 spim_sample.tif
Compressed filename will be : spim_sample.tif.lz4
Compressed 423637504 bytes into 302613798 bytes ==> 71.43%
lz4 spim_sample.tif 1.28s user 0.18s system 99% cpu 1.470 total
405MB file, 289MB encoded, 316 MB/s ingest
zstd
/dev/shm $ time zstd spim_sample.tif
spim_sample.tif : 44.11% (423637504 => 186867090 bytes, spim_sample.tif.zst)
zstd spim_sample.tif 3.96s user 0.16s system 104% cpu 3.936 total
405MB file, 179MB encoded, 102 MB/s ingest
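The ratio and ingest figures follow directly from the logs; a quick sketch of the arithmetic (byte counts and wall-clock times taken from the lz4/zstd runs above):

```cpp
// Compression ratio (percent of original size) and ingest bandwidth (MB/s),
// computed from original/encoded byte counts and the seconds reported by `time`.
double ratio_percent(double original_bytes, double encoded_bytes) {
    return 100.0 * encoded_bytes / original_bytes;
}

double ingest_mb_per_s(double original_bytes, double seconds) {
    return original_bytes / (1024.0 * 1024.0) / seconds;
}
```

For lz4: 302613798 / 423637504 ≈ 71.4%, at roughly 316 MB/s; for zstd: ≈ 44.1% at roughly 102 MB/s, matching the summary lines above.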
Good Luck!
Original (6 pixel values of 16 bit)
9 1 2 12 56013 36742
00000000 00001001 00000000 00000001 00000000 00000010 00000000 00001100 11011010 11001101 10001111 10000110
Bitplane 0
9 1 2 12 56013 36742
00000000 00001001 00000000 00000001 00000000 00000010 00000000 00001100 11011010 11001101 10001111 10000110
(^ = most significant bit of each 16-bit value)
-> 000011
Bitplane 15
9 1 2 12 56013 36742
00000000 00001001 00000000 00000001 00000000 00000010 00000000 00001100 11011010 11001101 10001111 10000110
(^ = least significant bit of each 16-bit value)
-> 110010
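Extracting a single bitplane can be sketched in a few lines (a simplified illustration, not the sqeazy implementation; plane 0 denotes the most significant bit, as above):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Collect bit number `plane` of every 16-bit value into a bit string.
// Plane 0 is the most significant bit, plane 15 the least significant,
// matching the numbering used above.
std::string bitplane(const std::vector<std::uint16_t>& values, int plane) {
    std::string bits;
    for (std::uint16_t v : values)
        bits += ((v >> (15 - plane)) & 1) ? '1' : '0';
    return bits;
}
```

Applied to the six sample values, plane 0 yields 000011 and plane 15 yields 110010, as shown above.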
On the command-line:
$ sqy encode -p 'bitswap1->lz4' my.tif
From Java:
final Pointer<Byte> bPipelineName = Pointer.pointerToCString("bitswap1->lz4");
SqeazyLibrary.SQY_PipelineEncode_UI16(bPipelineName,lSourceBytes,
lSourceShape,3,
lCompressedBytes,lPointerToDestinationLength,
1);
Internal C++:
auto pipe = sqeazy::dynamic_pipeline<std::uint16_t>::from_string("bitswap1->lz4");
char* encoded_end = pipe.encode(input.data(),
encoded.data(),
shape);
original: 140MB, lz4-only: 114MB, bitshuffle+lz4: 60MB
template <
typename raw_t,
template<typename = raw_t> class filter_factory_t = default_filter_factory,
typename inbound_sink_factory_t = default_sink_factory<raw_t>,
typename optional_tail_factory_t = void
>
struct dynamic_pipeline
{
std::vector<std::shared_ptr<base_stage<raw_t> > > stages;
};
out_type* dynamic_pipeline::encode(const in_type* raw, out_type* encoded, shape_t shape){
  header_t hdr(in_type(), shape, this->name());
  char* start_here = std::copy(hdr.c_str(), hdr.c_str()+hdr.size(),
                               static_cast<char*>(encoded));
  for( auto & stage : stages ){           // stages hold shared_ptr<base_stage>
    stage->encode(raw, encoded, shape);
    std::swap(raw, encoded);              // output of this stage feeds the next
  }
  return encoded;
}
template <typename T>
using unique_array = std::unique_ptr<T[], boost::alignment::aligned_delete>;
out_type* dynamic_pipeline::encode(const in_type* raw, out_type* encoded, shape_t shape){
  // kick off the aligned scratchpad allocation while the header is written
  std::future<unique_array<incoming_t>> temp = std::async(make_aligned<incoming_t>,
                                                          std::size_t(32),
                                                          scratchpad_bytes);
  header_t hdr(in_type(), shape, this->name());
  char* start_here = std::copy(hdr.c_str(), hdr.c_str()+hdr.size(),
                               static_cast<char*>(encoded));
  unique_array<incoming_t> scratch = temp.get();  // blocks until allocated
  for( std::size_t s = 0; s < stages.size(); ++s ){
    stages[s]->encode(raw, encoded, shape);
    std::swap(raw, encoded);
  }
  return encoded;
}
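The buffer-swapping idiom shared by both encode() variants can be illustrated with a self-contained toy (run_pipeline and stage_fn are hypothetical names for this sketch, not the sqeazy API):

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using buffer_t = std::vector<std::uint8_t>;
using stage_fn = std::function<void(const buffer_t&, buffer_t&)>;

// Toy version of the ping-pong scheme above: each stage reads the previous
// stage's output, then std::swap makes that output the next stage's input.
buffer_t run_pipeline(buffer_t input, const std::vector<stage_fn>& stages) {
    buffer_t scratch;
    for (const stage_fn& stage : stages) {
        scratch.clear();
        stage(input, scratch);
        std::swap(input, scratch);
    }
    return input;
}
```

Only two buffers are ever alive, no matter how many stages the pipeline holds.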
Portable Performance, in the sense of identical performance on every system, is impossible

Honest Performance
From blosc tutorial:
Often the L2 cache size (e.g. 256kB for an Intel Haswell) is a good starting point for optimization.

static const bool has_sse2 = compass::compiletime::has<compass::feature::sse2>::value;
if(has_sse2)
{
do_magic_with_sse2();
}
auto has_avx2 = compass::runtime::has(compass::feature::avx2());
if(has_avx2)
{
do_magic_with_avx2();
}
auto L2_in_kb = compass::runtime::size::cache::level(2);
foo.set_blocksize(L2_in_kb * .75);
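As a concrete instance of the blosc advice quoted above, a block size can be derived from the detected L2 cache (choose_blocksize is an illustrative helper, not part of compass):

```cpp
#include <cstddef>

// Pick a block size of roughly 75% of the L2 cache, expressed in elements.
// The 0.75 factor follows the blosc-style heuristic quoted above.
std::size_t choose_blocksize(std::size_t l2_cache_bytes, std::size_t element_bytes) {
    std::size_t target_bytes = (l2_cache_bytes * 3) / 4;
    return target_bytes / element_bytes;
}
```

For a 256 kB Haswell L2 and 16-bit pixels this yields 98304 elements per block.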
Run on (4 X 3600 MHz CPU s)
2018-05-14 17:37:29
***WARNING*** CPU scaling is enabled, the benchmark real time ...
--------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------
BM_compass_sse4_1 31 ns 31 ns 22705074
BM_cpu_features_sse4_1 242 ns 241 ns 2870098
Competition (google/cpu_features) is hard, but not unbeatable!
Always prefer implicit over explicit parallelisation techniques.
Clay Breshears, The Art of Concurrency, O'Reilly, 2009
OpenMP
C++17 parallel algorithms (formerly the Parallelism TS)
std::vector<T> a(size), b(size);
#pragma omp parallel for num_threads(n) schedule(static, 42)
for(int i = 0;i < size;++i)
a[i] = foo(b[i]);
std::transform( std::execution::par, b.cbegin(), b.cend(),
                a.begin(),
                []( const auto & el ){ return foo(el); }
              );
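Where neither OpenMP nor a parallel-STL implementation is available, the same pattern can be hand-rolled with std::async (a sketch; parallel_transform is a hypothetical helper, not a standard facility):

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Chunked parallel transform: split the input into n_chunks ranges and
// run each range in its own std::async task.
template <typename T, typename F>
std::vector<T> parallel_transform(const std::vector<T>& in, F f,
                                  std::size_t n_chunks = 4) {
    std::vector<T> out(in.size());
    std::vector<std::future<void>> tasks;
    const std::size_t chunk = (in.size() + n_chunks - 1) / n_chunks;
    for (std::size_t c = 0; c < n_chunks; ++c) {
        const std::size_t begin = c * chunk;
        const std::size_t end = std::min(in.size(), begin + chunk);
        if (begin >= end)
            break;
        tasks.push_back(std::async(std::launch::async, [&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                out[i] = f(in[i]);
        }));
    }
    for (auto& t : tasks)
        t.get();  // wait for completion, propagate exceptions
    return out;
}
```

Each task writes to a disjoint slice of out, so no further synchronisation is needed.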
Executors in C++20?
Compression is a must-have for 21st century science and data services
Modern C++11/14/17 is the go-to tool for high-performance applications
The ecosystem needs more flexible tooling to adapt to ever-changing hardware
C++ needs to come with its own portable batteries!
P Steinbach, High-bandwidth 3D Image Compression to Boost Predictive Life Sciences, 2017
$ perf record -g ./my-slow-binary
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.023 MB perf.data (75 samples) ]
$ perf script > out.perf
$ ./stackcollapse-perf.pl out.perf > out.folded
$ ./flamegraph.pl out.folded > perf_samples.svg
