Compression and the pain of performance portability

Peter Steinbach
(Scionics Computer Innovation GmbH)
steinbach@scionics.de, @psteinb_

May 16, 2018, Advanced Developers Conference (Burghausen)

Agenda

For the next 60 Minutes ...

Peter who?
3D Microscopy
Compression
Sqeazy
Portable Performance

Disclaimer

github.com/psteinb/pain-of-performance-portability

About me

Scionics Computer Innovation GmbH

founded in 2001
located in Dresden, Saxony (Germany)
currently 35 staff
consulting and software products for industry and academia

Need Help with Data Analysis, Machine Learning, Parallelisation or Performance Engineering ?
Get in Touch!

Max-Planck Institute for Molecular Cell Biology and Genetics

Max Planck Institute of Molecular Cell Biology and Genetics (MPI CBG), 500 staff

Center for Systems Biology, (CSBD) 150 staff

My Role

HPC

Performance

by Sergey Ignatchenko

3D Microscopy

Selective Plane Illumination

Living Embryo Development

from openSPIM

Living Organism

from A. Bassi et al, Optical tomography complements light sheet microscopy for in toto imaging of zebrafish development

Innovation = Challenges

Commercial : Zeiss Lightsheet Z1

Custom : Xscope by Nicola Maghelli et al (MPI CBG)

120-240 MB/s for 24/7

500-1024 MB/s for 24/7

Big Data!

by John, CC BY-SA 2.0

Compression

Lempel–Ziv–Welch Algorithm

dictionary based losslessl compression
at the heart of many compression algorithms today
DEFLATE = LZW + huffman encoding
ZIP, PNG, TIFF ...

Example


TOBEORNOTTOBEORTOBEORNOT#

alphabet of 26+1 characters
(capital letters + stop code #)
alphabet can be represented by 2⁵ values


'#' : 0x0 {0}
'A' : 0x1 {1}
'B' : 0x2 {2}
...
'Z' : 0x1a {26}

LZW 1


TOBEORNOTTOBEORTOBEORNOT#
^

TO unknown, add to dict
T known, emit 20

written:


{20}

dictionary:


'#'  : 0x0  {0}
'A'  : 0x1  {1}
...
'T'  : 0x14 {20}
...
'Z'  : 0x1a {26}
'TO' : 0x1b {27}

LZW 2


TOBEORNOTTOBEORTOBEORNOT#
 ^

OB unknown, add to dict
O known, emit 15

written:


{20}{15}

dictionary:


'#'  : 0x0  {0}
'A'  : 0x1  {1}
...
'T'  : 0x14 {20}
...
'TO' : 0x1b {27}
'OB' : 0x1c {28}

LZW 3


TOBEORNOTTOBEORTOBEORNOT#
  ^

BE unknown, add to dict
B known, emit 2

written:


{20}{15}{2}

dictionary:


'#'  : 0x0  {0}
'A'  : 0x1  {1}
...
'B'  : 0x2  {2}
...
'OB' : 0x1c {28}
'BE' : 0x1d {29}

LZW 4


TOBEORNOTTOBEORTOBEORNOT#
   ^

EO unknown, add to dict
E known, emit 5

written:


{20}{15}{2}{5}

dictionary:


'#'  : 0x0  {0}
'A'  : 0x1  {1}
...
'E'  : 0x5  {5}
...
'BE' : 0x1d {29}
'EO' : 0x1e {30}

LZW 10


TOBEORNOTTOBEORTOBEORNOT#
         ^

TOB unknown, add to dict
TO known, emit 27
1 symbol, 2 characters

written:


{20}{15}{2}{5}{15}{18}{14}{15}{20}{27}

dictionary:


'#'  : 0x0  {0}
'A'  : 0x1  {1}
...
'TE' : 0x1b {27}
...
'TT' : 0x23 {35}
'TOB': 0x24 {36}

LZW Final


TOBEORNOTTOBEORTOBEORNOT#
{20}{15}{2}{5}{15}{18}{14}{15}{20}{27}{29}{31}{36}{30}{32}{34}{0}

original:
25 symbols × 5 b/symbol = 125 b
encoded :
(6 codes × 5 b/code) + (11 codes × 6 b/code) = 96 b


'#'  : 0x0  {0}
'A'  : 0x1  {1}
...
'EOR': 0x28 {40}
'RNO': 0x29 {41}

LZ4 and friends

upspur of new and fast compression libraries in the last years
- lz4 by Yann Collet
- zstd by Yann Collet (Facebook)
- brotli by google ...

On our 16bit data?

lz4


/dev/shm $ time lz4 spim_sample.tif                              
Compressed filename will be : mit_sample.tif.lz4 
Compressed 423637504 bytes into 302613798 bytes ==> 71.43%                     
lz4 spim_sample.tif  1.28s user 0.18s system 99% cpu 1.470 total

405MB file, 289MB encoded, 316 MB/s ingest

zstd


/dev/shm $ time zstd spim_sample.tif
mit_sample.tif       : 44.11%   (423637504 => 186867090 bytes, mit_sample.tif.zst) 
zstd spim_sample.tif  3.96s user 0.16s system 104% cpu 3.936 total

405MB file, 179MB encoded, 102 MB/s ingest

Sqeazy

Requirements for sqeazy.github.io

provide compression at 500 MB/s or more
target:
- lossless: 3x or more
- lossy: 10x or more
flexible pipeline definition
support video codecs (x264, x265)
support community file formats like HDF5

support 16 and 8-bit data types
multi-core
x86
Linux, macOS and Win7
redistributable binary
interface to Java

Good Luck!

Bitshuffle

Original (6 pixel values of 16 bit)


                9                 1                 2                12             56013             36742
00000000 00001001 00000000 00000001 00000000 00000010 00000000 00001100 11011010 11001101 10001111 10000110

Bitplane 0


                9                 1                 2                12             56013             36742
00000000 00001001 00000000 00000001 00000000 00000010 00000000 00001100 11011010 11001101 10001111 10000110
^                 ^                 ^                 ^                 ^                 ^
-> 000011

Bitplane 15


                9                 1                 2                12             56013             36742
00000000 00001001 00000000 00000001 00000000 00000010 00000000 00001100 11011010 11001101 10001111 10000110
                ^                 ^                 ^                 ^                 ^                 ^
-> 110010

Pipelining

On the command-line:


$ sqy encode -p 'bitswap1->lz4' my.tif

From Java:


final Pointer<Byte> bPipelineName = Pointer.pointerToCString("bitswap->lz4");
SqeazyLibrary.SQY_PipelineEncode_UI16(bPipelineName,lSourceBytes,
                                      lSourceShape,3,
                                      lCompressedBytes,lPointerToDestinationLength,
                                      1)

Internal C++:


auto pipe = sqeazy::dynamic_pipeline<std::uint16_t>::from_string("bitswap1->lz4");
char* encoded_end = pipe.encode(input.data(),
                                encoded.data(),
                                shape);

original: 140MB, lz4-only: 114MB, bitshuffle+lz4: 60MB

Sqeazy Pipelines


template <
    typename raw_t,
    template<typename = raw_t> class filter_factory_t = default_filter_factory,
    typename inbound_sink_factory_t = default_sink_factory<raw_t>,
    typename optional_tail_factory_t = void
    >
  struct dynamic_pipeline
{

    std::vector<std::shared_ptr<base_stage<raw_t> > > stages;

}

A Need for Temporaries?


out_type* dynamic_pipeline::encode(const in_type* raw, out_type* encoded, shape_t shape){

   header_t hdr(in_type(), shape, this->name());
   char* start_here = std::copy(hdr.c_str(),hdr.c_str()+hdr.size(),
                                    static_cast<char*>(encoded));
                                    
   for( stage_t stage : stages ){
   
        stage.encode(raw,encoded,shape);
        std::swap(raw,encoded);
   
   }

}

Latency Hiding


template <typename T>
using unique_array = std::unique_ptr<T[], boost::alignment::aligned_delete>;

out_type* dynamic_pipeline::encode(const in_type* raw, out_type* encoded, shape_t shape){

   std::future<unique_array<incoming_t>> temp = std::async(make_aligned<incoming_t>,
                                                           std::size_t(32),
                                                           scratchpad_bytes);

   header_t hdr(in_type(), shape, this->name());
   char* start_here = std::copy(hdr.c_str(),hdr.c_str()+hdr.size(),
                                    static_cast<char*>(encoded));
   
   encoded = temp.get();
   for( int s = 0; s< stages.size();++s){
   
        stage.encode(raw,encoded,shape);
        std::swap(raw,encoded);
   
   }

}

Portable Performance

Perspectives and illusions

Portable Performance as same performance on every system is impossible

cache level volume(s) depends on price
memory system changes (bytes per flops)
clock counts, turbo boosts
instruction sets available
installed runtime library versions

Manage Expectations

Honest Performance

communicate hardware requirements
speak in units of the domain
(e.g. images per second, pixels per second)
give ranges
(e.g. algorithm can compress from 0.95 to 3x on our test data)
provide reproducible benchmarks
(at best which can be run by user)

Adaptive Algorithms?

From blosc tutorial:

Often the L2 cache size (e.g. 256kB for an Intel Haswell) is a good starting point for optimization.

compass

single-header library (thanks to pcpp)
easy drop-in to your project
detect hardware features at runtime

detect (some) compile-time features
enables sensible hardware specific defaults
no dependencies

compass features


static const bool has_sse2 = compass::compiletime::has<compass::feature::sse2>::value);
if(has_sse2)
{
  do_magic_with_sse2();
}


auto has_avx2 = compass::runtime::has(compass::feature::avx2());
if(has_avx2)
{
  do_magic_with_avx2();
}


auto L2_in_kb = compass::runtime::size::cache::level(2);
foo.set_blocksize(L2_in_kb*.75)

compass benchmark


Run on (4 X 3600 MHz CPU s)
2018-05-14 17:37:29
***WARNING*** CPU scaling is enabled, the benchmark real time ...
--------------------------------------------------------------
Benchmark                       Time           CPU Iterations
--------------------------------------------------------------
BM_compass_sse4_1              31 ns         31 ns   22705074
BM_cpu_features_sse4_1        242 ns        241 ns    2870098

Competition (google/cpu_features) is hard, but not unbeatable!

compass lessons learned

gcc/clang yield similar APIs (good!)
MSVC hard to control target hardware
MSVC hardly communicates compile targets in preprocessor flags
what the intel manual says != reality in many cases (detecting physical cores)
would love to see more C++ introspection activity (free memory, current CPU load)

parallelisation

Always prefer implicit over explicit parallelisation techniques.

Clay Breshears, The Art of Concurrency, O'Reilly, 2009

state of cross-platform implicit concurrency in C++

OpenMP

gcc 4+
clang 4+
MSVC 2

C++17 parallel TS

gcc ??
clang ??
MSVC

Code Bloat with C++17 ?


std::vector<T> a(size), b(size);

#pragma omp parallel for num_threads(n) static(chunksize=42)
for(int i = 0;i < size;++i)
    a[i] = foo(b[i]);


std::transform( std::par, b.cbegin(), b.cend(),
                          a.begin(), 
                          []( auto & el){ return foo(el);} 
              );

Executors in C++20?

Summary

Performance Portability?

compression is a must have for 21st century science and data services
Modern C++11/14/17 is the goto tool for high performance applications
Ecosystem needs more flexible tooling to adapt to ever changing hardware

C++ needs to come with it's own portable batteries!

Backup

Better Compression with video codecs?

P Steinbach, High-bandwidth 3D Image Compression to Boost Predictive Life Sciences, 2017

perf Reloaded with FlameGraphs


$ perf record -g ./my-slow-binary
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.023 MB perf.data (75 samples) ]
$ perf script > out.perf
$ ./stackcollapse-perf.pl out.perf > out.folded
$ ./flamegraph.pl out.folded > perf_samples.svg

visualisation technique conceived by Brendan Gregg (Netflix)
seamless integration into perf, dtrace, systemtap, XCode Instruments, Lightweight Java Profiler, Microsoft Visual Studio profiles, ...
based on collected counter samples and the stacktrace they were collected in

Sqeazy Encoding as FlameGraph

(x axis) current stack level in alphabetical order

(y axis) number of samples in that stacktrace level

flamegraphs on Windows

based on Event Tracing for Windows
UIforETW to collect traces
very versatile profiler (all system, process-only)
can show much more information than flamegraphs