Compression and the pain of performance portability

Peter Steinbach
(Scionics Computer Innovation GmbH)
steinbach@scionics.de, @psteinb_

May 16, 2018, Advanced Developers Conference (Burghausen)

Agenda

For the next 60 Minutes ...

  1. Peter who?
  2. 3D Microscopy
  3. Compression
  4. Sqeazy
  5. Portable Performance

Disclaimer

About me

Scionics Computer Innovation GmbH

 

  • founded in 2001
  • located in Dresden, Saxony (Germany)
  • currently 35 staff
  • consulting and software products for industry and academia

 

Need Help with Data Analysis, Machine Learning, Parallelisation or Performance Engineering ?
Get in Touch!

Max-Planck Institute for Molecular Cell Biology and Genetics

My Role

HPC

Performance


by Sergey Ignatchenko

3D Microscopy

Selective Plane Illumination

CC SA 3.0 by JKrieger
CC SA 3.0 by JKrieger

Living Embryo Development

from openSPIM

Living Organism

from A. Bassi et al, Optical tomography complements light sheet microscopy for in toto imaging of zebrafish development

Innovation = Challenges

Commercial : Zeiss Lightsheet Z1

Custom : Xscope by Nicola Maghelli et al (MPI CBG)

120-240 MB/s for 24/7

500-1024 MB/s for 24/7

Big Data!

by John, CC BY-SA 2.0

Compression

Lempel–Ziv–Welch Algorithm

  • dictionary based losslessl compression
  • at the heart of many compression algorithms today
  • DEFLATE = LZW + huffman encoding
  • ZIP, PNG, TIFF ...

Example


TOBEORNOTTOBEORTOBEORNOT#

 

  • alphabet of 26+1 characters
    (capital letters + stop code #)
  • alphabet can be represented by 25 values

'#' : 0x0 {0}
'A' : 0x1 {1}
'B' : 0x2 {2}
...
'Z' : 0x1a {26}

LZW 1


TOBEORNOTTOBEORTOBEORNOT#
^

 

  • TO unknown, add to dict
  • T known, emit 20

written:


{20}

dictionary:


'#'  : 0x0  {0}
'A'  : 0x1  {1}
...
'T'  : 0x14 {20}
...
'Z'  : 0x1a {26}
'TO' : 0x1b {27}

LZW 2


TOBEORNOTTOBEORTOBEORNOT#
 ^

 

  • OB unknown, add to dict
  • O known, emit 15

written:


{20}{15}

dictionary:


'#'  : 0x0  {0}
'A'  : 0x1  {1}
...
'T'  : 0x14 {20}
...
'TO' : 0x1b {27}
'OB' : 0x1c {28}

LZW 3


TOBEORNOTTOBEORTOBEORNOT#
  ^

 

  • BE unknown, add to dict
  • B known, emit 2

written:


{20}{15}{2}

dictionary:


'#'  : 0x0  {0}
'A'  : 0x1  {1}
...
'B'  : 0x2  {2}
...
'OB' : 0x1c {28}
'BE' : 0x1d {29}

LZW 4


TOBEORNOTTOBEORTOBEORNOT#
   ^

 

  • EO unknown, add to dict
  • E known, emit 5

written:


{20}{15}{2}{5}

dictionary:


'#'  : 0x0  {0}
'A'  : 0x1  {1}
...
'E'  : 0x5  {5}
...
'BE' : 0x1d {29}
'EO' : 0x1e {30}

LZW 10


TOBEORNOTTOBEORTOBEORNOT#
         ^

 

  • TOB unknown, add to dict
  • TO known, emit 27
  • 1 symbol, 2 characters

written:


{20}{15}{2}{5}{15}{18}{14}{15}{20}{27}

dictionary:


'#'  : 0x0  {0}
'A'  : 0x1  {1}
...
'TE' : 0x1b {27}
...
'TT' : 0x23 {35}
'TOB': 0x24 {36}

LZW Final


TOBEORNOTTOBEORTOBEORNOT#
{20}{15}{2}{5}{15}{18}{14}{15}{20}{27}{29}{31}{36}{30}{32}{34}{0}

 

  • original:
    25 symbols × 5 b/symbol = 125 b
  • encoded :
    (6 codes × 5 b/code) + (11 codes × 6 b/code) = 96 b

'#'  : 0x0  {0}
'A'  : 0x1  {1}
...
'EOR': 0x28 {40}
'RNO': 0x29 {41}

LZ4 and friends

  • upspur of new and fast compression libraries in the last years
    • lz4 by Yann Collet
    • zstd by Yann Collet (Facebook)
    • brotli by google ...

quixdb.github.io/squash-benchmark

On our 16bit data?

lz4


/dev/shm $ time lz4 spim_sample.tif                              
Compressed filename will be : mit_sample.tif.lz4 
Compressed 423637504 bytes into 302613798 bytes ==> 71.43%                     
lz4 spim_sample.tif  1.28s user 0.18s system 99% cpu 1.470 total

405MB file, 289MB encoded, 316 MB/s ingest

 

zstd


/dev/shm $ time zstd spim_sample.tif
mit_sample.tif       : 44.11%   (423637504 => 186867090 bytes, mit_sample.tif.zst) 
zstd spim_sample.tif  3.96s user 0.16s system 104% cpu 3.936 total

405MB file, 179MB encoded, 102 MB/s ingest

Sqeazy

Requirements for sqeazy.github.io

  • provide compression at 500 MB/s or more
  • target:
    • lossless: 3x or more
    • lossy: 10x or more
  • flexible pipeline definition
  • support video codecs (x264, x265)
  • support community file formats like HDF5
  • support 16 and 8-bit data types
  • multi-core
  • x86
  • Linux, macOS and Win7
  • redistributable binary
  • interface to Java

Good Luck!

Bitshuffle

Original (6 pixel values of 16 bit)


                9                 1                 2                12             56013             36742
00000000 00001001 00000000 00000001 00000000 00000010 00000000 00001100 11011010 11001101 10001111 10000110

Bitplane 0


                9                 1                 2                12             56013             36742
00000000 00001001 00000000 00000001 00000000 00000010 00000000 00001100 11011010 11001101 10001111 10000110
^                 ^                 ^                 ^                 ^                 ^
-> 000011

Bitplane 15


                9                 1                 2                12             56013             36742
00000000 00001001 00000000 00000001 00000000 00000010 00000000 00001100 11011010 11001101 10001111 10000110
                ^                 ^                 ^                 ^                 ^                 ^
-> 110010

Pipelining

On the command-line:


$ sqy encode -p 'bitswap1->lz4' my.tif

From Java:


final Pointer<Byte> bPipelineName = Pointer.pointerToCString("bitswap->lz4");
SqeazyLibrary.SQY_PipelineEncode_UI16(bPipelineName,lSourceBytes,
                                      lSourceShape,3,
                                      lCompressedBytes,lPointerToDestinationLength,
                                      1)

Internal C++:


auto pipe = sqeazy::dynamic_pipeline<std::uint16_t>::from_string("bitswap1->lz4");
char* encoded_end = pipe.encode(input.data(),
                                encoded.data(),
                                shape);

original: 140MB, lz4-only: 114MB, bitshuffle+lz4: 60MB

Sqeazy Pipelines


template <
    typename raw_t,
    template<typename = raw_t> class filter_factory_t = default_filter_factory,
    typename inbound_sink_factory_t = default_sink_factory<raw_t>,
    typename optional_tail_factory_t = void
    >
  struct dynamic_pipeline
{

    std::vector<std::shared_ptr<base_stage<raw_t> > > stages;

}

A Need for Temporaries?


out_type* dynamic_pipeline::encode(const in_type* raw, out_type* encoded, shape_t shape){

   header_t hdr(in_type(), shape, this->name());
   char* start_here = std::copy(hdr.c_str(),hdr.c_str()+hdr.size(),
                                    static_cast<char*>(encoded));
                                    
   for( stage_t stage : stages ){
   
        stage.encode(raw,encoded,shape);
        std::swap(raw,encoded);
   
   }

}

Latency Hiding


template <typename T>
using unique_array = std::unique_ptr<T[], boost::alignment::aligned_delete>;

out_type* dynamic_pipeline::encode(const in_type* raw, out_type* encoded, shape_t shape){

   std::future<unique_array<incoming_t>> temp = std::async(make_aligned<incoming_t>,
                                                           std::size_t(32),
                                                           scratchpad_bytes);

   header_t hdr(in_type(), shape, this->name());
   char* start_here = std::copy(hdr.c_str(),hdr.c_str()+hdr.size(),
                                    static_cast<char*>(encoded));
   
   encoded = temp.get();
   for( int s = 0; s< stages.size();++s){
   
        stage.encode(raw,encoded,shape);
        std::swap(raw,encoded);
   
   }

}

Portable Performance

Perspectives and illusions

Portable Performance as same performance on every system is impossible

  • cache level volume(s) depends on price
  • memory system changes (bytes per flops)
  • clock counts, turbo boosts
  • instruction sets available
  • installed runtime library versions

Manage Expectations

CC0
CC0

Honest Performance

  • communicate hardware requirements
  • speak in units of the domain
    (e.g. images per second, pixels per second)
  • give ranges
    (e.g. algorithm can compress from 0.95 to 3x on our test data)
  • provide reproducible benchmarks
    (at best which can be run by user)

Adaptive Algorithms?

CC0
CC0

From blosc tutorial:

Often the L2 cache size (e.g. 256kB for an Intel Haswell) is a good starting point for optimization.

compass

  • single-header library (thanks to pcpp)
  • easy drop-in to your project
  • detect hardware features at runtime
  • detect (some) compile-time features
  • enables sensible hardware specific defaults
  • no dependencies

 

github.com/psteinb/compass
github.com/psteinb/compass

compass features


static const bool has_sse2 = compass::compiletime::has<compass::feature::sse2>::value);
if(has_sse2)
{
  do_magic_with_sse2();
}

auto has_avx2 = compass::runtime::has(compass::feature::avx2());
if(has_avx2)
{
  do_magic_with_avx2();
}

auto L2_in_kb = compass::runtime::size::cache::level(2);
foo.set_blocksize(L2_in_kb*.75)

compass benchmark


Run on (4 X 3600 MHz CPU s)
2018-05-14 17:37:29
***WARNING*** CPU scaling is enabled, the benchmark real time ...
--------------------------------------------------------------
Benchmark                       Time           CPU Iterations
--------------------------------------------------------------
BM_compass_sse4_1              31 ns         31 ns   22705074
BM_cpu_features_sse4_1        242 ns        241 ns    2870098

 

Competition (google/cpu_features) is hard, but not unbeatable!

compass lessons learned

  • gcc/clang yield similar APIs (good!)
  • MSVC hard to control target hardware
  • MSVC hardly communicates compile targets in preprocessor flags
  • what the intel manual says != reality in many cases (detecting physical cores)
  • would love to see more C++ introspection activity (free memory, current CPU load)

parallelisation

Always prefer implicit over explicit parallelisation techniques.

Clay Breshears, The Art of Concurrency, O'Reilly, 2009

state of cross-platform implicit concurrency in C++

 

OpenMP

  • gcc 4+
  • clang 4+
  • MSVC 2

C++17 parallel TS

  • gcc ??
  • clang ??
  • MSVC

Code Bloat with C++17 ?


std::vector<T> a(size), b(size);

#pragma omp parallel for num_threads(n) static(chunksize=42)
for(int i = 0;i < size;++i)
    a[i] = foo(b[i]);

std::transform( std::par, b.cbegin(), b.cend(),
                          a.begin(), 
                          []( auto & el){ return foo(el);} 
              );

 

Executors in C++20?

Summary

Performance Portability?

  • compression is a must have for 21st century science and data services

  • Modern C++11/14/17 is the goto tool for high performance applications

  • Ecosystem needs more flexible tooling to adapt to ever changing hardware

 

C++ needs to come with it's own portable batteries!

Backup

Better Compression with video codecs?

P Steinbach, High-bandwidth 3D Image Compression to Boost Predictive Life Sciences, 2017

perf Reloaded with FlameGraphs


$ perf record -g ./my-slow-binary
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.023 MB perf.data (75 samples) ]
$ perf script > out.perf
$ ./stackcollapse-perf.pl out.perf > out.folded
$ ./flamegraph.pl out.folded > perf_samples.svg
  • visualisation technique conceived by Brendan Gregg (Netflix)
  • seamless integration into perf, dtrace, systemtap, XCode Instruments, Lightweight Java Profiler, Microsoft Visual Studio profiles, ...
  • based on collected counter samples and the stacktrace they were collected in

Sqeazy Encoding as FlameGraph

Your browser does not support SVG

  • (x axis) current stack level in alphabetical order
  • (y axis) number of samples in that stacktrace level

flamegraphs on Windows

  • based on Event Tracing for Windows
  • UIforETW to collect traces
  • very versatile profiler (all system, process-only)
  • can show much more information than flamegraphs