High-bandwidth 3D image compression to boost predictive life sciences

Peter Steinbach, Jeffrey Kelling (presenter)
(Scionics Computer Innovation GmbH, Helmholtz-Zentrum Dresden-Rossendorf)
steinbach@scionics.de

May 11, 2017

Before I start

Jeffrey != Peter

presenter

Jeffrey Kelling (HZDR)

author

Peter Steinbach (Scionics)

Scionics Who?

This Talk is

Outline

  1. Scientific Motivation

  2. Sqeazy library

  3. Results

Big Data Deluge in Systems Biology

SPIM

Selective Plane Illumination Microscopy

Biologists love this!

3D rendering of Drosophila embryogenesis time-lapse data, reconstructed from a 5-angle SPIM recording

credits to Pavel Tomancak (MPI CBG)

But ...

Design Draft of a modern SPIM microscope, credits Nicola Maghelli (MPI CBG, Myers lab)
  • today:

    • each CMOS camera can record 850 MB/s of 16-bit grayscale pixels
    • 2 cameras per scope: 1.7 GB/s combined
  • scientists would like to capture long time-lapses of 1-2 days (or more)

  • total data volume per 1-2 day capture:

150-300 TiB raw volume

= 57 - 114 kEUR in SSDs
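
The quoted volume follows from simple arithmetic; a quick back-of-the-envelope check (Python, assuming the 2 x 850 MB/s figure above and decimal terabytes):

```python
# Back-of-the-envelope check of the quoted raw data volume.
CAMERA_RATE = 850e6        # bytes/s per CMOS camera, 16-bit grayscale
N_CAMERAS = 2              # cameras per scope
SECONDS_PER_DAY = 86_400
TB = 1e12                  # decimal terabyte

rate = CAMERA_RATE * N_CAMERAS            # 1.7 GB/s combined
one_day = rate * SECONDS_PER_DAY / TB     # ~147 TB
two_days = 2 * one_day                    # ~294 TB
print(f"{one_day:.0f}-{two_days:.0f} TB per 1-2 day capture")
```

which lands in the quoted 150-300 range.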

IT to the rescue

Does that scale?

Sqeazy

Open-source Compression Library

Yet another compression library?

  • heart of sqeazy: a pipeline mechanism
    • transform the data so that it compresses best
    • use proven, fast encoders at the end of the pipeline, e.g. zstd, lz4, blosc, ...
      (use them, don't reinvent them!)
  • do it fast! (multi-core, SIMD)

  • written in C++11 (soon C++14)
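
The pipeline idea above can be sketched in a few lines. This is an illustrative Python stand-in, not sqeazy's actual code: a delta transform feeds a proven general-purpose encoder (stdlib zlib here, standing in for zstd/lz4/blosc):

```python
import zlib

def delta_transform(data: bytes) -> bytes:
    """Store each byte as the difference to its predecessor.
    Smooth data becomes runs of small values, which compress better."""
    out, prev = bytearray(len(data)), 0
    for i, b in enumerate(data):
        out[i] = (b - prev) & 0xFF
        prev = b
    return bytes(out)

def delta_inverse(data: bytes) -> bytes:
    out, prev = bytearray(len(data)), 0
    for i, b in enumerate(data):
        prev = (prev + b) & 0xFF
        out[i] = prev
    return bytes(out)

def encode(data: bytes) -> bytes:
    # transform first, then hand off to an existing encoder
    return zlib.compress(delta_transform(data), level=6)

def decode(blob: bytes) -> bytes:
    return delta_inverse(zlib.decompress(blob))

# a smooth ramp signal: the delta transform turns it into a tiny
# repeating pattern that the back-end encoder shrinks drastically
signal = bytes((i // 7) % 256 for i in range(10_000))
assert decode(encode(signal)) == signal
```

The point is the division of labour: the pipeline front-end only reshapes the data; the battle-tested encoder at the end does the actual compression.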

Can we do better?

3D in space = 2D in space + time!

  • the multimedia industry and video codec research have worked in the high-bandwidth/low-latency regime for years
  • reuse their expertise through freely available codec libraries
  • currently looking into h264/MPEG-4 AVC and h265/HEVC; others are possible

Challenge: SPIM data

  • raw data is encoded as grey16
  • pixel intensities occupy more than 8 bits
    mean +/- std = 11 +/- 3 bits

Solution: Quantize data

  • lossy, bucket-based quantisation
    (16 -> 8 bits per pixel transformation)
  • minimal quality loss
  • 8 bits per channel is the standard input for video codecs
  • resulting bandwidth is low enough to handle 8 cameras
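
A minimal sketch of such a 16-to-8-bit bucket quantisation (illustrative Python, not sqeazy's actual scheme: equal-width buckets, decoded through a lookup table of bucket centres shipped with the codes):

```python
def quantize(pixels, n_buckets=256):
    """Map 16-bit intensities into n_buckets equal-width buckets."""
    lo, hi = min(pixels), max(pixels)
    width = (hi - lo) // n_buckets + 1
    codes = bytes((p - lo) // width for p in pixels)
    # decode table: centre of each bucket, shipped alongside the codes
    lut = [lo + b * width + width // 2 for b in range(n_buckets)]
    return codes, lut

def dequantize(codes, lut):
    return [lut[c] for c in codes]

pixels = [100, 150, 5000, 40000, 65535]
codes, lut = quantize(pixels)
restored = dequantize(codes, lut)
# reconstruction error is bounded by one bucket width
width = (max(pixels) - min(pixels)) // 256 + 1
assert all(abs(p - r) <= width for p, r in zip(pixels, restored))
```

Since SPIM intensities occupy only ~11 bits of the 16-bit range, 256 buckets cover the used dynamic range with little loss.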

ffmpeg

  • sqeazy interfaces to video codecs through the ffmpeg framework to

    • support CPU- and GPU-based encoding/decoding

    • enable future directions to non-x86 platforms

    • Linux, macOS, Windows supported

  • the libavcodec API has a steep learning curve

  • for this talk: ffmpeg 3.0.7

hardware accelerated codecs

  • our production environment: Windows (microscope) and Linux (HPC) based
  • hardly any single library supports hardware-accelerated video encoding uniformly across platforms

  • ffmpeg+nvenc meets our production requirements

  • encapsulates external dependencies (easier comparison)

Results

benchmark platform

hardware

software

what I measured

  • simple ffmpeg-based workflow performed on all datasets:

    1. quantize .tif images to YUV 4:2:0 with sqeazy (produces input.y4m)
    2. encode input.y4m with ffmpeg (timed; input/output files in ramdisk)
    3. decode encoded.raw to obtain roundtrip.y4m
    4. compare the quality of input.y4m and roundtrip.y4m


  • all timings based on /usr/bin/time unless stated otherwise
  • orchestration on our HPC infrastructure with snakemake
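
Step 4 above needs a quality metric; a common choice is PSNR between the original and roundtripped frames. A sketch over raw 8-bit samples (illustrative Python; not necessarily the exact metric used in the study):

```python
import math

def psnr(original, roundtrip, max_val=255):
    """Peak signal-to-noise ratio in dB between two equal-length
    8-bit sample sequences; higher is better, inf means lossless."""
    assert len(original) == len(roundtrip)
    mse = sum((a - b) ** 2 for a, b in zip(original, roundtrip)) / len(original)
    if mse == 0:
        return float("inf")   # identical: lossless roundtrip
    return 10 * math.log10(max_val ** 2 / mse)

# small example: 1 unit of noise on (almost) every sample
orig = list(range(256))
noisy = [min(255, v + 1) for v in orig]
print(f"{psnr(orig, noisy):.1f} dB")   # ~48 dB
```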

CPU only

  • x264 is fast, but doesn't provide high compression
  • x265 is slow, but does provide high compression

  • codec preset study ongoing with downstream analysis/processing

GPUs to the rescue?

compare timings

$ time ffmpeg -i input.y4m -c:v nvenc_h264 -preset llhp -2pass 0 ...
$ nvprof --print-api-trace ffmpeg -i input.y4m -c:v nvenc_h264   ...
  • nvprof API trace: time delta between cuCtxCreate and cuCtxDestroy

  • the nvenc codec consumes only 30-50% of the ffmpeg process time

  • ffmpeg induces quite some overhead on top of nvenc!

GPU enhanced encoding

  • here:
    cuCtxCreate/Destroy based timing
  • nvenc offers improved compression ratios in comparison to libx26{4,5} (preset definitions differ)
  • nvenc bandwidths are surprisingly low

Profiling details

$ nvprof ffmpeg -i input.y4m -c:v nvenc_h264 -preset llhp -2pass 0 -gpu 1 -y output.h264
  • unsurprisingly: nvenc encoding is bound by host-device transfers (90%)
Can it still be that slow?

GPU enhanced encoding (cont.)

  • here:
    timing from Nvidia Video SDK NvEncodeLowLatency
  • nvenc superior to libx26{4,5}
  • NvEncodeLowLatency timings:

    • exclude driver initialisation
    • exclude memory initialisation

Summary

high-bandwidth 3D image compression

  • tough business given modern CMOS cameras (around 1 GB/s at 16-bit greyscale)

  • multi-core implementations very competitive
    (either in compression ratio or speed)

    • many codecs available

    • many configuration parameters

    • many bit depths emerging (8, 10, 12 bits)

GPUs for 3D image compression?

  • nvenc through ffmpeg difficult to use/measure
    (memory traffic, implementation quality poor?)

  • raw nvenc API suitable for high-bandwidth compression

    • NvEncodeLowLatency timings ignore driver and memory initialisation
      (represents a constant streaming/encoding scenario)

    • nvenc API useful on the microscope only, i.e. in streaming mode
      (at best if compression pipeline is on the device as well)

    • PCIe bus apparently a bottleneck

Thank you!

For questions, concerns or suggestions:

Open an issue, please!