High-bandwidth 3D image compression to boost predictive life sciences

Peter Steinbach, Jeffrey Kelling (presenter)
(Scionics Computer Innovation GmbH, Helmholtz-Zentrum Dresden-Rossendorf)
steinbach@scionics.de

May 11, 2017

Before I start

Jeffrey != Peter

presenter

Jeffrey Kelling (HZDR)

author

Peter Steinbach (Scionics)

Scionics Who?

This Talk is

Outline

  1. Scientific Motivation

  2. Sqeazy library

  3. Results

Big Data Deluge in Systems Biology

SPIM

Selective Plane Illumination Microscopy

Biologists love this!

3D rendering of Drosophila embryogenesis time-lapse data, reconstructed from a 5-angle SPIM recording

credits to Pavel Tomancak (MPI CBG)

But ...

Design Draft of a modern SPIM microscope, credits Nicola Maghelli (MPI CBG, Myers lab)
  • today:

    • each CMOS camera can record 850 MB/s of 16-bit grayscale pixels
    • 2 cameras per scope: 1.7 GB/s combined
  • scientists would like to capture long time-lapses of 1-2 days (or more)

  • total data volume per 1-2 day capture:

150-300 TiB raw volume

= 57 - 114 kEUR in SSDs
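
The quoted volume follows from simple arithmetic; a quick back-of-the-envelope check (Python, assuming the 2 x 850 MB/s figure above and decimal terabytes):

```python
# Back-of-the-envelope check of the quoted raw data volume.
CAMERA_RATE = 850e6        # bytes/s per CMOS camera, 16-bit grayscale
N_CAMERAS = 2              # cameras per scope
SECONDS_PER_DAY = 86_400
TB = 1e12                  # decimal terabyte

rate = CAMERA_RATE * N_CAMERAS            # 1.7 GB/s combined
one_day = rate * SECONDS_PER_DAY / TB     # ~147 TB
two_days = 2 * one_day                    # ~294 TB
print(f"{one_day:.0f}-{two_days:.0f} TB per 1-2 day capture")
```

which lands in the quoted 150-300 range.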

IT to the rescue

Does that scale?

Sqeazy

Open-source Compression Library

Yet another compression library?

  • heart of sqeazy: a pipeline mechanism
    • transform the data so that it compresses best
    • use proven, fast encoders at the end of the pipeline, e.g. zstd, lz4, blosc, ...
      (use them, don't reinvent them!)
  • do it fast! (multi-core, SIMD)

  • written in C++11 (soon C++14)
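
The pipeline idea above can be sketched in a few lines. This is an illustrative Python stand-in, not sqeazy's actual code: a delta transform feeds a proven general-purpose encoder (stdlib zlib here, standing in for zstd/lz4/blosc):

```python
import zlib

def delta_transform(data: bytes) -> bytes:
    """Store each byte as the difference to its predecessor.
    Smooth data becomes runs of small values, which compress better."""
    out, prev = bytearray(len(data)), 0
    for i, b in enumerate(data):
        out[i] = (b - prev) & 0xFF
        prev = b
    return bytes(out)

def delta_inverse(data: bytes) -> bytes:
    out, prev = bytearray(len(data)), 0
    for i, b in enumerate(data):
        prev = (prev + b) & 0xFF
        out[i] = prev
    return bytes(out)

def encode(data: bytes) -> bytes:
    # transform first, then hand off to an existing encoder
    return zlib.compress(delta_transform(data), level=6)

def decode(blob: bytes) -> bytes:
    return delta_inverse(zlib.decompress(blob))

# a smooth ramp signal: the delta transform turns it into a tiny
# repeating pattern that the back-end encoder shrinks drastically
signal = bytes((i // 7) % 256 for i in range(10_000))
assert decode(encode(signal)) == signal
```

The point is the division of labour: the pipeline front-end only reshapes the data; the battle-tested encoder at the end does the actual compression.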

Can we do better?

3D in space = 2D in space + time!

  • the multimedia industry and video codec research have worked in the high-bandwidth/low-latency regime for years
  • reuse their expertise through freely available codec libraries
  • currently looking into h264/MPEG-4 AVC and h265/HEVC; others are possible

Challenge: SPIM data

  • raw data is encoded as grey16
  • pixel intensities occupy more than 8 bits
    mean +/- std = 11 +/- 3 bits

Solution: Quantize data

  • lossy, bucket-based quantisation
    (16 -> 8 bits per pixel transformation)
  • minimal quality loss
  • 8 bits per channel is the standard input for video codecs
  • resulting bandwidth is low enough to handle 8 cameras
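
A minimal sketch of such a 16-to-8-bit bucket quantisation (illustrative Python, not sqeazy's actual scheme: equal-width buckets, decoded through a lookup table of bucket centres shipped with the codes):

```python
def quantize(pixels, n_buckets=256):
    """Map 16-bit intensities into n_buckets equal-width buckets."""
    lo, hi = min(pixels), max(pixels)
    width = (hi - lo) // n_buckets + 1
    codes = bytes((p - lo) // width for p in pixels)
    # decode table: centre of each bucket, shipped alongside the codes
    lut = [lo + b * width + width // 2 for b in range(n_buckets)]
    return codes, lut

def dequantize(codes, lut):
    return [lut[c] for c in codes]

pixels = [100, 150, 5000, 40000, 65535]
codes, lut = quantize(pixels)
restored = dequantize(codes, lut)
# reconstruction error is bounded by one bucket width
width = (max(pixels) - min(pixels)) // 256 + 1
assert all(abs(p - r) <= width for p, r in zip(pixels, restored))
```

Since SPIM intensities occupy only ~11 bits of the 16-bit range, 256 buckets cover the used dynamic range with little loss.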

ffmpeg

  • sqeazy interfaces to video codecs through the ffmpeg framework to

    • support CPU- and GPU-based encoding/decoding

    • enable future directions to non-x86 platforms

    • Linux, macOS, Windows supported

  • the libavcodec API has a steep learning curve

  • for this talk: ffmpeg 3.0.7

hardware accelerated codecs

  • our production environment: Windows (microscope) and Linux (HPC) based
  • hardly any single library supports hardware-accelerated video encoding uniformly across platforms

  • ffmpeg+nvenc meets our production requirements

  • encapsulates external dependencies (easier comparison)

Results

benchmark platform

hardware

software

what I measured

  • simple ffmpeg-based workflow performed on all datasets:

    1. quantize .tif images to YUV 4:2:0 with sqeazy (produces input.y4m)
    2. encode input.y4m with ffmpeg (timed; input/output files in ramdisk)
    3. decode encoded.raw to obtain roundtrip.y4m
    4. compare the quality of input.y4m and roundtrip.y4m


  • all timings based on /usr/bin/time unless stated otherwise
  • orchestration on our HPC infrastructure with snakemake
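
Step 4 above needs a quality metric; a common choice is PSNR between the original and roundtripped frames. A sketch over raw 8-bit samples (illustrative Python; not necessarily the exact metric used in the study):

```python
import math

def psnr(original, roundtrip, max_val=255):
    """Peak signal-to-noise ratio in dB between two equal-length
    8-bit sample sequences; higher is better, inf means lossless."""
    assert len(original) == len(roundtrip)
    mse = sum((a - b) ** 2 for a, b in zip(original, roundtrip)) / len(original)
    if mse == 0:
        return float("inf")   # identical: lossless roundtrip
    return 10 * math.log10(max_val ** 2 / mse)

# small example: 1 unit of noise on (almost) every sample
orig = list(range(256))
noisy = [min(255, v + 1) for v in orig]
print(f"{psnr(orig, noisy):.1f} dB")   # ~48 dB
```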

CPU only

  • x264 is fast, but doesn't provide high compression
  • x265 is slow, but does provide high compression

  • codec preset study ongoing with downstream analysis/processing

GPUs to the rescue?

compare timings

$ time ffmpeg -i input.y4m -c:v nvenc_h264 -preset llhp -2pass 0 ...
$ nvprof --print-api-trace ffmpeg -i input.y4m -c:v nvenc_h264   ...
  • nvprof API trace: time delta between cuCtxCreate and cuCtxDestroy

  • the nvenc codec consumes only 30-50% of the ffmpeg process time

  • ffmpeg induces quite some overhead on top of nvenc!

GPU enhanced encoding

  • here:
    cuCtxCreate/Destroy based timing
  • nvenc offers improved compression ratios in comparison to libx26{4,5} (preset definitions differ)
  • nvenc bandwidths are surprisingly low

Profiling details

$ nvprof ffmpeg -i input.y4m -c:v nvenc_h264 -preset llhp -2pass 0 -gpu 1 -y output.h264
  • unsurprisingly: nvenc encoding is bound by host-device transfers (90%)
Can it still be that slow?

GPU enhanced encoding (cont.)

  • here:
    timing from Nvidia Video SDK NvEncodeLowLatency
  • nvenc superior to libx26{4,5}
  • NvEncodeLowLatency timings:

    • exclude driver initialisation
    • exclude memory initialisation

Summary

high-bandwidth 3D image compression

  • tough business given modern CMOS cameras (around 1 GB/s at 16-bit greyscale)

  • multi-core implementations very competitive
    (either in compression ratio or speed)

    • many codecs available

    • many configuration parameters

    • many bit depths emerging (8, 10, 12 bits)

GPUs for 3D image compression?

  • nvenc through ffmpeg difficult to use/measure
    (memory traffic, implementation quality poor?)

  • raw nvenc API suitable for high-bandwidth compression

    • NvEncodeLowLatency timings ignore driver and memory initialisation
      (represents a constant streaming/encoding scenario)

    • nvenc API useful on the microscope only, i.e. in streaming mode
      (at best if compression pipeline is on the device as well)

    • PCIe bus apparently a bottleneck

Thank you!

For questions, concerns or suggestions:

Open an issue, please!