Peter Steinbach, September 13, 2018, Dresden, Germany



These slides are open-source:

Convolutions:
\(y_i = \sum_{n=-N_k}^{N_k} x_{i+n} \, k_n\)

Matrix Operations:
\(AB = Y, \quad y_{ij} = \sum_{k} a_{ik} \, b_{kj}\)
What do they have in common?
A dot-product structure!
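A minimal NumPy sketch of that shared structure (one-sided window indexing for brevity): every convolution output and every matrix-product entry reduces to a dot product.

```python
import numpy as np

x = np.random.rand(16)  # 1-D signal
k = np.random.rand(3)   # kernel

# convolution: each y_i is the dot product of a window of x with k
y = np.array([x[i:i + k.size].dot(k) for i in range(x.size - k.size + 1)])
assert np.allclose(y, np.convolve(x, k[::-1], mode="valid"))

A = np.random.rand(4, 5)
B = np.random.rand(5, 6)

# matrix product: each y_ij is the dot product of row i of A and column j of B
Y = np.array([[A[i].dot(B[:, j]) for j in range(B.shape[1])]
              for i in range(A.shape[0])])
assert np.allclose(Y, A.dot(B))
```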

Low Latency Matters Most
Latency Hiding


Nvidia’s stock price over the last few years

an executive deciding which new hardware to buy
a developer comparing framework performance
a user wanting to, e.g., classify images (fast or slow?)
(deep learning) applications try to solve a problem
with a model written in a particular software framework
running on particular hardware for training and/or inference
Fix at least 2 of the 3 above!
open-source and community driven
key requirement: reach a fixed accuracy when training on a fixed dataset (time-to-accuracy; see the sketch after this list)
data from 1 run only
submitter can choose the model implementation
data inconsistent (K80 cloud-only, P100 bare-metal only)
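A minimal sketch of the time-to-accuracy idea, assuming a Keras model compiled with metrics=['accuracy']; the callback and the 0.93 target are illustrative, not the benchmark's reference code:

```python
import time
from keras.callbacks import Callback

class TimeToAccuracy(Callback):
    """Stop training and record wall-clock time once validation
    accuracy first reaches a fixed target."""

    def __init__(self, target=0.93, monitor="val_acc"):
        super(TimeToAccuracy, self).__init__()
        self.target = target
        self.monitor = monitor
        self.elapsed = None

    def on_train_begin(self, logs=None):
        self.start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        acc = (logs or {}).get(self.monitor)
        if self.elapsed is None and acc is not None and acc >= self.target:
            self.elapsed = time.time() - self.start
            self.model.stop_training = True

# usage: model.fit(..., validation_data=..., callbacks=[TimeToAccuracy()])
```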
open-source and community driven
industry support (AMD, Google, Intel, …)
goal: SPEC benchmark for Deep Learning
data = best of 5 runs
usable benchmark with clear semver support
model code is fixed
ResNet (v1, v2), CARE Denoising network
Keras+TensorFlow or just TensorFlow
single- and multi-GPU training (distributed training planned); see the Keras sketch below
data will be open-sourced once I find a sponsor
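A minimal sketch of the multi-GPU path in Keras 2.1.x via multi_gpu_model (plain data parallelism); ResNet50 stands in here for the benchmark's ResNet variants, it is not the fixed model code itself:

```python
from keras.applications.resnet50 import ResNet50
from keras.utils import multi_gpu_model

# the single-GPU model definition stays untouched ...
model = ResNet50(weights=None)

# ... and gets replicated across GPUs; each batch is split between replicas
# (requires at least 4 visible GPUs at runtime)
parallel = multi_gpu_model(model, gpus=4)
parallel.compile(optimizer="sgd",
                 loss="categorical_crossentropy",
                 metrics=["accuracy"])
```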
local cluster: Taurus at TU Dresden
single GPU node:
Intel Xeon E5-2680 v3 (12 cores)
64 GB RAM
4x Nvidia Tesla K80 GPUs
local servers (Nvidia Titan Xp, Nvidia Tesla P100)

Singularity container = Keras 2.1.5 + TensorFlow 1.3.0
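A quick sanity check to run inside the container; the pinned versions come from the line above, the check itself is just an assumed workflow step:

```python
# benchmark results are only comparable if the environment matches
import keras          # prints "Using TensorFlow backend."
import tensorflow as tf

assert keras.__version__ == "2.1.5", keras.__version__
assert tf.__version__ == "1.3.0", tf.__version__
```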


GCE, single K80 instance, 1 vCPU, 6 GB RAM, 10 GB disk

deep learning requires a lot of parallel compute power
GPUs and similar accelerators are indispensable tools
the hardware/framework landscape is diverse
solid benchmarks save time & money