Peter Steinbach, September 13, 2018, Dresden, Germany
These slides are open-source:
Convolutions | Matrix Operations
Convolutions
\(y_i = \sum_{n = 0}^{N_k} x_{i \pm n} \cdot k_n \)
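A minimal NumPy sketch of the 1-D convolution above (valid mode, no padding; the window handling is an assumption, actual frameworks add padding and strides):

```python
import numpy as np

def conv1d(x, k):
    """Valid-mode 1-D convolution: each output element is a dot product
    of the kernel with a sliding window of the signal."""
    n_k = len(k)
    return np.array([np.dot(x[i:i + n_k], k)
                     for i in range(len(x) - n_k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([1.0, 0.0, -1.0])
print(conv1d(x, k))  # -> [-2. -2. -2.]
```

Equivalent to `np.convolve(x, k[::-1], mode="valid")`, since `np.convolve` flips its second argument.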
Matrix Operations
\(AB = Y,\; y_{ij} = \sum_{k} a_{ik} \cdot b_{kj} \)
What do they have in common?
Dot Product Structure!
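A quick NumPy check of that shared structure: every entry \(y_{ij}\) of a matrix product is a dot product of a row of \(A\) with a column of \(B\), the same reduction a convolution performs over each sliding window.

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)

# y_ij = dot(row i of A, column j of B) -- the same kernel-times-window
# reduction that drives a convolution.
Y = np.array([[np.dot(A[i, :], B[:, j]) for j in range(B.shape[1])]
              for i in range(A.shape[0])])
assert np.allclose(Y, A @ B)
```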
Low Latency Matters Most
Latency Hiding
Nvidia’s stock price over the last years
Executive deciding what new hardware to buy
Developer comparing framework performance
User wanting to e.g. classify images (fast or slow?)
(deep learning) applications try to solve a problem
a model written in a particular software framework
running on particular hardware for training and/or inference
Fix at least 2 of the 3 above!
open-source and community driven
key requirement: reach a fixed accuracy when training on a fixed dataset
data from 1 run only
submitter can choose model implementation
data inconsistent (K80 cloud-only, P100 bare-metal only)
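The time-to-accuracy criterion above can be sketched as a training loop that stops once a target accuracy is reached on a fixed dataset. This is a toy stand-in (logistic regression on synthetic blobs), not any benchmark's actual workload:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a fixed dataset: two well-separated 2-D blobs.
X = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

w, b, lr, target_acc = np.zeros(2), 0.0, 0.5, 0.99
for epoch in range(10_000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # logistic model
    acc = np.mean((p > 0.5) == y)
    if acc >= target_acc:                    # time-to-accuracy criterion
        break
    grad = p - y                             # gradient of the log loss
    w -= lr * (X.T @ grad) / len(y)
    b -= lr * grad.mean()
print(f"target accuracy {target_acc:.0%} reached after {epoch} epochs")
```

The benchmark metric is then the wall-clock time until the `break`, not the loss value itself.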
open-source and community driven
industry support (AMD, Google, Intel, …)
goal: SPEC benchmark for Deep Learning
data = best of 5 runs
usable benchmark with clear semver support
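The "best of 5 runs" policy can be sketched as a tiny timing harness; the workload below is a placeholder, not the actual training step:

```python
import time

def best_of(fn, runs=5):
    """Return the fastest wall-clock time over several runs,
    damping warm-up and scheduler noise (the 'best of 5' policy)."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

# Placeholder workload standing in for one training step.
workload = lambda: sum(i * i for i in range(100_000))
print(f"best of 5: {best_of(workload):.4f} s")
```

Reporting the minimum rewards the machine's steady-state speed; reporting the mean would instead fold cold-start effects into the result.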
model code is fixed
ResNet (v1, v2), CARE Denoising network
Keras+TensorFlow or just TensorFlow
single and multi-gpu training (distributed planned)
data will be open-sourced once I find a sponsor
local cluster: Taurus at TU Dresden
single GPU node:
Intel Xeon E5-2680 v3 12c
64GB RAM
4x Nvidia Tesla K80 GPU
local servers (Nvidia Titan Xp, Nvidia Tesla P100)
Singularity container = Keras 2.1.5 + TensorFlow 1.3.0
GCE, single K80 instance, 1vCPU, 6GB RAM, 10GB disk
deep learning requires a lot of parallel compute power
GPUs and other accelerators are indispensable tools
hardware/framework landscape diverse
solid benchmarks save time & money