ROCm with ROCr, right?

Peter Steinbach
(Scionics Computer Innovation GmbH)
steinbach@scionics.de

March 30, 2017

Before I start

Scionics Who?

Why parallel2017?

Nvidia Tesla
AMD FirePro
Intel MIC

What should our clients choose?

Share experiences with ROCm/ROCr with the community!

Why am I presenting?

Accelerating our clients' scientific algorithms on GPUs
(multi-GB dataset, a lot of FFTs)

This Talk is

github.com/psteinb/parallel2017

This Talk != Advertisement

  • Our company is by no means financially tied to AMD or any of its resellers.
  • AMD provided test hardware, and that's it.
  • Whatever I find missing or not working, I'll report it here.

 

Use the issue tracker of this talk to correct me!

github.com/psteinb/parallel2017

Outline

  1. ROCm

  2. Porting Code from CUDA

  3. HC

ROCm

Radeon Open Compute Platform

  • very young:
    April 25th, 2016, version 1.0
  • 3 main components:

    • ROCm Linux kernel driver
    • ROCr runtime & library stack
    • HCC compiler based on LLVM
Open Source!

ROCm kernel driver

  • supported GPUs: GFX8 GPUs (Fiji & Polaris families)
  • supported CPUs:

    • Intel Xeon E3/E5, Core i3/5/7 Haswell or newer
    • (upcoming) AMD Naples/Ryzen
    • (upcoming) Cavium ThunderX (ARM)

 

  • large memory single allocation
    (>32GB in one pointer)
  • peer-to-peer Multi-GPU, RDMA
  • systems management API and tooling

ROCr runtime

  • AMD's implementation of the HSA runtime
    (+ extensions for multi-GPU)
  • user mode queues
  • flat memory addressing
  • atomic memory transactions & signals
  • process concurrency & preemption
  • device discovery
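
To give a flavour of the ROCr side, here is a minimal device-discovery sketch against the HSA 1.0 C API; the header path hsa/hsa.h and the 64-byte name buffer are assumptions about the ROCm install.

// minimal ROCr/HSA device discovery sketch
#include <hsa/hsa.h>
#include <cstdio>

static hsa_status_t print_agent(hsa_agent_t agent, void*) {
  char name[64] = {0};
  hsa_agent_get_info(agent, HSA_AGENT_INFO_NAME, name);     // agent name
  hsa_device_type_t type;
  hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);  // CPU, GPU or DSP
  std::printf("agent: %s (%s)\n", name,
              type == HSA_DEVICE_TYPE_GPU ? "GPU" : "CPU/other");
  return HSA_STATUS_SUCCESS;
}

int main() {
  hsa_init();                               // bring up the ROCr runtime
  hsa_iterate_agents(print_agent, nullptr); // visit every HSA agent in the system
  hsa_shut_down();
}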

Heterogeneous Compute Compiler

  • hcc compiler for supported APIs:

    • OpenMP (OpenMP4 accelerator offloading in development)
    • HIP & HC
    • OpenCL
  • LLVM native GCN ISA code generation
  • offline compilation support
  • standardized loader and code object format
  • GCN ISA assembler and disassembler

Prologue

UoB-HPC/GPU-STREAM

/* copy  */ c[:]    = a[:]
/* mul   */ b[:]    = scalar*c[:]
/* add   */ c[:]    = a[:] + b[:]
/* triad */ a[:]    = b[:] + scalar*c[:] 
/* dot   */ scalar  = dot(a[:],b[:])

 

  • benchmark of various programming paradigms:
    OpenMP3, OpenMP4, CUDA, Kokkos, Raja, OpenCL, ...
  • for now *nix only
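
For reference, a minimal serial C++ sketch of the five kernels listed above (container sizes and types are illustrative):

// serial reference for the five GPU-STREAM kernels
#include <vector>
#include <numeric>

template <class T>
void stream_reference(std::vector<T>& a, std::vector<T>& b,
                      std::vector<T>& c, T scalar, T& sum) {
  for (size_t i = 0; i < a.size(); ++i) c[i] = a[i];                // copy
  for (size_t i = 0; i < a.size(); ++i) b[i] = scalar * c[i];       // mul
  for (size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];         // add
  for (size_t i = 0; i < a.size(); ++i) a[i] = b[i] + scalar*c[i];  // triad
  sum = std::inner_product(a.begin(), a.end(), b.begin(), T{0});    // dot
}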

Porting Code from CUDA

Hipify

  • hipify converts CUDA to portable C++ (HIP)
  • C++ kernel language (C++11/14/17 features)
  • C runtime API
  • same performance as native CUDA

 

  • supports the most commonly used parts of CUDA:
    streams, events, memory (de-)allocation, profiling
    (see the sketch after this list)

  • produced apps have full tool support:
    • CUDA: nvcc, nvprof, nvvp
    • ROCm: hcc, rocm-prof, CodeXL
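
To give a flavour of that API surface, a minimal HIP host-side sketch using memory allocation, a stream and timing events; error handling is omitted and the buffer size N is illustrative.

// minimal HIP host-side sketch: device memory, a stream, two timing events
#include <hip/hip_runtime.h>
#include <vector>

void hip_host_sketch() {
  const size_t N = 1 << 20;
  float* d_a = nullptr;
  hipMalloc(reinterpret_cast<void**>(&d_a), N * sizeof(float)); // device allocation

  std::vector<float> h_a(N, 1.0f);
  hipStream_t stream;
  hipStreamCreate(&stream);

  hipEvent_t start, stop;
  hipEventCreate(&start);
  hipEventCreate(&stop);

  hipEventRecord(start, stream);
  hipMemcpyAsync(d_a, h_a.data(), N * sizeof(float),
                 hipMemcpyHostToDevice, stream);                // async host->device copy
  hipEventRecord(stop, stream);
  hipEventSynchronize(stop);

  float ms = 0.f;
  hipEventElapsedTime(&ms, start, stop);                        // elapsed milliseconds

  hipEventDestroy(start); hipEventDestroy(stop);
  hipStreamDestroy(stream);
  hipFree(d_a);
}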

CUDA Example

template <typename T>
__global__ void add_kernel(const T * a,
                           const T * b,
                           T * c){
  // one thread per element: global index from block/thread coordinates
  const int i = blockDim.x * blockIdx.x + threadIdx.x;
  c[i] = a[i] + b[i];
}

template <class T>
void CUDAStream<T>::add(){
  // launch array_size/TBSIZE blocks of TBSIZE threads each
  add_kernel<<<array_size/TBSIZE, TBSIZE>>>(d_a, d_b, d_c);
  check_error();  //..
}

Hipified Example

template <typename T>
__global__ void add_kernel(hipLaunchParm lp,
                           const T * a, const T * b,
                           T * c){
  // hip* builtins replace the CUDA blockDim/blockIdx/threadIdx variables
  const int i = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;
  c[i] = a[i] + b[i];
}

template <class T>
void HIPStream<T>::add(){
  // grid dim, block dim, dynamic shared memory, stream, then the kernel arguments
  hipLaunchKernel(HIP_KERNEL_NAME(add_kernel),
                  dim3(array_size/TBSIZE), dim3(TBSIZE), 0, 0,
                  d_a, d_b, d_c);
  check_error();  //...
}

HIP summary

  • very interesting tool to get started with production or legacy code

  • still low-level CUDA programming

  • HIP library eco-system available: hipBlas, hipFFT, hipRNG, machine learning acceleration, ...
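
As a taste of that ecosystem, a hedged hipBlas sketch; the header name and the SAXPY signature are assumed to mirror cuBLAS (hipblasCreate / hipblasSaxpy / hipblasDestroy), so treat the details as illustrative rather than authoritative.

// hedged hipBLAS sketch: SAXPY (y = alpha*x + y) on device pointers
#include <hipblas.h>  // header name assumed

void saxpy_on_device(int n, float alpha, const float* d_x, float* d_y) {
  hipblasHandle_t handle;
  hipblasCreate(&handle);                          // create the library context
  hipblasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1); // strides of 1 in x and y
  hipblasDestroy(handle);
}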

Heterogeneous Compute

HC

  • C++ parallel runtime and API
  • based on C++AMP in the hc namespace plus C++14
  • (asynchronous) copy commands for host-device I/O
  • explicit pointer-based memory allocation (am_alloc / am_free)
  • hc::accelerator_view, hc::array_view, hc::completion_future
  • device-specific 'intrinsics' (wavefront shuffle, bit extraction, atomics)

 

Very similar in spirit to Thrust, Boost.Compute, and SYCL.
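
A minimal HC sketch in that style, assuming the C++AMP-style array_view constructor over a host container and the constructs used on the next slides (parallel_for_each, [[hc]], completion_future):

// minimal HC sketch: scale a host vector on the accelerator through an array_view
#include <hc.hpp>
#include <vector>

void scale_on_device(std::vector<float>& data, float factor) {
  const int n = static_cast<int>(data.size());
  hc::array_view<float, 1> av(n, data);            // wraps host memory
  hc::completion_future done =
      hc::parallel_for_each(hc::extent<1>(n),
                            [=](hc::index<1> i) [[hc]] {
                              av[i] *= factor;     // runs on the accelerator
                            });
  done.wait();                                     // kernel launches return a future
  av.synchronize();                                // make results visible on the host
}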

HC API Overview

HC in GPU-STREAM, Declaration

#include "Stream.h"
#include "hc.hpp"

template <class T>
class HCStream : public Stream<T>
{
protected:
  unsigned int array_size;
  hc::array<T,1> d_a;
  hc::array<T,1> d_b;
  hc::array<T,1> d_c;
  //...

HC in GPU-STREAM, Init Data

template <class T>
void HCStream<T>::init_arrays(T _a, T _b, T _c)
{
    hc::array_view<T,1> view_a(this->d_a);
    hc::parallel_for_each(hc::extent<1>(array_size)
                                , [=](hc::index<1> i) [[hc]] {
                                  view_a[i] = _a;
                                });
    //...

HC in GPU-STREAM, Run Kernel

template <class T>
void HCStream<T>::add()
{
    hc::array_view<T,1> view_a(this->d_a);
    hc::array_view<T,1> view_b(this->d_b);
    hc::array_view<T,1> view_c(this->d_c);

    hc::parallel_for_each(hc::extent<1>(array_size)
                                , [=](hc::index<1> i) [[hc]] {
                                  view_c[i] = view_a[i]+view_b[i];
                                });
}
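
Reading the results back to the host could look as follows; the read_arrays method name follows the other Stream<T> members and the C++AMP-style hc::copy(array, iterator) overload is assumed here.

// sketch: copy device arrays back into host vectors after the kernels ran
template <class T>
void HCStream<T>::read_arrays(std::vector<T>& a, std::vector<T>& b, std::vector<T>& c)
{
    hc::copy(d_a, a.begin()); // device -> host
    hc::copy(d_b, b.begin());
    hc::copy(d_c, c.begin());
}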

Let's compare the results

Comparing to HBM2 and GDDR5X

HC can be low-level too (WIP)

// tiled dot product; tiled_ex, domain_sz, TBSIZE and partial are defined in the enclosing scope
hc::parallel_for_each(tiled_ex,
                      [=,
                      &view_a,
                      &view_b,
                      &partial](const hc::tiled_index<1>& tidx) [[hc]] {

    auto gidx = tidx.global[0];
    T r = T{0}; // Assumes reduction op is addition.
    while (gidx < view_a.get_extent().size()) {
        r += view_a[gidx] * view_b[gidx]; //dot-product
        gidx += domain_sz;
    }

    tile_static T tileData[TBSIZE];
    tileData[tidx.local[0]] = r;

    tidx.barrier.wait_with_tile_static_memory_fence();

    for (auto h = TBSIZE / 2; h; h /= 2) {
        if (tidx.local[0] < h) {
            tileData[tidx.local[0]] += tileData[tidx.local[0] + h];
        }
        tidx.barrier.wait_with_tile_static_memory_fence();
    }

    if (tidx.global == tidx.tile_origin) partial[tidx.tile] = tileData[0];
});
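
The per-tile partial sums still need a final pass on the host; a minimal sketch, assuming partial is an hc::array<T,1> with one slot per tile (n_tiles) and <numeric> is included for std::accumulate:

// host-side finish of the dot product over the per-tile partial sums
std::vector<T> h_partial(n_tiles);
hc::copy(partial, h_partial.begin());                              // device -> host
T sum = std::accumulate(h_partial.begin(), h_partial.end(), T{0}); // final reduction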

Concurrency constructs

  • asynchronous operations (memory copies, kernel launches) return a completion future
std::vector<float> payload  (/*pick a number*/);
hc::array<float,1> d_payload(payload.size());
hc::completion_future when_done = hc::async_copy(payload.begin(),
    payload.end(),
    d_payload);
when_done.then(call_kernel_functor); //continuation function!
  • in my view, the hc::completion_future::then API is not production-ready yet:
template<typename functor >
void    then (const functor &func);//just a callback for now

Concurrency TS?

std::vector<hc::completion_future> streams(n);
for(hc::completion_future& when_done : streams){

    when_done = hc::async_copy(payload_begin_itr,
                               payload_end_itr,
                               d_payload_view);
    when_done.then(parallel_for_each(/*do magic*/))
             .then(hc::async_copy(d_payload_view,result_begin_itr));
}

hc::when_all(streams);

concurrency constructs are the glue code of host-device interactions!

(see when_all, co_await and friends)

Summary

What I learned so far

  • AMD's ROCm/ROCr stack is a very young and ambitious project
  • full open-source driver, runtime and compiler for dGPUs
  • the hc API is expressive and reduces boilerplate code (plans for C++17 on device)
  • tooling and documentation are not there yet for production (HPC) codes

 

Still an interesting approach to keep an eye on!

What I observe

  • CUDA/OpenCL, the community's workhorses, are low-level and enforce a lot of boilerplate
  • thrust, boost.compute, sycl, hc encapsulate this
    (sometimes at the expense of feature parity)
  • C++17 parallelism extensions and C++20 concurrency are good for multi-core, so why not for dGPUs?
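
To make the last point concrete: the C++17 parallel algorithms already express the STREAM triad for multi-core, and the same expression would be attractive to target a dGPU with (requires a standard library that ships <execution>):

// C++17 parallel triad on the host
#include <execution>
#include <algorithm>
#include <vector>

void triad(std::vector<float>& a, const std::vector<float>& b,
           const std::vector<float>& c, float scalar) {
  std::transform(std::execution::par_unseq,
                 b.begin(), b.end(), c.begin(), a.begin(),
                 [=](float bi, float ci) { return bi + scalar * ci; });
}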

My Hopes and Acks

Thank you for your attention!
(Thanks to and for their valuable feedback!)