The Performance Addict's Toolbox

Thomas Röhl
(HPC group of University Erlangen-Nuremberg)
Thomas.Roehl@fau.de

Peter Steinbach
(Scionics Computer Innovation GmbH)
steinbach@scionics.de

Mar 8, 2018, para//el2018

Real-life performance optimisation is never

as simple as this

more like this

by Giulia Zoccarato, DensityDesign Research Lab, CC-SA 4.0

Agenda

  1. Motivation
  2. Who-are-we
  3. Performance outside-in
  4. Performance inside-out
  5. Benchmarks and how to create them

Who-are-we

Who am I?

HPC group at Regional Computing Center Erlangen
Prof. Dr. Gerhard Wellein

  • support users on our HPC infrastructure
  • software and research projects related to
    node-level performance engineering
  • developer of the Execution-Cache-Memory Model (refinement of the Roofline model)
  • tool development

Thomas Röhl

  • topics: performance engineering, node-level optimisation
  • main developer of LIKWID, a tool suite for performance oriented programmers

The university

Main building of University Erlangen-Nuremberg

www.fau.de

Foto by Selby, CC-SA 3.0

Friedrich-Alexander University
Erlangen-Nuremberg

  • ≈ 40000 students
  • ≈ 250 majors
  • Top 3 @ Bavaria, Top 10 @ Germany
Location of Erlangen

Disclaimer

Original talk by Peter Steinbach
(Scientific Software Engineer,
Scionics Computer Innovation GmbH, Dresden)


Scionics provides data-driven consulting in:
  • Image Analysis
  • Code Optimisation (HPC, Multi-Core, GPU)
  • Large Data Life Cycle
  • Bioinformatics

Before I begin

All of my slides assume that the code produces correct results!

Nobody wants fast code that is wrong!

Performance Outside-In

One day as a Performance Engineer

Alan O'Rourke, Too Busy To Improve - Performance Management - Square Wheels, CC BY 2.0

Once in a while

From: doe@theinstitute.de
Subject: Cluster is slow
Date: Fri, 20 Oct 2017 12:03:21 +0200
To: hpcsupport@theinstitute.de

Hi,

what is going on with the cluster? My application is running
slow since yesterday.
Could you have a look at it please?

Thanks,
John

Challenge: Finding the performance regression without looking at the code

High Level Overview

htop, free et al

add hardware counter data: PCM

Reference Numbers

$ dd if=/dev/zero of=/tmp/just_zeros bs=1G count=2
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 2.94478 s, 729 MB/s

$ dd if=/dev/zero of=/dev/shm/2gb.zeros bs=1G count=2
2+0 records in
2+0 records out
2147483648 bytes (2.1 GB) copied, 1.14782 s, 1.9 GB/s

 

What can your hardware typically do?

dd, ior, memhog, stream, LIKWID, ...


Profile with perf

$ perf record -g ./my-slow-binary
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.023 MB perf.data (75 samples) ]
$ perf report --stdio
no symbols found in ./my-slow-binary, maybe install a debug package?
# ...
# Total Lost Samples: 0
#
# Samples: 75  of event 'cycles:u'
# Event count (approx.): 1839654
#
# Children      Self  Command  Shared Object      Symbol           
# ........  ........  .......  .................  ................
#
    20.18%    20.18%  dd       [kernel.kallsyms]  [k] page_fault
            |          
             --19.77%--0
                       _int_realloc
                       page_fault
  • lightweight sample based profiling
  • per task, per CPU and per-workload counters
  • sample CPU performance counters, tracepoints or system probes
  • on Windows: xperf/UIforETW

perf: what is a callstack?


perf: sampling based profiling


  • for every sampling event:
    • record call stack
    • query hardware counters, e.g. cpu-cycles
  • sampling need not be perfectly accurate

perf Reloaded with FlameGraphs

$ perf record -g ./my-slow-binary
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.023 MB perf.data (75 samples) ]
$ perf script > out.perf
$ ./stackcollapse-perf.pl out.perf > out.folded
$ ./flamegraph.pl out.folded > perf_samples.svg
  • visualisation technique conceived by Brendan Gregg (Netflix)
  • seamless integration into perf, dtrace, systemtap, XCode Instruments, Lightweight Java Profiler, Microsoft Visual Studio profiles, ...
  • based on collected counter samples and the stacktrace they were collected in

Ethereum Mining as FlameGraph


  • (x axis) stack frames sorted alphabetically; a frame's width is proportional to its number of samples
  • (y axis) depth of the call stack (stacktrace level)

HPC user's slow application


Bottom Line

Taking a balloon to get an overview of performance bottlenecks is possible.

Performance Inside-Out

High Diversity of Tools!

valgrind + kcachegrind

Profile from Peter Gottschling's example on vector unrolling.

Simple Graphical output, perftools

Profile from Peter Gottschling's example on vector unrolling.

Using flamegraphs, hotspot

xray

$ CXX=clang++ make

$ XRAY_OPTIONS="patch_premain=true xray_mode=xray-basic verbosity=1" \
  ./vector_unroll_example
==31936==XRay: Log file in 'xray-log.vector_unroll_example.ju4PNk'
Compute time native loop is 0.159 micros.
u[0] is 15
#...

$ llvm-xray account xray-log.vector_unroll_example.ju4PNk -instr_map=./vector_unroll_example
Functions with latencies: 5
 funcid      count [      min,       med,       90p,  ...]       sum  function    
      1          1 [ 7.338530,  7.338530,  7.338530,  ...]  7.338530  <invalid>:0:0: main
      2       1275 [ 0.000005,  0.000011,  0.000012,  ...]  0.013064  <invalid>:0:0: void my_axpy<2u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)

Integration in other tools started: Vampir

Proprietary tools

Found a hot spot!

Danger zone of mental models

Inspect Assembly?

perf for hardware exploration?

$ perf list

List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  #...
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  #...
  • perf event list depends on kernel version
  • hardware counters are not portable (specification change by vendors)
  • alternative: ocperf

Test hypothesis with likwid

  • performance monitoring and benchmarking suite
  • exploration through hardware monitoring
  • Linux only, GPLv3
  • main focus on x86; ARM and POWER support upcoming
  • profiling hardware counters
    (consistent eventset names for portability)
  • supports almost all provided counters and events
  • instrumentation API for C, C++, Java and Python

use case: Index Lists

#include <vector>
#include "omp.h"

struct item{
    std::vector<float> position, momentum;
    std::vector<int>   nearest_neighbors;
};

int main(int argc, char** argv){
    std::vector<item> world = generate(argc*10e6);
    
    for(int& time_step : timelapse){
        update(world);
        
        #pragma omp parallel for
        for(item& it : world){
            for(int& index : it.nearest_neighbors){
                auto distance = calculate(it, world[index]);
                if(distance > threshold)
                    it.nearest_neighbors.remove(index); // pseudocode: drop this neighbor
            }}}
    //..
}
  • hypotheses:

    • large 'unpredictable' jumps in memory access diminish cache bandwidth

    • false sharing forces cache line reloads, as read-only and writable items may share the same cache line

Let's measure!

use case: Through Likwid

Use Case

$ export OMP_NUM_THREADS=1
$ path/to/likwid-perfctr -C 0 -g FALSE_SHARE ./my_app
+----------------------------------+--------------+
|              Metric              |    Core 0    |
+----------------------------------+--------------+
|  Local LLC false sharing [MByte] |       0.0008 |
|   Local LLC false sharing rate   | 5.608215e-10 |
+----------------------------------+--------------+

$ export OMP_NUM_THREADS=4
$ path/to/likwid-perfctr -C 0-4 -g FALSE_SHARE ./my_app
+---------------------------------------+--------------|
|                 Metric                |      Sum     |
+---------------------------------------+--------------|
|  Local LLC false sharing [MByte] STAT |    2973.7637 |
|   Local LLC false sharing rate STAT   |       0.0081 |
+---------------------------------------+--------------|

stream benchmark as reference

$ export OMP_NUM_THREADS=1
$ path/to/likwid-perfctr -C 0 -g FALSE_SHARE ./stream
+----------------------------------+--------------+
|              Metric              |    Core 0    |
+----------------------------------+--------------+
|  Local LLC false sharing [MByte] |       0.0006 |
|   Local LLC false sharing rate   | 6.057282e-10 |
+----------------------------------+--------------+

$ export OMP_NUM_THREADS=4
$ path/to/likwid-perfctr -C 0-4 -g FALSE_SHARE ./stream
+---------------------------------------+--------------|
|                 Metric                |      Sum     |
+---------------------------------------+--------------|
|  Local LLC false sharing [MByte] STAT |       0.1067 |
|   Local LLC false sharing rate STAT   | 4.080027e-07 |
+---------------------------------------+--------------|

Warning: some events provide only qualitative but not quantitative results

Interested in just a code region?

LIKWID MarkerAPI (alternative: PAPI)

Instrumented version of stream's triad

#include "omp.h"
#include "likwid.h"

#define ITER 100
#define SIZE 40000000
int main(int argc, char* argv[])
{
    LIKWID_MARKER_INIT;
    //allocate and initialize a, b, c and d
#pragma omp parallel
{
    for (int k = 0; k < ITER; k++)
    {
        LIKWID_MARKER_START("triad");
#pragma simd
#pragma omp for
        for (int j = 0; j < SIZE; j++)
        {
            a[j] = b[j] + c[j] * d[j];
        }   
        LIKWID_MARKER_STOP("triad");
    }
}
    // check result in a and free a, b, c and d
    LIKWID_MARKER_CLOSE;
    return 0;
}

  • Add MarkerAPI calls around code regions
  • Compile with -DLIKWID_PERFMON and link against the LIKWID library
  • Specify the event group and -m on the perfctr command line
  • Results are grouped by region string
$ likwid-perfctr -C 0 -g L3 -m ./a.out
[...]
Region triad, Group 1: L3
+-------------------+----------+
|    Region Info    |  Core 0  |
+-------------------+----------+
| RDTSC Runtime [s] | 0.846960 |
|     call count    |      100 |
+-------------------+----------+
[...raw counter values...]
+-------------------------------+------------+
|             Metric            |   Core 0   |
+-------------------------------+------------+
|      Runtime (RDTSC) [s]      |     0.8470 |
|      Runtime unhalted [s]     |     0.8451 |
|  L3 load bandwidth [MBytes/s] | 15121.3486 |
| L3 evict bandwidth [MBytes/s] |  3298.9553 |
|    L3 bandwidth [MBytes/s]    | 18420.3040 |
+-------------------------------+------------+
[...more results...]

Bottom Line

  • excellent tools available to find hot spots
  • once "found", talk to someone
    (rubber duck or colleague(s))
  • create falsifiable hypotheses
  • MEASURE!

Benchmarks and how to create them

First, take the time

Compteur de Tierces by Louis Moinet, CC-BY 3.0

chrono is your friend

#include <chrono>
#include <iostream>
#include "production_code.hpp"
#include "new_ideas.hpp"

int main(int argc, char** argv){

    auto start = std::chrono::high_resolution_clock::now();
    auto result = production_code::algorithm();
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_p = (end - start);
    
    start = std::chrono::high_resolution_clock::now();
    auto new_result = new_ideas::algorithm();
    end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_i = (end - start);

    std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count() 
              << std::endl;
              
    return 0;
}

... check for correctness

#include <chrono>
#include <iostream>
#include "production_code.hpp"
#include "new_ideas.hpp"

int main(int argc, char** argv){

    auto start = std::chrono::high_resolution_clock::now();
    auto result = production_code::algorithm();
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_p = (end - start);
    
    start = std::chrono::high_resolution_clock::now();
    auto new_result = new_ideas::algorithm();
    end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_i = (end - start);

    if(result == new_result)
        std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count() 
                  << std::endl;
    else
        std::cout << "Never mind!" << std::endl;
}

noisy lab under your fingers

#include ...

int main(int argc, char** argv){

    auto result = 0;
    auto new_result = 0;
    
    auto start = std::chrono::high_resolution_clock::now();

    for(int i = 0; i < n_repetitions; ++i)
        result = production_code::algorithm();
   
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_p = (end - start);
    
    start = std::chrono::high_resolution_clock::now();
    
    for(int i = 0; i < n_repetitions; ++i)
        new_result = new_ideas::algorithm();
        
    end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> t_i = (end - start);

    if(result == new_result)
        std::cout << "we achieved a speed-up of " << t_p.count()/t_i.count() 
                  << std::endl;
    else
        std::cout << "Never mind!" << std::endl;
}

Please take notes

#include ...

using duration_t = std::chrono::duration<double>;

int main(int argc, char** argv){
    //..
    auto start = std::chrono::high_resolution_clock::now();
    auto end = start;
    std::vector<duration_t> prod_timings(n_repetitions), idea_timings(n_repetitions);

    for(int i = 0; i < n_repetitions; ++i) {
        start = std::chrono::high_resolution_clock::now();
        result = production_code::algorithm();
        prod_timings[i] = std::chrono::high_resolution_clock::now() - start;
    }
   
    // similar with new_ideas::algorithm() and idea_timings

    if(result == new_result) {
        std::ofstream ofile("results.csv");
        ofile << "#iteration,production,new_idea" << std::endl;
        for(int i = 0; i < n_repetitions; ++i) {
            ofile << i << "," << prod_timings[i].count() << "," << idea_timings[i].count() << std::endl;
        }
        ofile.close();
    }
    else
        std::cout << "Never mind!" << std::endl;
}

Let's take a toy example

... what if

Ensemble variances tell a story!

google/benchmark

 

  • written in C++11/C++03
  • support of multi-threaded applications
  • powerful CLI
  • easy setup of (templated) test cases
  • flexible argument control
  • custom counters/timers

benchmark: simple approach

#include <benchmark/benchmark.h>
#include <vector>

template <typename T>
double sum(const T* _data, std::size_t _len){

    double value = 0;
    for(std::size_t i = 0; i < _len; ++i)
        value += _data[i];

    return value;
}

template <typename container_type>
double sum(const container_type& _data){

    typedef typename container_type::value_type value_t;

    double value = 0;
    for(const value_t& el : _data)
        value += el;

    return value;
}
static void BM_integer_index(benchmark::State& state) {

    const std::size_t len = 1 << 20;
    std::vector<int> values(len, 0);
    double result = 0;

    for (auto _ : state){
        benchmark::DoNotOptimize(result = sum(values.data(), len));
    }
}
// Register the function as a benchmark
BENCHMARK(BM_integer_index);

static void BM_range_based(benchmark::State& state) {

    const std::size_t len = 1 << 20;
    std::vector<int> values(len, 0);
    double result = 0;

    for (auto _ : state){
        benchmark::DoNotOptimize(result = sum(values));
    }

}
BENCHMARK(BM_range_based);

BENCHMARK_MAIN();

Analysis by Matt Godbolt at CppCon2017

benchmark: simple approach output

Run on (4 X 3600 MHz CPU s)
2017-11-08 10:24:43
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------
Benchmark                 Time           CPU Iterations
--------------------------------------------------------
BM_integer_index     922920 ns     915531 ns        764
BM_range_based       937344 ns     929681 ns        768

CPU scaling can be fixed with LIKWID or cpupower

benchmark: advanced

template <typename T>
static void BM_integer_index(benchmark::State& state) {

    const std::size_t len = state.range(0);
    std::vector<T> values(len, 0);
    double result = 0;

    for (auto _ : state){
        benchmark::DoNotOptimize(result = sum(values.data(), len));
    }
}

BENCHMARK_TEMPLATE(BM_integer_index,int)
->Arg(1<<20)
->Arg(128<<20);
BENCHMARK_TEMPLATE(BM_integer_index,float)
->Arg(1<<20)
->Arg(128<<20);

BENCHMARK_MAIN();
  • multiple arguments are also supported

    BENCHMARK_TEMPLATE(BM_integer_index,int)
    //42 is the initial value of the reduced sum
    ->Args({64, 42})
    //..
    ;
  • templated benchmark cases are supported
  • workflow:

  1. build benchmark
    (different working set sizes, types)
  2. compile with varying flags
  3. run & inspect
  4. render report with rmarkdown

benchmark: advanced output

Run on (4 X 3600 MHz CPU s)
2017-11-08 10:25:27
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
BM_integer_index<int>/1048576         924382 ns     916717 ns        761
BM_integer_index<int>/134217728    123700290 ns  122614766 ns          6
BM_integer_index<float>/1048576       914593 ns     909174 ns        763
BM_integer_index<float>/134217728  122954355 ns  122219776 ns          6
BM_range_based<int>/1048576           912475 ns     907277 ns        761
BM_range_based<int>/134217728      122509880 ns  121832332 ns          6
BM_range_based<float>/1048576         917501 ns     912365 ns        735
BM_range_based<float>/134217728    122908318 ns  122219268 ns          6

Where to stop?

  • clarify your goal upfront
  • where is your limit?
    • hardware
    • APIs and libraries
    • dependencies
    • compiler
    • OS

roofline

  • acknowledge boundaries of algorithm
  • differentiate "work" versus "traffic"
  • simplistic: bottleneck is either work or traffic
  • clear indication where optimisations go

ecm (Execution-Cache-Memory)

ECM Model: {TOL||TnOL|TL1L2|TL2L3|TL3MEM}

8th Order Stencil
3D long-range SP stencil

Model:{57||54|40|24|50}


  • Roofline does not explain saturation effects
  • ECM puts machine characteristics into the model
  • some operations cannot be overlapped!
  • takes cache traffic into account

performance model with kerncraft

$ cat /tmp/3d.c
double U[M][N][N];
double V[M][N][N];
double ROC[M][N][N];
double c0, c1, c2, c3, c4, lap;

for(int k=4; k < M-4; k++) { for(int j = 4; j < N-4; j++) { for(int i = 4; i < N-4; i++) {
            lap = c0 * V[k][j][i]
                + c1 * ( V[ k ][ j ][i+1] + V[ k ][ j ][i-1])
                [...10 lines...]
                + c4 * ( V[k+4][ j ][ i ] + V[k-4][ j ][ i ]);
            U[k][j][i] = 2.f * V[k][j][i] - U[k][j][i]
                       + ROC[k][j][i] * lap; }}}
$ kerncraft -p Roofline -m SandyBridgeEP_E5-2680.yml \
/tmp/3d.c -D N 1000 -D M 1000
====================== kerncraft =======================
/tmp/3d.c               -m SandyBridgeEP_E5-2680.yml
-D N 1000 -D M 1000
------------------------------ Roofline ----------------
Cache or mem bound with 1 core(s)
7.43 GFLOP/s due to MEM transfer bottleneck
Arithmetic Intensity: 0.43 FLOP/B
$ kerncraft -p ECM -m SandyBridgeEP_E5-2680.yml \
/tmp/3d.c -D N 1000 -D M 1000
=========================== kerncraft ==================
/tmp/3d.c               -m SandyBridgeEP_E5-2680.yml
-D N 1000 -D M 1000
------------------------------------- ECM --------------
{ 57.0 || 54.0 | 40.0 | 24.0 | 50.3 } cy/CL
{ 56.95 \ 94.0 \ 118.0 \ 168.3 } cy/CL
saturating at 3.3 cores
  • Loop Kernel Analysis and Performance Modeling Toolkit
  • static code analysis to infer data reuse and cache requirements
  • can infer in-core and memory bottlenecks
  • apply performance models to benchmarked data
  • currently only Intel (IACA), OSACA in development

Bottom Line

  • your requirements are your guiding light in the dungeon
  • reproducible ensemble based benchmarks are key

Summary

Take aways

Take a balloon:

Use Tools to check the lay of the land.

Falsify the rubber duck:

Profile and check your hypothesis.

Survive the dungeon

With automated ensemble based benchmarks.

Backup

Textual output, gprof

$ g++ -pg -O2 -std=c++11 vector_unroll_example.cpp
$ ./a.out
$ gprof ./a.out gmon.out > analysis.txt
$ head analysis.txt
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 26.71      1.02     1.02                             void my_axpy<6u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
 26.71      2.05     1.02                             void my_axpy<2u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
 23.83      2.96     0.91                             void my_axpy<8u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
 23.04      3.84     0.88                             void my_axpy<4u, vector<float>, vector<float>, vector<float> >(vector<float>&, vector<float> const&, vector<float> const&)
  0.00      3.84     0.00        1     0.00     0.00  _GLOBAL__sub_I_main

Profile from Peter Gottschling's example on vector unrolling.

Dig deeper? LIKWID micro-benchmarking

DAXPY optimized for SSE (simplified)

STREAMS 2      // named STR0 and STR1
TYPE DOUBLE
FLOPS 2                   // 2 flops per iter
BYTES 24                  // use 24 bytes per iter
DESC y := alpha * x + y
INSTR_LOOP 4              // 4 instructions per iter
UOPS 5                    // 5 micro-ops per iter
movaps FPR7, [rip+SCALAR] // random value for register
LOOP 16                   // inc GPR1 by 16
movaps   FPR1, [STR0 + GPR1*8]
mulpd    FPR1, FPR7
addpd    FPR1, [STR1 + GPR1*8]
movaps   [STR1 + GPR1*8], FPR1
  • variety of assembly kernels
  • kernels written in abstracted assembly
  • support for SSE, AVX, AVX512, FMA, NT-Stores
  • manages thread and data placement
  • hardware counter through likwid-perfctr
$ likwid-bench -t daxpy_sse -w N:100MB:1
---------------------------------------------------
LIKWID MICRO BENCHMARK
Test: daxpy_sse
---------------------------------------------------
Using 1 work groups
Using 1 threads
---------------------------------------------------
Cycles:     5278051829
Time:     1.316899e+00 sec
[...]
MFlops/s:   1214.98
MByte/s:    14579.71
[...]
Instructions:   1900000017
UOPs:     2600000000
-----------------------------------------------------
$ likwid-perfctr -C 0 -g FLOPS_DP -m \
likwid-bench -t daxpy_sse -w N:100MB:1
---------------------------------------------------
bench:
MFlops/s:   1213.88
Instructions:   1900000017
---------------------------------------------------
perfctr:
DP MFLOP/s    1213.8804
INSTR_RETIRED_ANY 1900008000
---------------------------------------------------

Life as a reviewer

Standardized, easy-to-parse output!