Back link: Kiwi Home.
It is widely accepted that many problems in scientific computing can be vastly accelerated using either dedicated hardware or GPU execution resources. Custom silicon implementation using an ASIC gives at least an order of magnitude energy saving as an additional benefit. FPGA is less efficient than ASIC but requires orders of magnitude less investment in design time and non-recurring cost. However, an FPGA design may still take many hours to compile, which is inconvenient when the performance achieved after this investment turns out to be less than expected. Using high-level synthesis (HLS) tools such as Kiwi, LegUp or Catapult-C, it is relatively quick and easy to convert an algorithm from software to hardware form, but these tools can sometimes produce poor results, and discovering exactly why has often been very difficult. Owing to the lengthy compile times, it is tiresome to test small design changes, yet these are typically the only basis available to guide iterative improvement of the implementation.
Issues that commonly arise when addressing scientific computing with FPGA targets are:
The following single-threaded code has three variant algorithms and one major scaling parameter.
// primesya.cs - Sieve of Eratosthenes demo.
// Kiwi Scientific Acceleration
// $Id: primesya.cs,v 1.7 2011/06/08 17:45:10 djg11 Exp $
// (C) 2010 - DJ Greaves - University of Cambridge Computer Laboratory

using System;
using KiwiSystem;

/* Example runtimes on mono (no Kiwi) for 100K (intel i5-3337 @1.8GHz)
   A lot of prints here, so a bit slow, but shows the effect of varying evariant.
     evariant=0   real 0m10.501s   user 0m4.798s   sys 0m5.763s
     evariant=1   real 0m4.870s    user 0m2.288s   sys 0m2.612s
     evariant=2   real 0m3.502s    user 0m1.571s   sys 0m1.952s
*/

// There are (at least) three variations on this program that vary in efficiency but give the same result.
// They vary in their control-flow graphs.
// Performance predictor questions:
//   1. Can we extrapolate the performance within one variation to larger runs?
//   2. Can we estimate the performance of the optimised code from the non-optimised code?
//
// The extrapolation is complicated by the DRAM banks, because the smaller runs operate within one
// row of each bank, so there is no row writeback and precharge overhead. For the later cross-off
// stages this becomes a significant overhead owing to the wide stride between accesses.
//
// Adding a DRAM cache makes little difference owing to excessive churn, but could we automatically predict that?
//
class primesya
{
  // The following major parameter is edited by a sed script invoked by the Makefile before each run.
  // It needs to be a compile-time constant since Kiwic chooses what type of memory organisation to use based on its value.
  static int limit = 1000 * 100;

  static bool [] PA = new bool[limit]; // Placed in off-chip DRAM owing to its size.

  // This input port, vol, was added to make some input data volatile.
  [Kiwi.InputWordPort(31, 0)][Kiwi.OutputName("volume")] static uint vol;

  [Kiwi.OutputWordPort(31, 0)][Kiwi.OutputName("count")] static uint count = 0;
  static int count1 = 0;

  [Kiwi.OutputWordPort(31, 0)][Kiwi.OutputName("elimit")] static int elimit = 0; // The main scaling parameter (abscissa).

  // The variant is also edited by a sed script that runs an individual experiment.
  [Kiwi.OutputWordPort(31, 0)][Kiwi.OutputName("evariant")] static int evariant = 2; // The algorithmic variant - must hold when finish is asserted.

  [Kiwi.OutputWordPort(31, 0)][Kiwi.OutputName("edesign")] static int edesign = 4032; // A uid for this program.

  [Kiwi.OutputBitPort("finished")] static bool finished = false;

  [Kiwi.HardwareEntryPoint()]
  public static void Main()
  {
    bool kpp = true;
    elimit = limit;
    Kiwi.KppMark("START", "INITIALISE"); // Waypoint
    Console.WriteLine("Primes Up To " + limit);
    Kiwi.Pause();
    PA[0] = vol > 0; // Process some runtime input data on this thread - prevents Kiwic running the whole program at compile time.
    Kiwi.Pause();

    // Clear array
    count1 = 2;
    count = 0; // RESET VALUE FAILED AT ONE POINT: HENCE NEED THIS LINE
    for (int woz = 0; woz < limit; woz++)
    {
      PA[woz] = true;
      Console.WriteLine("Setting initial array flag to hold : addr={0} readback={1}", woz, PA[woz]); // Read back and print.
    }

    Kiwi.KppMark("wp2", "CROSSOFF"); // Waypoint
    int i, j;
    for (i = 2; i < limit; i++) // Can our predictor cope with the standard optimisations?
    {
      // Cross off the multiples - optimise by skipping where the base is already crossed off.
      if (evariant > 0)
      {
        bool pp = PA[i];
        Console.WriteLine(" tnow={2}: check back {0} = {1} ", i, pp, Kiwi.tnow);
        if (!pp) continue;
        count1 += 1;
      }
      // Can further optimise by commencing the cross-off at the factor squared.
      j = (evariant > 1) ? i*i : i;
      if (j >= limit)
      {
        Console.WriteLine("Skip out on square");
        break;
      }
      for (; j < limit; j += i)
      {
        Console.WriteLine("Cross off {0} {1} (count1={2})", i, j, count1);
        PA[j] = false;
      }
    }

    Kiwi.KppMark("wp3", "COUNTING"); // Waypoint
    Console.WriteLine("Now counting");
    // Count how many there were and store them consecutively in the output array.
    for (int w = 0; w < limit; w++)
    {
      if (PA[w]) { count += 1; }
      Console.WriteLine("Tally counting {0} {1} at {2}", w, count, Kiwi.tnow);
    }
    Console.WriteLine("There are {0} primes below the natural number {1}.", count, limit);
    Console.WriteLine("(count1 is {0}).", count1);
    Kiwi.Pause();
    finished = true;
    Kiwi.Pause();
    Kiwi.KppMark("FINISH"); // Waypoint
    Kiwi.Pause();
  }
} // eof
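For readers without a C# toolchain, the effect of the two cross-off optimisations can be sketched in plain Python. This is an illustrative model of the optimisations that distinguish the variants, not a transliteration of the Kiwi code, and the operation counts it reports are for this sketch only:

```python
def crossoff_work(limit, skip_crossed=True, start_at_square=True):
    """Toy sieve: count inner-loop cross-off operations under the two
    optimisations distinguished by the evariant parameter."""
    PA = [True] * limit
    work = 0
    for i in range(2, limit):
        if skip_crossed and not PA[i]:
            continue                      # evariant>0: skip crossed-off bases
        j = i * i if start_at_square else 2 * i
        if start_at_square and j >= limit:
            break                         # evariant=2: nothing left to cross off
        for j in range(j, limit, i):
            PA[j] = False
            work += 1
    return work, sum(PA[2:])              # (operations, primes found)
```

All variants find the same 25 primes below 100, but each optimisation strictly reduces the cross-off work, which is the behaviour the performance predictor must account for when comparing variants.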
We note that the above design invokes KppMark to designate 'waypoints', which are articulation points in the control-flow graph. Variations of the design can then be compared: where the control-flow graph between two waypoints is unchanged from when the performance data was captured, the performance predictor has confidence in the relative visit ratios between control-flow states.
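This confidence test can be sketched as follows. This is an illustrative model of the idea, not the predictor's actual code; the function names and the 5% tolerance are assumptions:

```python
def region_ratios(visits):
    """Relative visit ratios of the control states within one waypoint region."""
    total = sum(visits.values())
    return {pc: v / total for pc, v in visits.items()}

def calibration_reusable(old_run, new_run, tol=0.05):
    """Trust old per-region performance data only when the region has the same
    control states and essentially unchanged relative visit ratios."""
    if old_run.keys() != new_run.keys():
        return False   # control-flow graph changed between waypoints: no confidence
    old_r, new_r = region_ratios(old_run), region_ratios(new_run)
    return all(abs(old_r[pc] - new_r[pc]) <= tol for pc in old_r)
```

For example, a counting region measured as {7: 1, 8: 100, 9: 1700} on a small run and a ten-times larger run with proportionally scaled counts would compare as reusable, whereas a design change that removes one of the states would not.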
Run through the Kiwi compiler, the above design generates RTL output. Example output: 425 lines of Verilog RTL.
module SIEVE_DUT(
    output reg [639:0] KppWaypoint0,
    output reg [639:0] KppWaypoint1,
    input [31:0] volume,
    output reg [31:0] count,
    output reg signed [31:0] elimit,
    output reg signed [31:0] evariant,
    output reg signed [31:0] edesign,
    output reg finished,
    output reg hf1_dram0bank_opreq,
    input hf1_dram0bank_oprdy,
    input hf1_dram0bank_ack,
    output reg hf1_dram0bank_rwbar,
    output reg [255:0] hf1_dram0bank_wdata,
    output reg [21:0] hf1_dram0bank_addr,
    input [255:0] hf1_dram0bank_rdata,
    output reg [31:0] hf1_dram0bank_lanes,
    input clk,
    input reset);

The design connects to a single DRAM subsystem via the hf1_dram0bank_* nets. The DRAM is cached, and the synthesis expects each DRAM access to hit in the cache and be served in a single clock cycle.
The Kiwi compiler also writes an XML file describing the design, which is imported by the performance predictor (198 lines of XML).
Simulation runs write XML files providing data points.
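One of the questions posed in the source comments is whether performance within one variant can be extrapolated to larger runs. For phases whose visit counts grow linearly in the scaling parameter (the INITIALISE and COUNTING phases of the sieve), a few small-run data points suffice. A minimal least-squares sketch follows; it illustrates the extrapolation idea and is not the predictor's actual algorithm:

```python
def fit_linear(points):
    """Ordinary least squares for clocks ~= a*limit + b over (limit, clocks) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def predict_clocks(points, limit):
    """Extrapolate a linear phase to a larger run."""
    a, b = fit_linear(points)
    return a * limit + b
```

Note that the CROSSOFF phase is not linear in the limit (its total work grows roughly as limit times log log limit for a sieve), so per-phase models between waypoints are more useful than one global fit.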
Console output from the simulation (snipped at '...'):
Primes Up To 100
6000: pc= 17: new waypoint monitored START
Setting initial array flag to hold : addr= 99 readback=1
...
1533000: pc= 15: new waypoint monitored wp2
 tnow= 0: check back 2 = 1
...
3267000: pc= 8: new waypoint monitored wp3
...
Tally counting 97 27 at 0
Tally counting 98 27 at 0
Tally counting 99 27 at 0
There are 27 primes below the natural number 100.
Optimisation variant=2 (count1 is 7).
finished at 5068000 after 5064 clocks
pc visit count report kpp_hwm= 18
wp= 0, pc visit counts, 0, 5
wp= 1, pc visit counts, 13, 1411
wp= 1, pc visit counts, 14, 99
wp= 1, pc visit counts, 16, 1
wp= 1, pc visit counts, 17, 15
wp= 1, pc visit counts, 18, 1
wp= 2, pc visit counts, 1, 1376
wp= 2, pc visit counts, 2, 98
wp= 2, pc visit counts, 4, 4
wp= 2, pc visit counts, 10, 75
wp= 2, pc visit counts, 11, 170
wp= 2, pc visit counts, 12, 10
wp= 2, pc visit counts, 15, 1
wp= 3, pc visit counts, 7, 1
wp= 3, pc visit counts, 8, 100
wp= 3, pc visit counts, 9, 1700
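A useful sanity check on this report: if every control-state visit accounts for one clock cycle, the per-pc visit counts should sum to approximately the reported clock count, with the small discrepancy attributable to states outside the monitored waypoint regions:

```python
# Per-pc visit counts transcribed from the report above, grouped by waypoint region.
visits = {
    0: [5],
    1: [1411, 99, 1, 15, 1],
    2: [1376, 98, 4, 75, 170, 10, 1],
    3: [1, 100, 1700],
}

total = sum(sum(v) for v in visits.values())
print(total)   # 5067, against the reported 5064 clocks
```

Assuming a clock period of roughly 1000 simulation time units (5068000 / 5064 clocks), the cumulative region totals also line up with the waypoint timestamps: 5 + 1527 = 1532 clocks against wp2 monitored at 1533000, and 3266 against wp3 at 3267000.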
The three phases of the program, separated by the two interior waypoints, are clear in the following VCD dump of a very small run that finds the primes below 100. (In this run the system was forced, via command-line flags, to use DRAM for the bool array despite its very small size, and no cache was interposed, meaning that the design stalls for 11 or so clock cycles on each array operation even though it was scheduled to expect only one.) In real systems, the DRAM runs six times faster than the FPGA logic, reducing the 11 to about 2, and a cache is present that would suffer only compulsory misses for such a small array. Note that 'woz' is V_0, and i and j are V_1 and V_2, etc.
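The row-overhead effect mentioned in the code comments (small runs staying within one DRAM row, wide-stride cross-off stages paying precharge and activate on nearly every access) can be illustrated with a toy open-row DRAM model. The parameters below are assumptions chosen so that a row miss costs the 11 cycles quoted above; real controllers and the Kiwi DRAM subsystem differ:

```python
ROW_WORDS = 1024               # addresses per DRAM row (assumed)
T_HIT = 2                      # cycles for a column access in the open row (assumed)
T_MISS_EXTRA = 9               # additional cycles for precharge + activate (assumed)

def dram_cycles(addresses):
    """Cost of an address stream under an open-row policy: a row hit costs
    T_HIT; a row miss additionally pays precharge + activate."""
    open_row, cycles = None, 0
    for a in addresses:
        row = a // ROW_WORDS
        cycles += T_HIT if row == open_row else T_HIT + T_MISS_EXTRA
        open_row = row
    return cycles
```

A unit-stride scan (the initialise and counting phases, or crossing off multiples of 2) pays the 11-cycle penalty only once per row, whereas a stride wider than a row (the later cross-off stages) pays it on every access. This is one reason a small run that fits within one row of each bank extrapolates poorly to larger limits.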
Example simulation run profile: profile.xml.
The general flow of the predictor is illustrated in this figure.
... under construction
(C) 2015 DJ Greaves. Back link: Kiwi Home.