#### Haskell to Hardware and Other Dreams

Stephen A. Edwards

Richard Townsend Lianne Lairmore Martha A. Kim Kuangya Zhai

Columbia University

Synchron, Bamberg, Germany, December 7, 2016









#### Moore's Law



"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year."

Closer to every 24 months

Gordon Moore, Cramming More Components onto Integrated Circuits, Electronics, 38(8) April 19, 1965.

## Four Decades of Microprocessors Later...



Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten New plot and data collected for 2010-2015 by K. Rupp

Source: https://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data/

## What Happened in 2005?



2000 1 core Transistors: 42 M



Core 2 Duo 2006 2 cores 291 M



Xeon E5 2012 8 cores 2.3 G



#### Heat Flux in IBM Mainframes: A Familiar Trend



Schmidt. Liquid Cooling is Back. Electronics Cooling. August 2005.

## Liquid Cooled Apple Power Mac G5





2004 CMOS 1.2 kW

## Dally: Calculation Cheap; Communication Costly



"Chips are power limited and most power is spent moving data

Performance = Parallelism

Efficiency = Locality

Bill Dally's 2009 DAC Keynote, The End of Denial Architecture

## Parallelism for Performance; Locality for Efficiency



Dally: "Single-thread processors are in denial about these two facts"

We need different programming paradigms and different architectures on which to run them.



**Related Work** 

## Xilinx's Vivado (Was xPilot, AutoESL)

- ◆ SSDM (System-level Synthesis Data Model)
  - Hierarchical netlist of concurrent processes and communication channels



- Each leaf process contains a sequential program which is represented by an extended LLVM IR with hardware-specific semantics
  - Port / IO interfaces, bit-vector manipulations, cycle-level notations

SystemC input; classical high-level synthesis for processes

Jason Cong et al. ISARS 2005

### Taylor and Swanson's Conservation Cores



Custom datapaths, controllers for loop kernels; uses existing memory hierarchy

Swanson, Taylor, et al. Conservation Cores. ASPLOS 2010.

## Bacon et al.'s Liquid Metal



Fig. 2. Block level diagram of DES and Lime code snippet

JITting Lime (Java-like, side-effect-free, streaming) to FPGAs Huang, Hormati, Bacon, and Rabbah, *Liquid Metal*, ECOOP 2008.

### Goldstein et al.'s Phoenix



Figure 3: C program and its representation comprising three hyperblocks; each hyperblock is shown as a numbered rectangle. The dotted lines represent predicate values. (This figure omits the token edges used for memory synchronization.)



Figure 8: Memory access network and implementation of the value and token forwarding network. The LOAD produces a data value consumed by the oval node. The STORE node may depend on the load (i.e., we have a token edge between the LOAD and the STORE, shown as a dashed line). The token travels to the root of the tree, which is a load-store queue (LSQ).

#### C to asynchronous logic, monolithic memory

Budiu, Venkataramani, Chelcea and Goldstein, Spatial Computation, ASPLOS 2004.

## Ghica et al.'s Geometry of Synthesis



Algol-like imperative language to handshake circuits Ghica, Smith, and Singh. *Geometry of Synthesis IV*, ICFP 2011

## **Greaves and Singh's Kiwi**

```
public static void SendDeviceID()
{ int deviceID = 0 \times 76:
  for (int i = 7; i > 0; i--)
  \{ scl = false : \}
    sda\_out = (deviceID \& 64) != 0;
    Kiwi.Pause(); // Set it i-th bit of the device ID
    scl = true; Kiwi.Pause(); // Pulse SCL
    scl = false; deviceID = deviceID << 1;
    Kiwi.Pause();
```

C# with a concurrency library to FPGAs

Greaves and Singh. Kiwi, FCCM 2008

## Arvind, Hoe, et al.'s Bluespec

```
GCD Mod Rule
      Gcd(a, b) if (a>b) \land (b \neq 0) \rightarrow Gcd(a-b, b)
GCD Flip Rule
      Gcd(a, b) if a < b \rightarrow Gcd(b, a)
             \pi_{Flip} + \pi_{Mod}
                                \delta_{Fl\underline{ip,b}}
                                                                            \pi_{Flip}
                     a
                                           Mod,a
                                                                            \boldsymbol{\pi}_{Mod}
                                \overline{\delta}_{Flip,a}
```

Figure 1.3 Circuit for computing Gcd(a, b) from Example 1.

Guarded commands and functions to synchronous logic Hoe and Arvind, *Term Rewriting*, VLSI 1999

 $\pi_{Flip}$ 

#### Sheeran et al.'s Lava

Figure 10: A butterfly stage of size 8 expressed with riffling

#### Functional specifications of regular structures

Bjesse, Claessen, Sheeran, and Singh. Lava, ICFP 1998

## Kuper et al.'s $C\lambda$ aSH

$$\begin{array}{l} \mathit{fir}\;(\mathit{State}\;(\mathit{xs},\mathit{hs}))\;x = \\ (\mathit{State}\;(\mathit{shiftInto}\;x\;\mathit{xs},\mathit{hs}),(\mathit{x} \rhd \mathit{xs}) \bullet \mathit{hs}) \end{array}$$



Fig. 6. 4-taps FIR Filter

More operational Haskell specifications of regular structures Baaij, Kooijman, Kuper, Boeijink, and Gerards. Cλash, DSD 2010

My Crusade

## Deterministic Concurrency: A Fool's Errand?

### What Models of Computation Provide Determinstic Concurrency?

| Synchrony           | The Columbia Esterel Compiler 2001–2006 |
|---------------------|-----------------------------------------|
|                     | SHIM                                    |
| Kahn Networks       | The SHIM Model/Language<br>2006–2010    |
| The Lambda Calculus | This Project<br>2010–                   |

















## Why Functional?

- Referential transparency simplifies formal reasoning about programs
- Inherently concurrent and deterministic (Thank Church and Rosser)
- Immutable data makes it vastly easier to reason about memory in the presence of concurrency





## To Implement Real Algorithms, We Need

Structured, recursive data types





Recursion to handle recursive data types



Memories



**Memory Hierarchy** 



Structured, Recursive Data

**Types** 

## **Algebraic Data Types**

In modern functional languages: ML, OCaml, Haskell, ...

An algebraic type is a sum of product types

Basic example: List of integers

```
data IntList = Nil
| Cons Int IntList
```

Constructing a list:

```
Cons 42 (Cons 17 (Cons 2 (Cons 1 Nil)))
```

Summing the elements of a list:

```
sum li = case li of

Nil \rightarrow 0

Cons x xs \rightarrow x + sum xs
```

## An Interpreter in One Slide

Abstract syntax tree data type:

```
data Expr = Lit Int
| Plus Expr Expr
| Minus Expr Expr
| Times Expr Expr
```

Recursive evaluation function:

```
eval e = case e of

Lit x \rightarrow x

Plus e1 e2 \rightarrow eval e1 + eval e2

Minus e1 e2 \rightarrow eval e1 - eval e2

Times e1 e2 \rightarrow eval e1 * eval e2
```

```
eval (Plus (Lit 42) (Times (Lit 2) (Lit 50)))
```

## Algebraic Datatypes in Hardware: Lists

```
data IntList = Cons Int IntList
| Nil
```

| 48 | 33      | 32 1 0 |      |
|----|---------|--------|------|
|    | pointer | int 1  | Cons |
|    |         | 0      | Nil  |

## Recursion to Handle

**Recursive Data Types** 

#### What Do We Do With Recursion?

Compile it into tail recursion with explicit stacks

[Zhai et al., CODES+ISSS 2015]

#### Definitional Interpreters for Higher-Order Programming Languages

John C. Reynolds, Syracuse University

[Proceedings of the ACM Annual Conference, 1972]

Really clever idea:

Sophisticated language ideas such as recursion and higher-order functions can be implemented using simpler mechanisms (e.g., tail recursion) by rewriting.

## Removing Recursion: The Fib Example

fib n = case n of

1 
$$\rightarrow$$
 1

2  $\rightarrow$  1

n  $\rightarrow$  fib (n-1) + fib (n-2)

## Transform to Continuation-Passing Style

## Name Lambda Expressions (Lambda Lifting)

```
fibk n k = case n of
                  → k 1
               2 \rightarrow k1
               n \rightarrow fibk (n-1) (k1 n k)
                        fibk (n-2) (k2 n1 k)
k1 \quad n \quad k \quad n1 =
k2 n1 k n2 =
                      k (n1 + n2)
k0
         x =
                        Χ
fib
                        fibk n k0
      n
```

## Represent Continuations with a Type

```
data Cont = K0 | K1 Int Cont | K2 Int Cont
fibk n k = case (n,k) of
                 (1, k) \rightarrow kk k 1
                 (2, k) \rightarrow kk k 1
                 (n, k) \rightarrow fibk (n-1) (K1 n k)
kk k a = case(k, a) of
       ((K1 n k), n1) \rightarrow fibk (n-2) (K2 n1 k)
       ((K2 n1 k), n2) \rightarrow kk k (n1 + n2)
       (K0, x) \rightarrow x
                           fibk n K0
fib n
```

## **Merge Functions**

```
data Cont = K0 | K1 Int Cont | K2 Int Cont
data Call = Fibk Int Cont | KK Cont Int
fibk z = case z of
    (Fibk 1 k) \rightarrow fibk (KK k 1)
    (Fibk 2 k) \rightarrow fibk (KK k 1)
    (Fibk n k) \rightarrow fibk (Fibk (n-1) (K1 n k))
    (KK (K1 n k) n1) \rightarrow fibk (Fibk (n-2) (K2 n1 k))
    (KK (K2 n1 k) n2) \rightarrow fibk (KK k (n1 + n2))
    (KK K0 x) \rightarrow x
fib n
                          fibk (Fibk n K0)
```

## **Add Explicit Memory Operations**

```
read :: CRef → Cont
write :: Cont → CRef
data Cont = K0 | K1 Int CRef | K2 Int CRef
data Call = Fibk Int CRef | KK Cont Int
fibk z = case z of
    (Fibk 1 k) \rightarrow fibk (KK (read k) 1)
    (Fibk 2 k) \rightarrow fibk (KK (read k) 1)
    (Fibk n k) \rightarrow fibk (Fibk (n-1) (write (K1 n k)))
    (KK (K1 n k) n1) \rightarrow fibk (Fibk (n-2) (write (K2 n1 k)))
    (KK (K2 n1 k) n2) \rightarrow fibk (KK (read k) (n1 + n2))
    (KK K0 x) \rightarrow x
                         fibk (Fibk n (write K0))1
fib n
```

## Simplified Functional to Dataflow

Sum a list using an accumulator and tail-recursion



Sum a list using an accumulator and tail-recursion



Sum a list using an accumulator and tail-recursion



Sum a list using an accumulator and tail-recursion



Sum a list using an accumulator and tail-recursion



Sum a list using an accumulator and tail-recursion



Sum a list using an accumulator and tail-recursion



Sum a list using an accumulator and tail-recursion



Sum a list using an accumulator and tail-recursion



Sum a list using an accumulator and tail-recursion



Sum a list using an accumulator and tail-recursion



Sum a list using an accumulator and tail-recursion



Sum a list using an accumulator and tail-recursion



Sum a list using an accumulator and tail-recursion



Sum a list using an accumulator and tail-recursion



Sum a list using an accumulator and tail-recursion



## Non-strict functions enables pipelining



# Dataflow to Hardware

### A Latency-Insensitive Protocol



Inspired by Carloni et al. [Cao et al., Memocode 2015]

## **Input and Output Buffers**



Combinational paths broken:

Input buffer breaks ready path

Output buffer breaks data/valid path

## Larger Systems Run Just As Fast

| Splitters | Token | F <sub>max</sub> | Area Resources |    |           |
|-----------|-------|------------------|----------------|----|-----------|
|           | Bits  | MHz              | ALMs           | %  | Registers |
| 2         | 32    | 167              | 189            | 1  | 414       |
| 2         | 64    | 157              | 350            | 1  | 798       |
| 2         | 128   | 152              | 672            | 2  | 1573      |
| 32        | 128   | 137              | 10821          | 26 | 25536     |
| 64        | 128   | 140              | 21704          | 52 | 51168     |
| 4         | 64    | 158              | 700            | 2  | 1621      |
| 8         | 64    | 145              | 1409           | 3  | 3261      |
| 16        | 64    | 147              | 2826           | 7  | 6559      |
| 32        | 64    | 144              | 5682           | 14 | 13148     |
| 64        | 64    | 138              | 11404          | 27 | 26414     |
| 128       | 64    | 140              | 22914          | 55 | 53087     |

Synthesis results on an Altera Cyclone V. 166 MHz target clock rate.

Moore's Law is alive and well

 But we hit a power wall in 2005.
 Massive parallelism now mandatory

► Communication is the culprit



 Dark Silicon is the future: faster transistors; most must remain off

 Custom accelerators are the future; many approaches

 Our project: A Pure Functional Language to FPGAs





Algebraic Data Types in Hardware

Removing recursion

Functional to dataflow

Dataflow to hardware



#### Functional to Dataflow



#### Input and Output Buffers



Input buffer breaks ready path

Output buffer breaks data/valid path

usput buller breaks deserveno pe