# MASR: A Modular Accelerator for Sparse RNNs

Udit Gupta, Brandon Reagen, Lillian Pentecost, Marco Donato, Thierry Tambe, Alexander M. Rush, Gu-Yeon Wei, David Brooks



Harvard University

#### In this talk





Custom sparse encoding





#### RNNs can revolutionize interactions with tech



#### Must deploy RNNs onto resource constrained HW



#### RNNs levy high inference cost

#### Large memory footprint

Tens of MBs for ASR

#### High compute footprint

Tens of GFLOPs for ASR

#### High energy footprint

Billions of memory accesses for ASR





















#### RNN cost scales with input length (timesteps)





#### With increasing number of timesteps:

- Number of matrix-vector operations increases (FLOPs)
- Activation storage increases (area)
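To make the scaling concrete, here is a rough cost model for a single-layer vanilla RNN. It is only a sketch: the function name, the layer sizes, and the assumption that every hidden state is kept are illustrative and do not come from the slides.

```python
def rnn_inference_cost(timesteps, hidden_size, input_size):
    """Rough cost model for a single-layer vanilla RNN (illustrative only)."""
    # Each timestep performs two matrix-vector products:
    # W_x (hidden x input) @ x_t and W_h (hidden x hidden) @ h_{t-1}.
    macs_per_step = hidden_size * input_size + hidden_size * hidden_size
    flops = 2 * macs_per_step * timesteps  # one multiply-accumulate = 2 FLOPs
    # Assumption: every hidden state is kept (e.g. to feed a following layer),
    # so activation storage grows linearly with the number of timesteps.
    activation_values = hidden_size * timesteps
    return {
        "matvecs": 2 * timesteps,
        "flops": flops,
        "activation_values": activation_values,
    }


short = rnn_inference_cost(timesteps=100, hidden_size=512, input_size=128)
long = rnn_inference_cost(timesteps=200, hidden_size=512, input_size=128)
print(long["flops"] / short["flops"])  # doubling the timesteps doubles the FLOPs
```

Under this model both the FLOP count and the activation storage are linear in the number of timesteps, which is the trend the slide describes.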

#### RNNs pose unique challenges

Solutions for CNNs optimize static weights



#### Limitations of current sparse DNN accelerators

**EIE: Efficient Inference Engine on Compressed DNNs** 

Song Han et al., ISCA 2016, citation count: 909



Reduces weight footprint by 3x

Does not compress activations (up to 3x savings)

Does not scale (over 2x savings at high parallelism)

| Problem | Solution |
|---|---|
| Large memory footprint: static weights and dynamic activations | Logic centric sparse encoding |
| Irregular, sparse computation makes parallelism hard | Scalable sparse encoding architecture; accelerator to exploit parallelism |
| High performance with parallelism and irregularity is hard | Work stealing for load balancing |

#### Memory centric encoding

[Figure: a sparse weight matrix encoded in Compressed Sparse Row format, with row pointers and column pointers.]

Current state-of-the-art (Song Han et al., ISCA 2016):

- Pressures memory system (2 pointers per weight)
- Static weight encoding computed offline
- Activations generated at run-time; uncompressed
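For readers unfamiliar with the memory centric format, the sketch below builds a textbook Compressed Sparse Row (CSR) encoding and uses it for a matrix-vector product. It assumes plain CSR as commonly defined, not the exact format of the cited accelerator, and the example matrix is made up for illustration.

```python
def to_csr(matrix):
    """Encode a dense matrix (list of rows) as textbook CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for c, w in enumerate(row):
            if w != 0:
                values.append(w)   # non-zero weight
                col_idx.append(c)  # column pointer stored next to each weight
        row_ptr.append(len(values))  # row pointer: where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a dense vector x."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]  # every access goes through a pointer
        y.append(acc)
    return y


# Hypothetical 4x4 sparse weight matrix.
W = [[0, 0, 3, 0],
     [5, 0, 0, 7],
     [0, 0, 0, 0],
     [0, 2, 0, 0]]
values, col_idx, row_ptr = to_csr(W)
print(values, col_idx, row_ptr)
print(csr_matvec(values, col_idx, row_ptr, [1, 2, 3, 4]))
```

The pointer arrays are stored alongside the weights, which is the memory pressure the slide refers to, and the encoding is built once ahead of time, so it does not naturally cover activations produced at run-time.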

#### Logic centric encoding

Sparse encoding: binary mask (0 0 1 1 1 1 0 0) + logic

Proposed solution:

- Relieves memory pressure (single pointer)
- Computes the sparse address at **run-time**
- Weights and activations are **compressed**

















Can compute address of sparse weight/activation stored compactly in memory!
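One way such an address computation could work is sketched below: non-zero values are packed densely, and the address of an element is the number of set mask bits before its position. The prefix-count scheme and all function names are assumptions made for illustration; the slides only state that a binary mask plus logic is used.

```python
def encode_with_mask(dense):
    """Split a dense vector into a binary mask and densely packed non-zeros."""
    mask = [1 if v != 0 else 0 for v in dense]
    packed = [v for v in dense if v != 0]
    return mask, packed


def sparse_address(mask, position):
    """Address of `position` in the packed array, or None if the value is zero.

    Assumption: the address is the count of set mask bits before `position`
    (a prefix popcount), which hardware could compute with simple logic.
    """
    if mask[position] == 0:
        return None
    return sum(mask[:position])


def masked_dot(w_mask, w_packed, a_mask, a_packed):
    """Dot product that only touches positions where both operands are non-zero."""
    acc = 0
    for i in range(len(w_mask)):
        if w_mask[i] and a_mask[i]:
            w = w_packed[sparse_address(w_mask, i)]
            a = a_packed[sparse_address(a_mask, i)]
            acc += w * a
    return acc


# Weights with the mask 0 0 1 1 1 1 0 0; the values themselves are made up.
w_mask, w_packed = encode_with_mask([0, 0, 4, 1, 3, 2, 0, 0])
a_mask, a_packed = encode_with_mask([1, 0, 2, 0, 0, 5, 0, 7])
print(w_mask, w_packed)
print(masked_dot(w_mask, w_packed, a_mask, a_packed))
```

Because the mask is cheap to build, the same routine can encode activations as they are produced, which an offline pointer-based encoding cannot do.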





| Problem | Solution | Optimizes |
|---|---|---|
| Large memory footprint: static weights and dynamic activations | Logic centric sparse encoding | Area |
| Irregular, sparse computation makes parallelism hard | Scalable sparse encoding architecture; accelerator to exploit parallelism | |
| High performance with parallelism and irregularity is hard | Work stealing for load balancing | |

#### Memory centric encodings do not scale






Song Han et al., ISCA 2016




#### Proposed: parallelize logic centric encoding



Single MASR Lane




#### MASR's sparse encoding improves scalability

[Figure: memory (MiB) versus number of MACs/PEs (32 to 512), broken down into weights, activations, row pointers, and weight mask, for the memory centric and logic centric encodings.]

#### **Takeaways**

Scalable sparse encoding and architecture

- Enables highly parallel execution with a varying number of MACs/PEs





#### Parallelism within matrix-vector multiplication



#### MASR micro-architecture: Parallelizing across input and output neurons





Composed of 2D array of lanes

- Horizontal lanes parallelize output neurons
- Vertical lanes parallelize input neurons

#### PEs composed of neighboring horizontal lanes

- Share activation register file (area, power, load time)
- Private weight and mask SRAM within lane (decoupled to enable high-bandwidth access)
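The lane organization above can be sketched in software as a matrix-vector product split over a 2D grid. The interleaved assignment of neurons to lanes and the function name are assumptions for illustration; the slides do not specify how rows and columns are mapped to lanes.

```python
def lane_matvec(weights, activations, n_horizontal, n_vertical):
    """Matrix-vector product partitioned over a 2D grid of lanes (sketch)."""
    n_out, n_in = len(weights), len(activations)
    y = [0] * n_out
    for h in range(n_horizontal):    # horizontal lanes split the output neurons
        for v in range(n_vertical):  # vertical lanes split the input neurons
            # Assumption: neurons are interleaved across lanes (strided).
            for r in range(h, n_out, n_horizontal):
                partial = 0
                for c in range(v, n_in, n_vertical):
                    w, a = weights[r][c], activations[c]
                    if w != 0 and a != 0:  # only non-zero pairs need a MAC
                        partial += w * a
                y[r] += partial  # partial sums from vertical lanes are reduced
    return y


# Hypothetical sparse weights and activations.
W = [[0, 2, 0, 1],
     [3, 0, 0, 0],
     [0, 0, 4, 0],
     [1, 0, 0, 5]]
x = [1, 0, 2, 3]
print(lane_matvec(W, x, n_horizontal=2, n_vertical=2))
```

In this sketch the two outer loops run sequentially; in hardware each (h, v) pair would be a separate lane working concurrently, so the amount of non-zero work per lane determines how well the lanes stay balanced.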









Recurrent Neural Network







MASR is *2 orders* of magnitude faster than CPU

Fits our *on-chip* area target for mobile devices



[Figure legend: LANESx64, LANESx256, LANESx1024]

#### **Takeaways**

Parallelism can be configured to target:

- High performance
- Energy efficiency
- Area efficiency



| Problem | Solution | Optimizes |
|---|---|---|
| Large memory footprint: static weights and dynamic activations | Logic centric sparse encoding | Area |
| Irregular, sparse computation makes parallelism hard | Scalable sparse encoding architecture; accelerator to exploit parallelism | Performance, area, energy |
| High performance with parallelism and irregularity is hard | Work stealing for load balancing | |



#### gating sources of inefficiency

Increasing the number of parallel MACs/lanes from 64 to 1024 (16x) improves performance by only 5.5x

#### 30% MAC utilization

Remainder spent on stalls/idles due to load imbalance

Some lanes get 1.5x more work (non-zeros) to process

Lanes that finish early can steal work from neighboring lanes that are straggling behind

- Horizontal load balancing requires duplicating weights
- Vertical load balancing requires duplicating weights and activation register files

Vertical load balancing better targets load imbalance in dynamic activations

- Up to 1.9x speedup (LANESx1024)
- Requires duplicating 10% of weight storage and activation register files
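The effect of neighbor work stealing can be illustrated with a small cycle-level model. This is a simplified sketch, not MASR's actual arbitration: it assumes one non-zero is processed per lane per cycle, that an idle lane may take one item per cycle from its busiest immediate neighbor, and that the stolen weights and activations are already duplicated locally.

```python
def cycles_without_stealing(work):
    """All lanes wait for the slowest one."""
    return max(work)


def cycles_with_stealing(work):
    """Idle lanes steal one item per cycle from their busiest neighbor."""
    remaining = list(work)
    cycles = 0
    while any(remaining):
        cycles += 1
        busy = [r > 0 for r in remaining]  # snapshot at the start of the cycle
        for i in range(len(remaining)):
            if busy[i]:
                remaining[i] -= 1          # busy lanes process their own work
        for i in range(len(remaining)):
            if not busy[i]:
                neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(remaining)]
                victim = max(neighbors, key=lambda j: remaining[j])
                if remaining[victim] > 0:
                    remaining[victim] -= 1  # idle lane processes a stolen item
    return cycles


def mac_utilization(work, cycles):
    """Fraction of lane-cycles spent doing useful MACs."""
    return sum(work) / (len(work) * cycles)


# Hypothetical per-lane non-zero counts: some lanes get 1.5x more work.
work = [4, 6, 4, 6]
base = cycles_without_stealing(work)
steal = cycles_with_stealing(work)
print(base, steal)
print(mac_utilization(work, base), mac_utilization(work, steal))
```

The model ignores the cost named in the slides, namely the duplicated weight storage and activation register files that stealing requires.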



| Problem | Solution | Optimizes |
|---|---|---|
| Large memory footprint: static weights and dynamic activations | Logic centric sparse encoding | Area |
| Irregular, sparse computation makes parallelism hard | Scalable sparse encoding architecture; accelerator to exploit parallelism | Performance, area, energy |
| High performance with parallelism and irregularity is hard | Work stealing for load balancing | Performance |

**Over state-of-the-art, MASR provides:**

- 3x area reduction
- 3x energy reduction
- 2x performance improvement

#### Scalable acceleration of sparse RNNs is possible!



Stay tuned...

#### MASR: A Modular Accelerator for Sparse RNNs

<u>Udit Gupta</u>, Brandon Reagen, Lillian Pentecost, Marco Donato, Thierry Tambe, Alexander M. Rush, Gu-Yeon Wei, David Brooks



ugupta@g.harvard.edu

Thanks for listening!

