Distributed Operation Set Architectures for low-latency ML inference using FPGAs: Introducing DOSA

Dr. Burkhard Ringlein IBM Research – Europe Zurich, Switzerland

EVEREST + DAPHNE: Workshop HiPEAC 2024 2024-01-19

© 2024 IBM Corporation



#### I guess we all know why we are here...



#### Agenda: Is one-click DNN to distributed FPGA compilation possible?



 $\rightarrow$  In this presentation, I close the gaps between ML representations (ONNX) and distributed FPGAs (cloudFPGA).

B. Ringlein / Hybrid Cloud Research / HiPEAC 2024 / © 2024 IBM Corporation

Target platform: IBM cloudFPGA (as reference for distributed FPGAs)



### **Overview:** 5 necessary steps to map ONNX to cloudFPGA







#### data center network

- Steps 1+2: Published in **FPL 2019**, **CLOUD 2021**. Open-source release: https://github.com/cloudFPGA/cFDK
- Step 3: Published in FCCM 2020, H2RC 2020. Open-source release: https://github.com/cloudFPGA/ZRLMPI
- Steps 4+5 (→ our focus today!): DOSA. Published in IEEE CAL 2023, EDGE 2023. Open-source release: https://github.com/cloudFPGA/DOSA

Developed within H2020 EVEREST

# **DOSA**: automated compilation of DNNs to distributed FPGAs

- One tool to cover large solution space, different optimizations, and major standards
  - Avoids "re-inventing the wheel": composes open-source tools, e.g., TVM, hls4ml, haddoc2, VTA
  - But: combines them in an optimal way
  - Based on roofline analysis and framework characterization
  - Decision based on performance constraints
  - Automated re-use with organic-compiler concept and operator set generators
- Automatic partitioning: Model & data parallelism
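
To make the idea of model parallelism concrete, the following is a minimal sketch of packing consecutive layers into pipeline stages, one stage per FPGA. It is not DOSA's partitioning algorithm: the function `partition_layers`, the greedy strategy, and the per-layer cost numbers are all hypothetical and chosen only for illustration.

```python
def partition_layers(layers, budget):
    """Greedily pack consecutive layers into pipeline stages.

    layers: list of (name, cost) in execution order; cost is a hypothetical
            resource estimate (here: percent of one FPGA).
    budget: maximum summed cost allowed per FPGA.
    """
    stages, current, used = [], [], 0
    for name, cost in layers:
        if cost > budget:
            raise ValueError(f"layer {name!r} does not fit on a single FPGA")
        # Start a new stage when adding this layer would exceed the budget.
        if current and used + cost > budget:
            stages.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        stages.append(current)
    return stages


# Hypothetical costs for a small CNN (layer names are illustrative).
layers = [("conv1", 50), ("pool1", 10), ("conv2", 60),
          ("fc1", 40), ("fc2", 30), ("fc3", 10)]
stages = partition_layers(layers, budget=80)
for node, stage in enumerate(stages):
    print(f"FPGA {node}: {stage}")
```

With these assumed costs, the six layers end up on three FPGAs. A real partitioner would also have to account for communication between stages and for data parallelism, which this sketch ignores.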






Example CNN from https://pytorch.org/tutorials/beginner/blitz/cifar10\_tutorial.html

### DSE based on "3D Rooflines"

- Multi-dimensional space:
  - Resources (i.e. costs)
    - but details matter: LUT, LUTRAM, BRAM, DSP, FF
  - Latency vs throughput
  - Costs of combining different libraries
- "hardware-aware"/"system-aware" optimization: DOSA doesn't consider solutions that would violate the Roofline model
- Multiple runs with different hyperparameters (e.g. "osg\_look\_ahead")
- Solution with lowest costs that fulfills target criteria is chosen
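
The selection rule in the list above (discard anything that violates the roofline, then take the cheapest solution that meets the target) can be sketched as follows. This is an illustration using the classic two-parameter roofline model, not DOSA's implementation: the `Candidate` fields, the function names, and all numbers are assumptions, and the multi-dimensional resource costs (LUT, BRAM, DSP, ...) are collapsed into a single scalar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float               # hypothetical scalar cost, e.g. share of FPGA resources
    ops_per_sample: float     # operations needed per inference
    bytes_per_sample: float   # bytes moved per inference
    est_samples_per_s: float  # throughput the candidate is estimated to reach


def roofline_limit(c, peak_ops_per_s, bandwidth_bytes_per_s):
    """Attainable samples/s under the classic roofline model."""
    intensity = c.ops_per_sample / c.bytes_per_sample  # ops per byte
    attainable_ops = min(peak_ops_per_s, intensity * bandwidth_bytes_per_s)
    return attainable_ops / c.ops_per_sample


def select(candidates, target_sps, peak_ops_per_s, bandwidth_bytes_per_s):
    """Return the cheapest candidate that respects the roofline and the target."""
    feasible = [
        c for c in candidates
        if c.est_samples_per_s <= roofline_limit(c, peak_ops_per_s, bandwidth_bytes_per_s)
        and c.est_samples_per_s >= target_sps
    ]
    return min(feasible, key=lambda c: c.cost, default=None)


# Illustrative numbers only: 100 Gop/s peak, 1.25 GB/s network bandwidth.
candidates = [
    Candidate("engine", 0.4, 1e7, 4000, 4000),       # too slow for the target
    Candidate("optimistic", 0.3, 1e7, 4000, 20000),  # claims more than the roofline allows
    Candidate("stream", 0.7, 1e7, 4000, 6000),
    Candidate("stream_big", 0.9, 1e7, 4000, 8000),
]
best = select(candidates, target_sps=5000,
              peak_ops_per_s=1e11, bandwidth_bytes_per_s=1.25e9)
print(best.name if best else "no feasible solution")
```

The cheapest candidate overall is rejected because its estimate exceeds what the roofline permits, which is the "hardware-aware" filtering step; among the remaining candidates that reach the target, the one with the lowest cost is chosen.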

#### DOSA 3D Roofline for CIFAR-10 (draft: selected\_best, node: 7, dpl: 1, opt: THROUGHPUT)





Figure legend (plot not reproduced): current Role network bandwidth; current Role peak performance; …q. perf. f. Engine arch. (w/ 5000 sps, batch 1); …q. perf. f. Stream arch. (w/ 5000 sps, batch 1).

# Debugging generated by compiler

- Distributed inference means distributed debugging
   → compiler must facilitate it
- Hence, DOSA automatically generates debug probes between IP cores
  - Because the IP cores are connected through standardized interfaces, the compiler can generate the probes easily
  - In VHDL and Tcl
- We deploy bitstreams using partial reconfiguration → debug bridge support
- (...then we still have to look at waveforms...)

Excerpt of the generated VHDL (ILA instantiation; parts cropped in the screenshot are marked `...`):

```
DBG: ila_dosa_role_0
  port map (
    clk        => piSHL_156_25Clk,
    probe0     => siNRC_Udp_Data_tdata,
    probe1     => siNRC_Udp_Data_tkeep,
    probe2(0)  => siNRC_Udp_Data_tvalid,
    probe3(0)  => siNRC_Udp_Data_tlast,
    probe4(0)  => siNRC_Udp_Data_tready,
    -- ...
    probe52    => sMPE_Debug,
    probe53    => sZRLMPI_Wrapper_Debug,
    probe54(0) => sResetApps_n,
    probe55    => sToFifo_input_0_tdata_din,
    probe56(0) => sToFifo_input_0_tdata_full_n,
    probe57(0) => sToFifo_input_0_tdata_full,
    probe58(0) => sToFifo_input_0_tdata_write,
    probe59    => sToFifo_input_0_tkeep_din,
    probe60(0) => sToFifo_input_0_tkeep_full_n,
    probe61(0) => sToFifo_input_0_tkeep_full,
    probe62(0) => sToFifo_input_0_tkeep_write,
    probe63    => sToFifo_input_0_tlast_din,
    probe64(0) => sToFifo_input_0_tlast_full_n,
    -- ...
```

Excerpt of the generated Tcl (ILA core configuration; cropped values are marked `...`):

```
# VIVADO-IP : ILA Core
set ipModName "ila_dosa_role_0"
set ipName    "ila"
set ipVendor  "xilinx.com"
set ipLibrary "ip"
set ipVersion "6.2"
set ipCf...   [list CONFIG.C_NUM_OF_PROBES 112 \
                    CONFIG.C_DATA_DEPTH 2048 \
                    CONFIG.C_PROBE0_WIDTH {64} \
                    CONFIG.C_PROBE1_WIDTH {8} \
                    CONFIG.C_PROBE2_WIDTH ...
                    CONFIG.C_PROBE3_WIDTH ...
                    CONFIG.C_PROBE4_WIDTH ...
                    CONFIG.C_PROBE5_WIDTH {64} \
                    CONFIG.C_PROBE6_WIDTH ...
                    CONFIG.C_PROBE7_WIDTH {1} \
                    CONFIG.C_PROBE8_WIDTH {8} \
                    ...
```
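
As an illustration of why standardized interfaces make probe generation mechanical, the following sketch emits both the VHDL port map and the Tcl probe-width list from one list of signals. It is not DOSA's generator: the function `generate_ila` and its output layout are assumptions modeled on the excerpt above. Only the ILA parameter names `CONFIG.C_NUM_OF_PROBES` and `CONFIG.C_PROBE<n>_WIDTH` and the signal names are taken from that excerpt.

```python
def generate_ila(instance, clock, signals):
    """Emit a VHDL port map and a Tcl config list for an ILA core.

    signals: list of (signal_name, width_in_bits), one probe per signal.
    Returns a (vhdl_text, tcl_text) tuple.
    """
    if not signals:
        raise ValueError("at least one signal is required")
    vhdl = [f"DBG: {instance}", "  port map (", f"    clk => {clock},"]
    tcl = [f"[list CONFIG.C_NUM_OF_PROBES {len(signals)} \\"]
    for i, (name, width) in enumerate(signals):
        last = i == len(signals) - 1
        # Single-bit signals are wired to bit 0 of a one-element probe vector.
        port = f"probe{i}" if width > 1 else f"probe{i}(0)"
        vhdl.append(f"    {port} => {name}" + ("" if last else ","))
        tcl.append(f"      CONFIG.C_PROBE{i}_WIDTH {{{width}}}" + (" ]" if last else " \\"))
    vhdl.append("  );")
    return "\n".join(vhdl), "\n".join(tcl)


signals = [
    ("siNRC_Udp_Data_tdata", 64),
    ("siNRC_Udp_Data_tkeep", 8),
    ("siNRC_Udp_Data_tvalid", 1),
]
vhdl_text, tcl_text = generate_ila("ila_dosa_role_0", "piSHL_156_25Clk", signals)
print(vhdl_text)
print(tcl_text)
```

Because both outputs come from the same signal list, the probe count and widths in the Tcl configuration cannot drift from the wiring in the VHDL.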

### Gains? → Evaluation

- The main problem of DNN-to-FPGA frameworks is not their technical fitness, but their usability
- DOSA tries to change that with:
  - Support of major community standards (foremost ONNX)
  - No architectural knowledge necessary at the user side
  - Automated **partitioning**
  - Automated **deployment**

#### DNN-to-FPGA frameworks: productivity analysis

| Framework                                 | Supports ONNX import | Supports distributed FPGAs | Manual scheduling or partitioning required | Automated deployment |
|-------------------------------------------|----------------------|----------------------------|--------------------------------------------|----------------------|
| **DOSA** (this research)                  | yes                  | yes                        | no                                         | yes                  |
| AIgean [23]                               | no                   | yes                        | no                                         | no                   |
| hls4ml [13], [21]                         | yes                  | no                         | no                                         | no                   |
| haddoc2 [22] (requires legacy BVLC-Caffe) | no                   | no                         | no                                         | no                   |
| Brevitas + FINN [8], [9], [53]            | no                   | (up to 2) [54]             | partly                                     | partly               |
| VitisAI [55]                              | no                   | no                         | depends on the model                       | partly               |

### DNN-to-FPGAs: some results

- DOSA allows a "one-click" design-space exploration, partitioning, compilation, and synthesis
  - Deployment automated with ZRLMPI
  - Optimization within seconds (not hours)
- E.g., a small CNN with 8-bit weights for INFaaS (i.e., batch size 1), distributed across 9 cloudFPGAs, achieves:
  - >50x speedup, >80x more efficient vs. CPU
  - >18x speedup, >30x more efficient vs. GPUs
  - End-to-end latency below 0.3ms (client-side)
- The speedup and efficiency gains of cloudFPGA stem from:
  - massive parallelism (streaming-architecture) and custom data paths
  - direct network-attachment
  - disaggregation





### **Conclusion?**

- To boost adoption of FPGAs  $\rightarrow$  holistic approach with organic compilers
  - Wide range of DNNs
  - Usable by non-FPGA experts
- Operation Set Architectures overcome the current hurdles of DNN-to-FPGA frameworks
- DOSA: One-click open-source → github.com/cloudFPGA/DOSA
  - Increase scope of potential solutions
  - Automatic distribution across FPGAs
- Efficient automated distributed DNN inference:
  - >50x speedup, >80x more efficient vs. CPU
  - >18x speedup, >30x more efficient vs. GPUs
- → An "FPGA standard stack" must be open source!



| Platform   | Measured throughput (fps) | Latency (ms) | Average power (W) | Total energy per inference (J) |
|------------|---------------------------|--------------|-------------------|--------------------------------|
| … 156 MHz) | 3,853.25*                 | 0.259        | 77.31             | 0.020                          |
| … 2.4 GHz) | 73.38                     | 13.627       | 123.69            | 1.686                          |
| … MHz)     | 211.81                    | 4.721        | 129.51            | 0.676                          |
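
The headline ratios can be recomputed from the table above. The row labels are cropped in the source, so the mapping of rows to platforms is an assumption here: first row cloudFPGA, second row CPU, third row GPU. That reading is consistent with the clock rates and with the reported ">50x" and ">18x" figures, but the source does not confirm it.

```python
# Values copied from the table; the platform keys are an assumed mapping.
rows = {
    "fpga": {"fps": 3853.25, "energy_j": 0.020},
    "cpu":  {"fps": 73.38,   "energy_j": 1.686},
    "gpu":  {"fps": 211.81,  "energy_j": 0.676},
}


def gains(base, other):
    """Throughput speedup and energy-per-inference gain of `base` over `other`."""
    return {
        "speedup": rows[base]["fps"] / rows[other]["fps"],
        "energy_gain": rows[other]["energy_j"] / rows[base]["energy_j"],
    }


vs_cpu = gains("fpga", "cpu")
vs_gpu = gains("fpga", "gpu")
print(f"vs. CPU: {vs_cpu['speedup']:.1f}x speedup, {vs_cpu['energy_gain']:.1f}x less energy")
print(f"vs. GPU: {vs_gpu['speedup']:.1f}x speedup, {vs_gpu['energy_gain']:.1f}x less energy")
```

Under the assumed row mapping, the ratios land just above the 50x/80x and 18x/30x thresholds quoted in the bullets.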

- research.ibm.com/projects/cloudfpga
- LinkedIn: burkhard-ringlein

# Appendix






As testbed for distributed edge FPGA environments: The IBM cloudFPGA Platform

- 19"x2U w/64 FPGAs
- Network-attached, disaggregated FPGAs
- 640GbE fully balanced

(more information at github.com/cloudfpga)



#### References

Sources are referenced in the slides directly.

All remaining images are from IBM DAM or IBM Websites or created by the author or the EVEREST consortium. © 2023 IBM Corporation.

### IBM cloudFPGA: Further Reading

- https://github.com/cloudFPGA
- The cloudFPGA project page at ZRL: https://www.zurich.ibm.com/cci/cloudFPGA/
- B. Ringlein, F. Abel, D. Diamantopoulos, B. Weiss, C. Hagleitner, and D. Fey, "Advancing Compilation of DNNs for FPGAs using Operation Set Architectures," in IEEE Computer Architecture Letters, 2022.
- B. Ringlein, F. Abel, D. Diamantopoulos, B. Weiss, C. Hagleitner, M. Reichenbach and D. Fey, "A Case for Function-as-a-Service with Disaggregated FPGAs," in Proceedings of the 2021 IEEE 14th International Conference on Cloud Computing (CLOUD).
- B. Ringlein, F. Abel, A. Ditter, B. Weiss, C. Hagleitner and D. Fey, "Programming Reconfigurable Heterogeneous Computing Clusters Using MPI With Transpilation," in 2020 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC).
- B. Ringlein, F. Abel, A. Ditter, B. Weiss, C. Hagleitner and D. Fey, "ZRLMPI: A Unified Programming Model for Reconfigurable Heterogeneous Computing Clusters" in 28th IEEE International Symposium On Field-Programmable Custom Computing Machines (FCCM), 2020.
- B. Ringlein, F. Abel, A. Ditter, B. Weiss, C. Hagleitner and D. Fey, "System architecture for network-attached FPGAs in the cloud using partial reconfiguration," in 29th International Conference on Field Programmable Logic and Applications (FPL), 2019.
- F. Abel, J. Weerasinghe, C. Hagleitner, B. Weiss, S. Paredes, "An FPGA Platform for Hyperscalers," in IEEE 25th Annual Symposium on High-Performance Interconnects (HOTI), Santa Clara, CA, pp. 29–32, 2017.
- F. Abel, "How do you squeeze 1000 FPGAs into a DC rack?" online at LinkedIn

#### Notices

- IBM and the IBM logo are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on ibm.com/trademark.
- Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
- The registered trademark Linux is used pursuant to a sublicense from the Linux Foundation, the exclusive licensee of Linus Torvalds, owner of the mark on worldwide basis.

