## BetterTogether



## An Interference-Aware Framework for Fine-grained Software Pipelining on Heterogeneous SoCs

Yanwen Xu, Rithik Sharma, Zheyuan Chen, Shaan Mistry, Tyler Sorensen







# Motivation: Accelerating Computation at **Edge**



- Lower latency
- Energy efficiency
- Privacy benefits



Google Pixel

NVIDIA Jetson Thor



## Motivation: Accelerating Computation at Edge



Stages depends on data from previous stages

## Motivation: Accelerating Computation at Edge





Google Tensor G2



## PU Profiling

Mali-G710 MP7

2 Cortex-X1

2 Cortex-A78

4 Cortex-A55



All Available PUs on Pixel

 Ran each stage on each available PUs



## PU Profiling

Mali-G710 MP7

2 Cortex-X1

2 Cortex-A78

4 Cortex-A55



All Available PUs on Pixel

 Ran each stage on each available PUs



## PU Profiling

Mali-G710 MP7

- 2 Cortex-X1
- 2 Cortex-A78
- 4 Cortex-A55



All Available PUs on Pixel

- Ran each stage on each available PUs
- Found optimal Stage->PU mapping

























#### Challenge I: Interference

- When PUs fully utilized
- Slowdowns and Speedups\* due to
  - Resource contention
  - Dynamic voltage and frequency scaling (DVFS)
  - Thermal throttling
  - Power management









## Finding the Optimal Schedule is Hard

• Schedule = mapping from program stages to appropriate PU

| Application Stages |                  |  |  |  |
|--------------------|------------------|--|--|--|
| <b>S1</b>          | Morton Encoding  |  |  |  |
| S2                 | Sort             |  |  |  |
| S3                 | Build Radix Tree |  |  |  |
|                    |                  |  |  |  |
| S7                 | Link Nodes       |  |  |  |





#### Challenge II: Portability

- Large design exploration space
  - e.g., 9 stage AlexNet  $5^9 \approx 1.9 \text{ M}$  potential schedules
  - ~37 years for Google Pixel 7a
- Schedules are not portable
  - Optimal schedule on Pixel does not work on NVIDIA Jetson



The scheduling framework need to be **portable** and **flexible**, and **suitable** for rapid development



Apple A18 SoC

5 GPU-cores

8 NPU-cores

2 P-cores

4 E-cores



Google Pixel 7a

6 E-cores

7 GPU-cores

1 TPU

2 P-cores

2 M-cores

4 E-cores





• A performance modeling approach that accounts for intra-application interference























- A performance modeling approach that accounts for intra-application interference
- An end-to-end static pipeline generator that works across a variety of devices













- A performance modeling approach that accounts for intra-application interference
- An end-to-end static pipeline generator that works across a variety of devices

Consists of 3 major components



- A performance modeling approach that accounts for intra-application interference
- An end-to-end static pipeline generator that works across a variety of devices

- Consists of 3 major components
  - BetterTogether Profiling
    - Attack Challenge I (Interference)













- A performance modeling approach that accounts for intra-application interference
- An end-to-end static pipeline generator that works across a variety of devices

- Consists of 3 major components
  - BetterTogether Profiling
    - Attack Challenge I (Interference)
  - BetterTogether Optimizer
    - Attack Challenge II (Portability)













- A performance modeling approach that accounts for intra-application interference
- An end-to-end static pipeline generator that works across a variety of devices

- Consists of 3 major components
  - BetterTogether Profiling
    - Attack Challenge I (Interference)
  - BetterTogether Optimizer
    - Attack Challenge II (Portability)
  - BetterTogether Implementor
    - Efficient static heterogenous pipeline execution





Affinity Map

4x Little Cores (PU1)

2x Medium Cores (PU2)

2x Big Cores (PU3)

1x iGPU (PU4)







Stage 7

Build

Octree



```
void run_stage_1_cpu(in, out, N) {
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        out[i] = morton32(in[i]);
}</pre>
```

CPU Code (e.g., OpenMP)

```
global___ void run_stage_1_gpu(in, out, N) {
   int idx = threadIdx.x + blockDim.x * blockIdx.x;
   int stride = blockDim.x * gridDim.x;
   for (int i = idx; i < N; i += stride)
      out[i] = morton32(in[i]);
}</pre>
```



Reconstruction

Decompose

into

Radix Tree

Morton

Encoding

Sorting

Duplicates

Stage 5

Edge

Count

Prefix Sum

Octree



```
void run_stage_1_cpu(in, out, N) {
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        out[i] = morton32(in[i]);
}</pre>
```

CPU Code (e.g., OpenMP)

```
global___ void run_stage_1_gpu(in, out, N) {
   int idx = threadIdx.x + blockDim.x * blockIdx.x;
   int stride = blockDim.x * gridDim.x;
   for (int i = idx; i < N; i += stride)
      out[i] = morton32(in[i]);
}</pre>
```



## BT-Profiling - Interference aware profiling

- While profiling each {PU × Stage} pair:
  - Concurrently execute similar stages on other PUs
  - Simulate **whole-application** execution to capture resource contention





**Overcome** 

#### **BT-Optimizer**

 We express our optimization problem as linear constrains

d) Notation and Decision Variables:

• We propose a **three-step** optimization approach





Raw BT Profiling results





| Interference-aware<br>Profiling Table |     |     |   |           |  |
|---------------------------------------|-----|-----|---|-----------|--|
|                                       | S1  | S2  | : | <b>S7</b> |  |
| PU0                                   | 2.6 | 3.3 |   | 5.8       |  |
| PU1                                   | 0.8 | 1.5 |   | 1.9       |  |
| PU2                                   | 0.6 | 1.4 |   | 2.1       |  |
| PU3                                   | 0.8 | 9.0 |   | 1.5       |  |

BT Profiling Table



## BT-Optimizer Step 1 - Minimizing Pipeline Bubbles



- Improve utilization by reducing idle gaps (pipeline bubbles) across PUs
- By **reducing bubbles**, we **keep all PUs busy** and improve pipeline throughput.



### BT-Optimizer Step 2: Optimizing Latency

- High utilization ≠ low latency
- Generate K schedules (e.g., k = 20),
  - each with a different assignment of stages to PUs.
- Minimize latency



### BT-Optimizer Optional Step 3: Autotuning



## BT-Optimizer Optional Step 3: Autotuning





#### Workloads Evaluated

- Three edge compute vision tasks
  - 1. AlexNet-dense
  - 2. AlexNet-sparse
  - 3. Octree Construction
- Common in resource-constrained environments







- AlexNet-dense
  - Dense linear algebra
  - CIFAR-10 dataset

- AlexNet-sparse
  - Sparse linear algebra
  - Pruned w/ CONDENSA\*
  - Stored in CSR

- Octree
  - Tree traversals
  - Irregular memory access
  - Sorting
  - Prefix Sum



#### Platforms Evaluated OpenMP Vulkan









Less Powerful GPUs Mix CPUs

> **Powerful GPUs** Little ARM CPUs



| Platform                                     | Backe<br>nd    | CPU                                                                  | CPU<br>Frequency                         | Integrated<br>GPU |
|----------------------------------------------|----------------|----------------------------------------------------------------------|------------------------------------------|-------------------|
| Google Pixel 7A                              | Vulkan         | 2x Cortex-X1<br>2x Cortex-A78<br>4x Cortex-A55                       | 2.85 GHz<br>2.35 GHz<br>1.80 GHz         | Mali-G710<br>MP7  |
| OnePlus 11                                   | Vulkan         | 1x Coretex-X3<br>2x Coretex-A715<br>2x Cortex-A710<br>3x Cortex-A510 | 3.2 GHz<br>2.8 GHz<br>2.8 GHz<br>2.0 GHz | Adreno 740        |
| NVIDIA Jetson Orin<br>Nano                   | CUDA<br>Vulkan | 6x Cortex-A78AE                                                      | 1.7 GHz                                  | Ampere GPU        |
| *NVIDIA Jetson Orin<br>Nano (low-power mode) | CUDA<br>Vulkan | 4x Cortex-A78AE                                                      | ~0.85 GHz                                | Ampere GPU        |

<sup>\*</sup> In low-power mode, 2 cores are shutdown, and overall CPU frequence is reduced by half

#### Results Overview



- Geomean speedup of 2.14x across all workloads, w/ peak of 7.59x
- In 1 case, slowdown

BetterTogether
produces predictions
that closely match
measured execution
time



Without BT show discrepancies



predicted and measured execution

#### We have additional insights in the paper

- Preliminary results on
  - Using Google's EdgeTPU
  - Implemented using nnapi
  - Showing a 1.25x speedup for AlexNet-dense application on top of existing results
  - Showcasing the flexibility and extensibility of *BetterTogether*



#### Conclusion

- We propose **BetterTogether** 
  - An **interference-aware profiling** method that produce accurate profiling tables by accounting for *intra-application interference*
  - an end-to-end static pipeline generator for edge SoCs
- Using BetterTogether, we implemented 3 class of applications
- Evaluated across 4 diverse devices, achieving up to 7.59x (geo. 2.14x)



Yanwen Xu, Rithik Sharma, Zheyuan Chen Shaan Mistry, Tyler Sorensen {yxu83, riksharm, zchen406, sdmistry, tyler.sorensen}@ucsc.edu





**Open-Source Repo** 

github.com/ucscredwood/better-together





# Backup slides



Apple A18 SoC 90mm2 @ TSMC N3E

- big.LITTLE CPU
- Integrated GPU
- Domain Specific Accelerators (DSA)

Modern SoCs integrate diverse Processing Units (PU)

Image by ChipWise: https://chipwise.tech/our-portfolio/apple-a18-a18-pro-die-shot/



Image by ChipWise: https://chipwise.tech/our-portfolio/apple-a18-a18-pro-die-shot/



## Profiling-guided approach (isolated benchmarking)



- Given N stages
- Given M types of PU

Developer



Benchmarking each {*PUx stage*} pair on the target system





Get a N x M profiling table





Optimize

# Profiling-guided approach (isolated benchmarking)



#### BT Implementor

- We define Task
  - light weight, pointing to CPU/iGPU shared memory

 Using concurrent Queue to pass Tasks around stages.

Each chunk process the incoming tasks in respective type for Cores



Processed on 4 small cores

CPU use thread affinity



Processed on GPU

 BetterTogether yields higher correlations



| CIFAR-D- | 0.9740  | 0.9497 | 0.9481 | 0.9472      | 0.9547 |
|----------|---------|--------|--------|-------------|--------|
| CIFAR-S- | 0.9678  | 0.8887 | 0.7005 | 0.7325      | 0.8224 |
| Tree     | 0.9816  | 0.8220 | 0.6532 | 0.6839      | 0.7852 |
| Avg      | 0.9745  | 0.8868 | 0.7673 | 0.7879      | 0.8541 |
|          | OnePlus | Google | Jetson | Jetson (LP) | Avg.   |

BE

0.9

-0.8