
Why Your Heterogeneous Compute System Isn’t Performing… and What to Do About It

Ted Hazell
Jun 10, 2025  |  3 min read

Are you designing multi-core or hybrid CPU/GPU systems, but still not hitting your performance targets? You’re not alone. As system architects strive to build ever more powerful SoCs, the focus is too often on compute abundance. More cores, faster engines, more AI acceleration.

But here’s the reality: if you can’t feed it, you can’t use it. 

At Andes RISC-V Con 2025, we teamed up with Baya Systems to dive deep into this very challenge, and the result was eye-opening. Using Baya’s CacheStudio, the two companies modelled cache behaviour across CPU, GPU, and mixed compute systems.

The aim here is to uncover why real-world heterogeneous compute performance often stalls despite all that horsepower. 

So, let’s break down what we found, and how it can help you build better, faster, more efficient systems.

For a deeper dive into the data, including full cache analysis charts and workload breakdowns, you can download the complete technical paper here.

The Hidden Bottleneck: Data Movement, Not Compute Capacity 

Modern SoCs are no longer limited by raw compute potential. Instead, they are increasingly constrained by how efficiently they move data between processing elements and memory hierarchies. The integration of CPUs, GPUs, and accelerators on a single die does not automatically yield performance gains. In fact, without careful architectural coordination, it can introduce contention, latency, and cache inefficiencies. 
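A quick back-of-the-envelope average memory access time (AMAT) calculation shows why. The sketch below uses assumed, round latency figures rather than anything measured in this study, but it illustrates how a few points of cache hit rate swing effective latency far more than extra compute can claw back.

```python
# Back-of-the-envelope AMAT for a three-level cache hierarchy.
# Latencies are assumed round numbers (cycles), not figures from the study.
L1_LAT, L2_LAT, L3_LAT, DRAM_LAT = 4, 14, 40, 200

def amat(l1_hit, l2_hit, l3_hit):
    """Average cycles per access when misses fall through level by level."""
    l1_miss = 1 - l1_hit
    l2_miss = l1_miss * (1 - l2_hit)
    l3_miss = l2_miss * (1 - l3_hit)
    return L1_LAT + l1_miss * L2_LAT + l2_miss * L3_LAT + l3_miss * DRAM_LAT

# A ~3-point drop in L1 hit rate costs far more than it sounds.
print(round(amat(0.978, 0.25, 0.30), 2))  # ~7.3 cycles per access
print(round(amat(0.945, 0.50, 0.30), 2))  # ~9.7 cycles per access
```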

In this study, cache behaviour was used as a diagnostic lens to highlight these limitations and to better understand the trade-offs involved in cache sizing, coherence management, and memory access patterns across heterogeneous compute elements. While the findings offer useful insights, they should be read within the context of this study’s specific scope and assumptions. Here are some of our findings:

CPU-Only Workloads: Prioritise Temporal Locality and Layered Cache Strategy 

L1 Cache: Performance improvements scale predictably with cache size. Increasing L1 from 16 KB to 64 KB raised hit rates from ~94.5% to ~97.8%. This confirms that CPU-bound tasks exhibit strong temporal locality, where recent memory accesses are likely to be reused shortly. 

L2 Cache: Hit rates were inversely correlated with L1 sizing. As L1 absorbs more requests, L2’s utilisation decreases, dropping from ~50–56% (with 16 KB L1) to ~14–28% (with 64 KB L1). This highlights that L2 should be optimised for coherence and fallback latency rather than sheer capacity.

L3 Cache: Hit rates remained relatively modest (20–35%), with the primary role being inter-core coherence and DRAM traffic mitigation. 

The implications: For CPU-bound workloads, optimal performance is achieved by focusing on a well-sized private L1 per core and tuning L2 for specific latency/coherence trade-offs. L3 becomes relevant mainly in multi-core or shared-memory contexts where DRAM pressure or coherence traffic is high.
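To make the temporal-locality point concrete, here is a toy fully associative LRU model run over a synthetic trace. It is a rough sketch with made-up working-set and line sizes, not a stand-in for CacheStudio, but it reproduces the same shape: hit rate climbs with L1 capacity when recently used lines are revisited often.

```python
from collections import OrderedDict
import random

def lru_hit_rate(trace, capacity_lines):
    """Hit rate of a fully associative LRU cache over a line-address trace."""
    cache, hits = OrderedDict(), 0
    for line in trace:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # refresh recency on a hit
        else:
            cache[line] = None
            if len(cache) > capacity_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(trace)

# Synthetic CPU-like trace: most accesses revisit a recently used working
# set (temporal locality); the rest wander and slowly rotate that set.
random.seed(0)
working_set = list(range(512))             # assumed 512-line hot set
trace = []
for _ in range(200_000):
    if random.random() < 0.95:
        trace.append(random.choice(working_set))
    else:
        new_line = random.randrange(1_000_000)
        working_set[random.randrange(len(working_set))] = new_line
        trace.append(new_line)

for kb in (16, 32, 64):                    # assume 64-byte cache lines
    lines = kb * 1024 // 64
    print(f"{kb} KB L1: {lru_hit_rate(trace, lines):.3f}")
```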

GPU-Only Workloads: Size Alone Does Not Compensate for Irregular Access Patterns 

L1 Cache: GPU workloads showed lower L1 hit rates than CPUs, rising from ~54% to ~73% as cache size increased from 16 KB to 64 KB. The limited benefit is attributed to the divergent and scattered access patterns typical of high-parallelism workloads.

L2 Cache: Performance sharply degraded with larger L1 caches. At 16 KB L1, L2 hit rates reached ~55%, but fell to 6–7% when L1 was increased to 64 KB. This suggests that oversizing upper cache layers can disrupt downstream reuse opportunities. 

L3 Cache: Across all configurations, L3 remained underutilised, peaking at ~2.2% hit rate. This likely reflects streaming data patterns and limited inter-thread locality. 

The implications: GPU memory hierarchy performance is heavily dependent on software-level access optimisation, such as local store usage, tiling, and explicit synchronisation, rather than relying on traditional cache layering. Hardware improvements must be complemented by workload-aware programming. 
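The same toy LRU model also makes the access-pattern point: for GPU-style traversals, hit rate is set far more by how the data is walked than by how big the cache is. The sketch below compares a strided, column-order walk of a matrix with a tiled walk of the same data; the matrix, tile, and cache sizes are illustrative assumptions, not configurations from the study.

```python
from collections import OrderedDict

def lru_hit_rate(trace, capacity_lines):
    """Same toy fully associative LRU model as the CPU sketch above."""
    cache, hits = OrderedDict(), 0
    for line in trace:
        if line in cache:
            hits += 1
            cache.move_to_end(line)
        else:
            cache[line] = None
            if len(cache) > capacity_lines:
                cache.popitem(last=False)
    return hits / len(trace)

# A 1024 x 1024 row-major matrix of 4-byte elements with 64-byte lines,
# so 16 elements share a cache line. All sizes are illustrative.
N, TILE, PER_LINE = 1024, 32, 16

def line_of(i, j):
    return (i * N + j) // PER_LINE

# Strided, column-order walk: every access lands on a different line,
# and each line comes back around long after it has been evicted.
strided = [line_of(i, j) for j in range(N) for i in range(N)]

# Tiled walk: finish a TILE x TILE block before moving on, so the
# handful of lines it touches are reused while still resident.
tiled = [line_of(ii + i, jj + j)
         for ii in range(0, N, TILE) for jj in range(0, N, TILE)
         for i in range(TILE) for j in range(TILE)]

capacity = 32 * 1024 // 64                 # a 32 KB cache, in lines
print(f"strided walk: {lru_hit_rate(strided, capacity):.3f}")
print(f"tiled walk:   {lru_hit_rate(tiled, capacity):.3f}")
```

The strided walk misses almost everywhere regardless of capacity, while the tiled walk recovers most of the reuse, which is exactly the kind of software-level optimisation the hardware cannot provide on its own.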

Mixed Workloads: Hierarchical Coordination Becomes Critical 

L1 Cache: Hit rate improvements were observed across CPU and GPU threads, with increases from ~94% to ~97% between 16 KB and 64 KB. 

L2 Cache: Behaviour was highly configuration sensitive. For example, at 256 KB L2 with 16 KB L1, L2 hit rate was 61.7%. However, with 64 KB L1 and 64 KB L2, the hit rate dropped to 23.2%, suggesting cache eviction patterns must be jointly considered. 

L3 Cache: L3 provided substantial benefit in scenarios where L1 and L2 were insufficient. With 1024 KB L3, hit rates reached up to 57% for lower L1/L2 configurations. 

DRAM Traffic: Memory access rates declined as cache layers were tuned in tandem, falling from ~385K accesses at minimal cache sizes to ~328K at optimised configurations.

The implications: In heterogeneous environments, cache design cannot be isolated by engine type. The interplay between layers and engines must be carefully architected. L3, often underestimated, becomes vital in reducing DRAM pressure and improving system-wide responsiveness. 
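One reason the layers must be tuned together is that miss rates compound multiplicatively down the hierarchy. The snippet below chains hit rates in that way to estimate how many requests fall through to DRAM; the inputs are illustrative round numbers, and the study’s absolute figures come from full trace simulation in CacheStudio rather than this simple formula.

```python
def dram_traffic(requests, l1_hit, l2_hit, l3_hit):
    """Requests that miss every cache level and fall through to DRAM."""
    return requests * (1 - l1_hit) * (1 - l2_hit) * (1 - l3_hit)

# Illustrative round numbers only: each level's miss rate scales the
# traffic seen by the level below it, so sizing decisions compound.
requests = 1_000_000
print(int(dram_traffic(requests, 0.94, 0.25, 0.20)))  # uncoordinated sizing
print(int(dram_traffic(requests, 0.97, 0.25, 0.55)))  # jointly tuned hierarchy
```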

Design for Data Flow, Not Just FLOPS

This analysis reinforces a key principle for system architects: performance scaling in heterogeneous compute environments is dictated not by the number of engines, but by how those engines interact with shared memory and interconnect infrastructure. 

CacheStudio served not as an end-goal, but as a proxy, helping expose subtle performance degradations and guiding better architectural decisions early in the design lifecycle. 

Architect for Interaction, Not Isolation

Across CPU, GPU, and mixed workloads, the research highlights the following: 

CPU-only workloads benefit from targeted private L1 and latency-optimised L2 configurations.

GPU-only workloads require architectural support for divergent memory access and software-guided optimisation.

Mixed workloads benefit most from L3 coherence buffers and balanced cache layering.

System-level profiling is essential to anticipate memory pressure and guide cache hierarchy design.

The lesson is clear. Smarter design beats brute force. By focusing on data flow and memory coordination, engineers can unlock the full potential of heterogeneous compute. 

Download the full technical paper from Imagination and Baya’s joint Andes RISC-V Con 2025 session below:  

Looking to explore the hardware IP that makes this possible? Visit www.imaginationtech.com or www.bayasystems.com. 


About the Author

Ted is the Web & Marketing Operations Manager at Imagination Technologies, overseeing the performance, optimisation, and strategy behind the company’s core websites and digital marketing initiatives. With a background in journalism and content marketing, and a strong interest in gaming, technology, and innovation, Ted brings a sharp eye and strategic insight to content that reflects Imagination’s cutting-edge work.

