Why GPU Performance Efficiency Beats Peak Performance

10 March 2025
Eleanor Brash

When estimating the performance of a GPU, three metrics are typically examined at first glance: the texturing rate (GPixel/s) for graphics workloads and, for compute and AI workloads, the number of floating-point operations per second (FLOPS) and 8-bit tera operations per second (TOPS) it can handle. These headline numbers, taken alongside area data, power estimates and general feature set, help SoC designers compare the performance of various system configurations.

However, these metrics provide only theoretical performance and are not always a good representation of real-world performance. No GPU ever operates at 100% utilisation, and so the next step is to explore the GPU’s real-world, workload-specific performance, typically measured in frames per second (FPS), and to consider overall GPU utilisation. Benchmarks like Manhattan and Aztec provide a useful guide for real-world graphics performance (although they themselves are not fully representative of typical applications).

Often at this stage, different GPU architectures can produce surprising results. Those that are better at translating theoretical performance into real-world performance emerge triumphant, delivering far higher frame rates (FPS) than expected from their headline TFLOPS.

Why is FPS/TFLOPS important?

Because typically a GPU with higher TFLOPS comes with higher silicon area and higher power consumption. If a smaller GPU can deliver the same real-world performance as a theoretically more powerful GPU, designers have a choice: either offer the same performance at lower costs or choose to keep costs the same and put the extra performance and/or efficiency into the hands of end users.

With that in mind, understanding GPU performance efficiency is an essential part of understanding how a GPU will perform in an end device.
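As a rough illustration of the metric, the sketch below divides a measured frame rate by a GPU's theoretical peak TFLOPS. The function name and the figures are hypothetical and chosen only to show the arithmetic; they are not measurements of any real GPU.

```python
def fps_per_tflops(fps: float, peak_tflops: float) -> float:
    """Frames per second delivered per TFLOPS of theoretical peak compute."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return fps / peak_tflops


# Hypothetical figures for illustration only: two GPUs reach the same
# frame rate, but one needs half the theoretical compute to get there.
small_gpu = fps_per_tflops(fps=60.0, peak_tflops=1.0)
large_gpu = fps_per_tflops(fps=60.0, peak_tflops=2.0)

print(f"small GPU: {small_gpu:.1f} FPS/TFLOPS")
print(f"large GPU: {large_gpu:.1f} FPS/TFLOPS")
print(f"efficiency ratio: {small_gpu / large_gpu:.1f}x")
```

With these made-up inputs, the smaller GPU is twice as performance efficient, which is the situation in which a designer can trade silicon area and power for the same end-user frame rate.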

Imagination’s PowerVR architecture has been refined over decades to become the most performance-efficient embedded GPU IP on the market. Below, we outline the key hardware and software optimisations that enable Imagination’s GPUs to deliver up to twice the FPS/TFLOPS of competitor embedded products.

1. Large, Responsive Register Storage

At 512KB, Imagination GPUs have very large register storage within each arithmetic logic unit (ALU), typically twice what competing embedded GPU designs offer. This allows workloads to avoid lengthy load / store operations from the main GPU memory, which can negatively impact GPU utilisation and efficiency by delaying processing work.

The register banks in the ALU are designed in such a way that many registers can be accessed concurrently. This means that in every cycle, multiple units within the ALU can execute work. For example, FP32 operations can be processed alongside complex operations without any queueing for memory access. Most alternative embedded GPU architectures have limitations on register access, which creates processing stalls as data takes extra cycles to be fetched.

Imagination GPUs are also designed to handle multiple workloads simultaneously. This means that as and when a load / store is required, the pause in processing can be filled in with alternative operations, effectively avoiding latency issues.
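The effect of filling load / store pauses with other work can be sketched with a toy model. It assumes a single ALU that issues one cycle of work per clock and workloads that alternate a fixed number of compute cycles with a fixed-latency load. The function, its parameters and the cycle counts are all hypothetical; this is not a description of Imagination's actual scheduler.

```python
def alu_utilisation(num_workloads: int, compute_cycles: int,
                    load_latency: int, total_cycles: int = 10_000) -> float:
    """Fraction of cycles in which the toy ALU has work to execute."""
    ready_at = [0] * num_workloads               # first cycle each workload may run
    remaining = [compute_cycles] * num_workloads  # compute cycles left before next load
    busy = 0
    for cycle in range(total_cycles):
        for i in range(num_workloads):
            if ready_at[i] <= cycle:
                # Run one cycle of the first workload that is not waiting on memory.
                busy += 1
                remaining[i] -= 1
                if remaining[i] == 0:
                    # The workload issues a load and stalls until the data returns.
                    ready_at[i] = cycle + 1 + load_latency
                    remaining[i] = compute_cycles
                break
        # If no workload was ready, the ALU idles for this cycle.
    return busy / total_cycles


# Hypothetical workload shape: 4 compute cycles, then a 12-cycle load.
for n in (1, 2, 4):
    print(n, "workload(s):", alu_utilisation(n, compute_cycles=4, load_latency=12))
```

In this model a single workload leaves the ALU idle for most of each load, while enough concurrent workloads cover the latency entirely, so utilisation rises with the number of workloads until the ALU is always busy.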

2. Specialised Blocks Offload Primary ALUs

Imagination’s ALUs feature several fixed-function blocks that allow the GPU to offload lengthy tasks, such as address calculations, away from the primary ALUs, leaving them free to handle general workloads. In contrast, most other embedded GPU providers emulate address calculations and complex tasks on the INT32 ALUs, which lowers overall GPU performance efficiency.

3. Overall GPU Architecture Efficiency

The PowerVR architecture has been a leader in GPU efficiency from its inception thanks to its deferred rendering technique. Very early in the pipeline, Imagination GPUs analyse each frame holistically, determining which fragments are visible and processing only those that the user will see. By removing unnecessary operations as early as possible in the pipeline, Imagination GPUs lower power consumption and boost performance efficiency. Other embedded GPU architectures still process more fragments than necessary, wasting valuable computing resources and bandwidth, and thus power, on objects that will never be seen.
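The saving from resolving visibility before shading can be illustrated with a small sketch that counts shader invocations. It assumes opaque fragments only, represents each fragment as a hypothetical `(pixel, depth)` pair, and uses a made-up scene; it is a simplified model of the idea, not Imagination's hardware algorithm.

```python
def shade_immediately(fragments):
    """Shade each fragment on arrival if it passes the depth test so far."""
    depth = {}
    shaded = 0
    for pixel, z in fragments:
        if z < depth.get(pixel, float("inf")):
            depth[pixel] = z
            shaded += 1  # shaded now, but a nearer fragment may overwrite it later
    return shaded


def shade_deferred(fragments):
    """Resolve visibility for the whole set first, then shade one fragment per pixel."""
    nearest = {}
    for pixel, z in fragments:
        if z < nearest.get(pixel, float("inf")):
            nearest[pixel] = z
    return len(nearest)  # only the visible fragment of each pixel is shaded


# Hypothetical scene: three opaque layers covering the same 4 pixels,
# submitted back to front (the worst case for shading on arrival).
fragments = [((x, 0), z) for z in (3.0, 2.0, 1.0) for x in range(4)]

print("shaded on arrival:", shade_immediately(fragments))
print("shaded after visibility resolve:", shade_deferred(fragments))
```

In this scene, shading on arrival runs the shader for every layer, whereas resolving visibility first shades each pixel once regardless of submission order.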

4. Software to Maximise GPU Utilisation

While we’ve been looking at performance efficiency from a predominantly graphics perspective, much of the above applies to compute and AI applications as well. To boost performance efficiency further for AI workloads, Imagination offers a set of highly optimised compute libraries for popular operations (imgNN, imgBLAS, imgFFT) that enable programmers to maximise GPU utilisation.

Smarter performance scaling: The Imagination Advantage

The results of all these features speak for themselves. Across all graphics workloads in the chart below, the Imagination GPU surpasses the FPS/TFLOPS achieved by area-equivalent embedded competitor designs. In some cases, the performance efficiency is twice what other GPUs can offer.

Imagination GPU Performance Efficiency Chart. Based on Imagination in-house numbers. Competitor devices are run at a low clock to avoid host CPU and system bottlenecks in order to get a pure understanding of competitor GPU capabilities.

Demand for GPU performance is booming across all markets, not only for graphics experiences but now, in the era of AI, for the GPU's capacity as a flexible, parallel compute processor. Hardware designers have two options to deliver this extra performance: one is to simply build in a GPU with higher theoretical TFLOPS; the other is to select a GPU with lower theoretical TFLOPS but high performance efficiency.

To find out more about the real-world performance efficiency that Imagination GPUs can offer, book a meeting with our sales team.