- 15 August 2024
- Eleanor Brash
For every GPU generation, the performance teams within Imagination run through a wide range of content, analysing and understanding the different workload types and their bottlenecks. This analysis revealed that many modern games spend an increasing amount of time executing post-processing algorithms to enable depth of field, bloom, blur and other effects.
Most of these post-processing passes are texture-sampling-heavy filter effects with modest ALU requirements, bottlenecked by the throughput of the Texture Processing Unit (TPU). One way to resolve this would be brute force: raise the number of TPUs relative to the USC/ALU rate. However, our analysis indicated that this was not a good strategy, for several reasons.
First, in regular render passes the ratio of ALU to TPU in D-Series GPUs was already optimal, so adding another TPU would bring no benefit: the workload would simply become ALU-limited. Meanwhile, other processing passes were TPU-heavy but also bandwidth-heavy, so boosting the TPU would not help there either, as there would be insufficient bandwidth to feed the extra TPU throughput.
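The reasoning above can be illustrated with a toy bottleneck model in which a pass takes as long as its slowest resource. This is only a sketch, not Imagination's internal performance model: the `RenderPass` fields, the per-clock rates and all workload numbers are hypothetical values chosen to make the three cases visible.

```python
from dataclasses import dataclass


@dataclass
class RenderPass:
    name: str
    alu_ops: float      # arithmetic operations in the pass (hypothetical)
    tex_samples: float  # filtered texture samples in the pass (hypothetical)
    dram_bytes: float   # bytes that must come from memory (hypothetical)


def pass_cycles(p, alu_rate, tpu_rate, bytes_per_clock):
    """Cycles for a pass, set by whichever resource is slowest."""
    return max(p.alu_ops / alu_rate,
               p.tex_samples / tpu_rate,
               p.dram_bytes / bytes_per_clock)


def tpu_doubling_speedup(p, alu_rate=1536.0, tpu_rate=48.0, bytes_per_clock=64.0):
    """Speed-up of one pass if only the TPU rate is doubled."""
    before = pass_cycles(p, alu_rate, tpu_rate, bytes_per_clock)
    after = pass_cycles(p, alu_rate, 2 * tpu_rate, bytes_per_clock)
    return before / after


passes = [
    # Balanced ALU and TPU work: the ALU becomes the limit.
    RenderPass("main scene", alu_ops=1536e3, tex_samples=48e3, dram_bytes=16e3),
    # TPU-heavy but also bandwidth-heavy: memory becomes the limit.
    RenderPass("bandwidth-heavy", alu_ops=100e3, tex_samples=48e3, dram_bytes=64e3),
    # TPU-heavy with good cache reuse: the only case that gains.
    RenderPass("cached blur", alu_ops=200e3, tex_samples=48e3, dram_bytes=8e3),
]

for p in passes:
    print(f"{p.name}: x{tpu_doubling_speedup(p):.2f}")
```

In this model only the third pass benefits from a faster TPU, which is the case the rest of this article concentrates on.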
Our teams found that post-processing workloads as well as compute image processing workloads showed the following characteristics:
- Regular processing/sampling across a region, with a large amount of re-use of sampling points which hit on the texture cache;
- 2D sampling of a single render target/texture with no LOD and no perspective.
The above two characteristics led us to implement a new TPU mode in D-Series GPUs that doubles performance, but only when the hardware detects these specific characteristics. The first characteristic matters because regular sampling with high sample reuse (e.g., moving-window filters) avoids bandwidth limits. The second matters because it lets us keep the amount of duplicated logic low, offering double the peak throughput without doubling all of the TPU logic.
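To see why a moving-window filter reuses samples so heavily, the following sketch counts how many filter taps a tile issues and how many distinct texels those taps actually touch. The tile size and filter radius are illustrative assumptions, and the function says nothing about how the hardware detects or caches such accesses.

```python
def tile_sample_reuse(tile_w, tile_h, radius):
    """Count taps and distinct texels for a (2*radius+1)^2 box filter over a tile.

    Returns (total_taps, unique_texels, reuse_factor).
    """
    touched = set()
    total_taps = 0
    for y in range(tile_h):
        for x in range(tile_w):
            # Every output pixel samples a square window around itself.
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    touched.add((x + dx, y + dy))
                    total_taps += 1
    return total_taps, len(touched), total_taps / len(touched)


# Hypothetical 16x16 tile with a 5x5 window (radius 2).
taps, unique, reuse = tile_sample_reuse(16, 16, 2)
print(taps, unique, reuse)
```

For this example each distinct texel is sampled 16 times on average, so most taps can be served from the texture cache rather than from memory.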
The result of this approach is a modest increase in TPU size but double the performance where it makes sense, while remaining in balance with overall workload characteristics. IMG D-Series GPUs deliver a true speed-up where the TPU is the limit, without adding cost for the ALU- and/or bandwidth-bound cases where the TPU was already fast enough. This means that for certain processing types, the DXT-48-1536 effectively behaves like a 96-1536, processing twice the number of bilinear-filtered texture samples per clock and hence delivering twice the execution rate of the previous-generation CXT-48-1536.
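The change in balance can be worked through as simple arithmetic. The sketch below assumes that the two numbers in a name such as 48-1536 are texture samples per clock and ALU operations per clock respectively; that reading is inferred from the comparison with "96-1536" above and is not spelled out in this article. The function name is hypothetical.

```python
def alu_ops_per_texel(alu_ops_per_clock, texels_per_clock, dual_rate=False):
    """ALU operations available per texture sample at peak rates.

    dual_rate=True models the 2D dual-rate mode as a doubled texel rate.
    """
    effective_texels = texels_per_clock * (2 if dual_rate else 1)
    return alu_ops_per_clock / effective_texels


# Assumed reading of "48-1536": 48 texels/clock, 1536 ALU ops/clock.
print(alu_ops_per_texel(1536, 48))                  # normal rate
print(alu_ops_per_texel(1536, 48, dual_rate=True))  # dual-rate mode
```

Under that assumption the ALU budget per sample drops from 32 to 16 operations in dual-rate mode, which suits filter passes that do little arithmetic per sample.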
As an example, the illustration below shows a typical mobile game with its render passes. The bar at the top, starting on the left, shows the various Vulkan® Render Passes, beginning with several preprocessing passes, typically for shadow maps, which place considerable pressure on the depth-test units. The second phase of rendering is the main scene, which in this case is a GBuffer render pass and a lighting pass. This phase accounts for the bulk of the processing time for the frame, and the ALU and TPU loads are relatively balanced; this is illustrated by the curves in red (TPU load) and in green (ALU load). Over time both show average utilisation, which is typical for a main scene with a balanced mix of ALU and TPU work.
Of most interest to us here is the last set of render passes, which are the post-processing passes. Typically, this is where bloom, blur and many other HDR-style post-processing effects are applied on top of the previous main render pass. What is notable in that zone is that the red TPU curve shoots high for many of the passes, while the green ALU curve stays very low. This indicates that the TPU is the processing bottleneck, which is exactly what the 2D dual-rate TPU is designed to address. It doubles the speed of the TPU for these workloads, reducing the rendering time of these passes by up to a factor of two and so speeding up rendering of the whole frame.
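How much the whole frame gains depends on how much of it is spent in TPU-bound passes. A minimal Amdahl's-law sketch makes this concrete; the 30% share used below is a hypothetical figure, not a measurement from the game in the illustration.

```python
def frame_speedup(tpu_bound_fraction, pass_speedup=2.0):
    """Whole-frame speed-up when only the TPU-bound share of the frame gets faster.

    tpu_bound_fraction: share of frame time spent in TPU-bound passes (0..1).
    pass_speedup: speed-up of those passes (2.0 for a doubled TPU rate).
    """
    unaffected = 1.0 - tpu_bound_fraction
    accelerated = tpu_bound_fraction / pass_speedup
    return 1.0 / (unaffected + accelerated)


# Hypothetical frame with 30% of its time in TPU-bound post-processing.
print(f"{frame_speedup(0.30):.3f}")
```

A frame that is entirely TPU-bound would approach the full factor of two, while a frame with no such passes would see no change.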
Further details on changes made to the PowerVR architecture in IMG DXT can be found in the white paper Ray Tracing for the Masses.