
How to Optimise for Compute Tasks on Imagination GPUs

Written by Javier Bizcocho | Sep 30, 2025 10:29:16 AM

The latest refresh of the Developer Documentation (see previous blog post) includes a brand new section that shows developers how to nail performance when running compute tasks on Imagination GPUs.

GPU cores are known for being exceptionally efficient at running compute workloads, particularly when a developer optimises the software for the device. They are designed to handle workloads that apply the same piece of code across many threads, where operations differ only in their input but follow the same steps, instruction for instruction. While this kind of architecture and processing model was first designed to accelerate modern 3D graphics, it maps incredibly well onto today’s AI models, particularly onto tasks like matrix multiplication and convolution.

The Imagination GPU architecture itself consists of highly programmable cores that allow for the efficient, high-performance execution of general-purpose compute. The nature of these cores varies depending on the underlying architecture version, and details can be found here. They all support OpenGL ES 3.2, OpenCL 3.0 and Vulkan 1.4.

Our Developer Documentation now gives developers the information they need to make the right decisions when working on our architecture, whichever APIs and programming languages they prefer. When this knowledge is combined with our other developer assets (such as the compute libraries and compiler), developers have what it takes to achieve high utilisation, rapid performance and power efficiency.

Here are our top ten tips for optimising compute performance on Imagination’s PowerVR GPUs. For further tips and more insight, visit the Compute Development Recommendations section of our Developer Documentation site.

  1. Design for Parallelism

For maximum system performance, tasks need to run on the CPU and the GPU at the same time. Consider what can be expressed as a parallel task for execution on a GPU core, leaving the CPU free to handle other work.
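
As a minimal sketch of this model, the hypothetical OpenCL C kernel below (the name and arguments are illustrative, not from the documentation) assigns exactly one element to each GPU thread, leaving the CPU free in the meantime:

```c
// OpenCL C: each work-item handles exactly one element, so the
// whole array is processed in parallel across the GPU's threads.
__kernel void scale_vector(__global const float* input,
                           __global float*       output,
                           const float           factor)
{
    size_t gid = get_global_id(0);   // unique index of this work-item
    output[gid] = input[gid] * factor;
}
```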

  2. Understand the GPU Architecture

Each Unified Shading Cluster (USC) within an Imagination GPU can execute an entire work-group independently. Design your workloads to align with the capabilities of your target GPU to avoid underutilisation.
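
Those capabilities can be queried at run time. Here is a sketch in C using standard OpenCL device queries (error handling omitted for brevity):

```c
#include <stdio.h>
#include <CL/cl.h>

// Query the properties that determine how work-groups map onto the GPU.
void print_device_limits(cl_device_id device)
{
    cl_uint compute_units;        // number of compute units (USCs) exposed
    size_t  max_work_group_size;  // largest work-group the device accepts

    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_work_group_size), &max_work_group_size, NULL);

    printf("Compute units: %u, max work-group size: %zu\n",
           compute_units, max_work_group_size);
}
```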

  3. Minimise Divergence Within Work-Groups

Avoid branching logic that causes threads within a work-group to follow different execution paths. Divergence reduces SIMD efficiency.
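
As an illustration (a hypothetical kernel, not from the documentation), compare a data-dependent branch, which can serialise the two paths within a work-group, with a branchless select that keeps every thread on the same instruction stream:

```c
// OpenCL C: divergent version - threads in the same work-group may
// take different paths, serialising execution of the two branches.
__kernel void clamp_divergent(__global float* data, const float limit)
{
    size_t gid = get_global_id(0);
    if (data[gid] > limit)
        data[gid] = limit;
    else
        data[gid] = data[gid] * 0.5f;
}

// Branchless version - every thread executes the same instructions.
__kernel void clamp_uniform(__global float* data, const float limit)
{
    size_t gid = get_global_id(0);
    float  v   = data[gid];
    data[gid]  = select(v * 0.5f, limit, v > limit);
}
```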

  4. Optimise Work-Group Sizes

Choose work-group sizes that match the native thread grouping of the target PowerVR core. This ensures full occupancy and maximises parallel execution. The ideal sizes are 32 on Rogue GPUs and 128 on Volcanic GPUs.
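
On the host side this is expressed through the local work size argument. A sketch using the standard OpenCL API, assuming a valid queue and kernel and a Rogue-class target:

```c
#include <CL/cl.h>

// Enqueue a 1D kernel with an explicit work-group size.
// local_size should match the native grouping: 32 on Rogue, 128 on Volcanic.
void launch_1d(cl_command_queue queue, cl_kernel kernel,
               size_t global_size, size_t local_size)
{
    clEnqueueNDRangeKernel(queue, kernel,
                           1,             // one-dimensional NDRange
                           NULL,          // no global offset
                           &global_size,  // must be a multiple of local_size
                           &local_size,   // explicit work-group size
                           0, NULL, NULL);
}
```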

  5. Balance the Length of the Kernel

Very short kernels are inefficient because the set-up cost is disproportionate to the work done; very long kernels can themselves become a bottleneck. Finding the right balance for your application is key.
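
As a simple illustration (hypothetical kernels), two launches that each do very little work can often be fused into one kernel so the set-up cost is paid once:

```c
// OpenCL C: instead of launching scale() and then add_offset() as two
// separate kernels, fuse them so the launch overhead is paid only once.
__kernel void scale_and_offset(__global float* data,
                               const float     factor,
                               const float     offset)
{
    size_t gid = get_global_id(0);
    data[gid] = data[gid] * factor + offset;
}
```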

  6. Provide Enough Data to Keep the GPU Moving

Datasets larger than about 512 items per USC on the device typically provide enough work to maintain high utilisation and occupancy, with larger numbers of items increasing efficiency further.
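
A back-of-the-envelope check based on the rule of thumb above, using a standard OpenCL device query (illustrative only):

```c
#include <CL/cl.h>

// Estimate the smallest dataset that keeps the GPU well occupied,
// assuming roughly 512 work-items per USC (compute unit).
size_t min_items_for_occupancy(cl_device_id device)
{
    cl_uint usc_count;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(usc_count), &usc_count, NULL);
    return (size_t)usc_count * 512;  // e.g. 8 USCs -> 4096 items minimum
}
```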

  7. Avoid Excessive Global Memory Access

The system memory budget is limited and shared between all resources. The performance of many applications will be limited by this resource, making it a prime target for optimisation. Use caching strategies and minimise redundant reads and writes.
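
One common caching strategy is to stage data in on-chip local memory so that each global value is fetched only once per work-group. A hypothetical 1D smoothing kernel (boundary handling simplified for brevity):

```c
// OpenCL C: cache a tile of global memory in local memory, then
// compute a 3-point average from the fast on-chip copy.
// The tile argument is sized to the work-group via clSetKernelArg.
__kernel void smooth(__global const float* input,
                     __global float*       output,
                     __local  float*       tile)
{
    size_t gid  = get_global_id(0);
    size_t lid  = get_local_id(0);
    size_t last = get_local_size(0) - 1;

    tile[lid] = input[gid];                 // one global read per item
    barrier(CLK_LOCAL_MEM_FENCE);           // tile is now fully populated

    if (lid > 0 && lid < last)              // interior items only, for brevity
        output[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```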

  8. Group Memory Accesses Together

Improve efficiency by grouping memory accesses as closely together as possible, making them easier for the compiler to identify and combine. Generally, placing reads at the start of a kernel and writes at the end allows for the best efficiency.
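
A sketch of that shape (hypothetical kernel): all reads at the top, arithmetic in the middle, all writes at the end:

```c
// OpenCL C: reads grouped at the top, writes grouped at the bottom,
// with the arithmetic in between so accesses are easy to batch.
__kernel void axpy_pair(__global const float* x,
                        __global const float* y,
                        __global float*       out_sum,
                        __global float*       out_diff,
                        const float           a)
{
    size_t gid = get_global_id(0);

    float xv = x[gid];          // all reads first
    float yv = y[gid];

    float s = a * xv + yv;      // compute in the middle
    float d = a * xv - yv;

    out_sum[gid]  = s;          // all writes last
    out_diff[gid] = d;
}
```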

  9. Insert Barriers Carefully After Local Memory Access

Avoid placing a barrier immediately after an access to local or constant memory; doing so stops the compiler from rearranging instructions to hide the access latency. Where possible, schedule independent work between the access and the barrier.
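
A sketch of the recommended pattern (hypothetical kernel): independent arithmetic is scheduled between the local-memory write and the barrier, giving the compiler room to hide the access latency:

```c
// OpenCL C: do independent work between the local write and the
// barrier, rather than placing the barrier immediately after the access.
__kernel void staged(__global const float* input,
                     __global float*       output,
                     __local  float*       tile)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tile[lid] = input[gid];          // local memory write

    float t = (float)gid * 0.001f;   // independent work: does not touch tile,
    t = t * t + 1.0f;                // so it can overlap the write's latency

    barrier(CLK_LOCAL_MEM_FENCE);    // barrier comes after the filler work

    output[gid] = tile[lid ^ 1] + t; // safe to read a neighbour's value now
}
```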

  10. Target the Right API Features

Use API-specific optimisations:

  • OpenCL: memory objects that will be shared between the CPU and GPU should be created with the CL_MEM_ALLOC_HOST_PTR flag, as sketched after this list.
  • Vulkan: allocate memory using the appropriate usage flags; this approach needs explicit synchronisation, and be wary of duplicating data.
  • OpenGL ES Compute: buffer allocation is handled semi-opaquely by the driver, guided by usage hints supplied at allocation time; mapping (glMapBufferRange) is preferred over explicit uploads (glBufferSubData) when the data will change frequently.
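
As a sketch of the OpenCL point above, using standard OpenCL API calls (error handling omitted), a buffer created with CL_MEM_ALLOC_HOST_PTR can be mapped into the host’s address space so the CPU writes directly into GPU-visible memory instead of copying:

```c
#include <CL/cl.h>

// Create a shareable buffer and map it for CPU writes (no upload copy).
cl_mem create_shared_buffer(cl_context context, cl_command_queue queue,
                            size_t size_in_bytes)
{
    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                size_in_bytes, NULL, NULL);

    // Map the buffer so the CPU can write into the GPU-visible allocation.
    float* ptr = (float*)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                            0, size_in_bytes,
                                            0, NULL, NULL, NULL);

    // ... fill ptr with input data here ...

    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL); // return to GPU
    return buf;
}
```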

If you’re interested in running compute workloads on a GPU at the edge, take a look at Imagination’s latest E-Series architecture. This new design integrates an AI accelerator deep inside the GPU’s shaders for use in graphics, compute or AI workloads. To find out more, visit the Imagination website.