The Myth of Custom Accelerators: Embracing Flexibility in Edge AI

27 June 2024
Shreyas Derashri

Advanced compute technologies are now commonplace tools for boosting productivity and transforming our day-to-day experiences.

In automotive, for example, ADAS depends on vehicles having the ability to process a vast array of compute-intensive tasks, from the pre-processing of camera data right the way through to sensor fusion and path planning – all without affecting the vehicle’s mileage.

Recent innovations at the edge include Wayve’s LINGO-2, a foundation model that links vision, language, and action to explain and determine driving behaviour. This kind of solution is taking the automotive industry towards a future where AI in vehicles can offer advanced features like intuition, language-responsive interfaces, personalized driving styles, and co-piloting to enhance the automated driving experience.

Elsewhere at the edge, AI laptops offer a host of advantages, from boosting productivity with AI-enabled content creation tools to running co-pilots locally without the need to share user data with the cloud. These laptops will need more AI performance than any mobile PC that came before; Microsoft’s newly announced Copilot+ PCs use the GPT-4 model and require 40+ TOPS, as well as a thin form factor and all-day battery life.

Foundation models at the edge

AI has achieved this level of capability not because programmers have finally been able to successfully translate a human’s brain into code, but because researchers have successfully applied the massive level of accelerated computing available in the cloud to general purpose models, as discussed in Rich Sutton’s The Bitter Lesson.

Solutions built by tuning general purpose foundation models, such as GPT-4 named above, are emerging as the preferred approach to delivering AI everywhere. Rather than creating domain-specific algorithms, highly capable models that can be applied across multiple domains are using the resources of the cloud to train with large amounts of multi-modal data before being fine-tuned to fit specific applications and devices.

Foundation Models diagram

To fit at the edge, these tuned models need to run on smaller, far less powerful devices with stringent security standards, limited power supplies and unreliable internet connections. They need to deliver not just basic inference, but also on-device fine-tuning and lifetime learning. What is more, they need to share the SoC with critical day-to-day functions that maintain an optimal user experience such as user interface, image processing and audio processing.

Yet despite the differences in available performance, thermal management techniques and even business models, AI at the edge will benefit from considering the philosophy which made AI in the cloud a success: namely, the utilisation of general purpose methods across everything from accelerator hardware through to AI frameworks. This will allow for easy scaling as the amount of computation available at the edge increases with continued transistor scaling and new packaging technology.

This understanding underpins Imagination’s two-pronged approach to helping our customers succeed in AI at the Edge:

By developing software based on open standards
By increasing the hardware capabilities of general purpose compute accelerators

Developing software based on open standards

Imagination is taking a software-first approach to our delivery of AI at the edge in order to maximise the programmability and flexibility of our hardware. Enabling software and toolkits such as optimised libraries provides a mechanism to achieve maximum efficiency and tight control of scheduling and memory management. There is already a growing ecosystem of frameworks and libraries with OpenCL back-ends that accelerate time to market as well as provide an opportunity for higher level optimisation and integration as part of a heterogeneous compute system. It includes AI deployment environments as well as computer vision and other general purpose compute libraries.

Collaboration will be the key to success. Last year, Imagination joined other leading technology companies as a founding member of the UXL Foundation, an organisation touted as the open, cross-platform vendor-neutral rival to NVIDIA’s closed CUDA language. The Foundation is evolving the oneAPI programming model and the DPC++ SYCL implementation. By making the initiative a true open-source project under the Linux Foundation, the UXL Foundation is providing a catalyst for companies like Imagination to bring the benefits of the oneAPI standard, which has already seen widespread adoption in high performance computing, closer to the edge. This will play a significant role in addressing the challenges of rapid software development for compute applications and the reuse of applications across a portfolio of platforms.

Imagination is actively contributing to and influencing the oneAPI standard through the UXL Foundation as we develop and roll out our next generation of compute tools and software stacks for edge platforms. We are working closely with partners and customers to encourage wider participation in, and adoption of, the standard. We seek to empower all stakeholders in the developer journey with readily accessible toolkits for Imagination platforms that will provide fit-for-purpose “functional to performant to optimal” workflows typical of today’s edge compute application development cycles, while also leveraging the benefits of build and runtime target independence.

Increasing the capabilities of general purpose compute accelerators

The second part of Imagination’s approach to helping our customers succeed in AI at the edge involves injecting more compute performance into edge devices while maintaining hardware flexibility and programmability.

At present, compute acceleration at the edge typically takes place on one of the following processor types:

Central Processing Units (CPUs): the traditional control centre and workhorse of a SoC; CPUs are increasingly AI-capable with a level of parallelism (e.g. multi-core) and support for relevant data formats; they can offload to more specialised compute processors as needed.
Digital Signal Processors (DSPs): used in various markets including automotive and telecommunications for audio, video, camera, and connectivity processing and, more recently, supporting AI applications with vector processing.
Graphics Processing Units (GPUs): GPUs are, by their very nature, programmable and general purpose. While they were traditionally used just for graphics acceleration, in recent years their parallelism has been applied to compute applications such as super-resolution, point clouds and non-machine learning algorithms and they are increasingly adopting features to allow for low precision arithmetic.
Neural Processing Units (NPUs): highly optimised, domain specific accelerators focused on low precision arithmetic for efficient processing of dense matrix multiplication code commonly found in, for example, the training of deep learning algorithms.

The question for the future is: which of these processor types offers the best foundation for the next generation of AI accelerators at the Edge?

This is the sort of question that Imagination excels at. Our engineers solve technology’s complex problems by creating innovative solutions that empower our customers to succeed. We have over 13Bn chips shipped across four markets and a product range that spans GPU, RISC-V CPU and AI IP as well as Software.

Our engineering teams have extensive experience in designing semiconductor technology for compute and AI, starting with the NNA product line (optimised for CNN style workloads) which is currently shipping in numerous SoCs across the automotive and consumer markets, for example in the DAMO XuanTie TH1520.

Yet despite our customers’ many successes with the NNA, Imagination recognises that AI at the Edge will require the development of either a new generation of more flexible and programmable NPUs, or the development of a new generation of GPU-based accelerators that deliver higher levels of compute performance while maintaining energy efficiency. This aligns with the principle of relying on general rather than overly customised methods that made AI a success in the cloud, and will be made possible thanks to a couple of critical trends in the semiconductor market.

The myth of the custom accelerator

Firstly though, it is worth considering in more detail why more general purpose accelerators are preferable to highly optimised hardware.

The current approach for AI at the edge, particularly in performance focused devices such as cars and laptops, focuses on the NPU: a highly optimised processor that achieves high levels of efficiency in a small area or power budget. The NPU features larger matrix tile sizes when compared to traditional GPU tensor cores, fixed function hardware specifically for neural network acceleration, a focus on low precision number formats, graph compilation and optimisations for reduced data movement and enhanced locality.
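To see why larger tiles reduce data movement, the following minimal Python sketch counts operand loads for a square matrix multiplication under a deliberately simplified model. The function name `tiled_matmul_loads` and the assumption that every output tile streams its full strips of both inputs from memory exactly once are illustrative only; they do not describe any particular NPU or GPU.

```python
def tiled_matmul_loads(n: int, tile: int) -> int:
    """Count input elements loaded for C = A @ B with n x n matrices.

    Simplified model (an assumption, not a hardware description):
    each tile x tile block of C is computed by reading a tile x n strip
    of A and an n x tile strip of B once, with no reuse across blocks.
    """
    if n % tile != 0:
        raise ValueError("n must be a multiple of tile in this sketch")
    tiles_per_dim = n // tile
    output_tiles = tiles_per_dim * tiles_per_dim
    loads_per_output_tile = 2 * tile * n  # one strip of A plus one strip of B
    return output_tiles * loads_per_output_tile


if __name__ == "__main__":
    for tile in (8, 16, 32):
        print(f"tile={tile:2d}: {tiled_matmul_loads(256, tile):,} element loads")
```

Under this model the total traffic is 2n³ divided by the tile size, so quadrupling the tile edge cuts the number of loads by a factor of four while the arithmetic stays the same.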

Such features have driven significant uptake of NPUs to date. However, using an NPU requires a trade-off between optimal performance for a limited set of use cases, and general purpose application. Edge SoCs targeting AI-related KPIs currently include multiple different flavours of NPU, each targeted at a different flagship workload, but with the understanding that for much of the time the silicon will be dark; it cannot be easily applied to other tasks.

There are additional trade-offs on the software side. Given that there is no universally successful or canonical NPU model, there is no guarantee that one set of workloads will run on another vendor’s NPU at all, let alone optimally. Furthermore, the NPU programming model requires a tall, complex software stack to map the network onto the specific NPU hardware. Developers don’t have direct access to the hardware and are unable to optimise performance.

DSPs present somewhat similar challenges to NPUs on the software front. Because DSP toolchains are often proprietary, companies are left maintaining bespoke and increasingly complex software stacks on their own rather than taking advantage of industry standard toolchains.

The early hardware optimisation for specific use cases that an NPU requires carries an inherent danger, and the rapid evolution of AI models makes it more problematic still. In the space of just a couple of years, transformers have strongly challenged convolutional neural networks (CNNs) as the go-to AI model. There is a big difference between the two. CNNs have very high compute density; transformers have less. Transformers may have a higher total compute load, but their ops per byte are lower, which may make the extremely high compute density of an NPU of limited utility. Furthermore, CNNs typically operate on layers with one static tensor operand (weights) and one dynamic operand (activations). Transformers have this pattern too, but they also contain dynamic-dynamic matmuls, in which both operands are computed at run time. This makes it hard for an NPU to benefit from optimisations which arise from being able to treat one operand as static.
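The ops-per-byte gap can be made concrete with a rough calculation. The sketch below is a back-of-envelope model, not a measurement: the function names, the layer shapes and the assumption that every operand and result crosses the memory interface exactly once at one byte per element are all hypothetical choices made for illustration.

```python
def matmul_intensity(m: int, k: int, n: int, bytes_per_element: int = 1) -> float:
    """Ops per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumes (for illustration) that A, B and C each move through memory once.
    """
    ops = 2 * m * k * n  # one multiply and one add per accumulate step
    traffic = (m * k + k * n + m * n) * bytes_per_element
    return ops / traffic


def conv2d_intensity(height: int, width: int, c_in: int, c_out: int,
                     kernel: int = 3, bytes_per_element: int = 1) -> float:
    """Ops per byte for a stride-1, same-padded 2D convolution layer."""
    ops = 2 * height * width * c_in * c_out * kernel * kernel
    traffic = (height * width * c_in              # input activations
               + kernel * kernel * c_in * c_out   # static weights
               + height * width * c_out           # output activations
               ) * bytes_per_element
    return ops / traffic


if __name__ == "__main__":
    # Hypothetical CNN layer: 56x56 feature map, 64 -> 64 channels, 3x3 kernel.
    print("conv layer      :", round(conv2d_intensity(56, 56, 64, 64), 1))
    # Hypothetical transformer decode step: one token through a 4096x4096 weight.
    print("decode matmul   :", round(matmul_intensity(1, 4096, 4096), 2))
    # Dynamic-dynamic case: one query against 2048 cached keys of width 128.
    print("attention scores:", round(matmul_intensity(1, 128, 2048), 2))
```

With these example shapes the convolution performs hundreds of operations per byte moved, while the single-token matmuls stay below two, so they are limited by memory bandwidth rather than by the density of the multiply-accumulate array.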

It is worth noting that the automotive market, with its long product life cycle where vehicles are expected to be on the road fifteen years after design, has the particular challenge of running legacy CNN workloads efficiently while also future-proofing its platforms for what, in the era of the software-defined vehicle, may come next.

So, edge AI hardware must have the programmability and flexibility to tolerate evolving workload requirements. But flexibility alone is not enough to bring success to AI at the Edge – performance is also required.

Low precision number formats

The first key trend in semiconductor computing that will lift the compute performance of general purpose accelerators (such as GPUs) is the proliferation of lower precision number formats. These were historically the domain of the NPU but are becoming increasingly common in other accelerators such as the GPU. Organisations like the Open Compute Project are starting to drive standardisation of lower precision number formats, from FP32 right the way down to FP4 and Microscaling (MX) compliant formats, across CPUs, GPUs, NPUs and more. The expectation is that these number formats will spread from the data centre space throughout the software ecosystem.
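The idea behind block-scaled formats such as MX can be sketched in a few lines of Python: a block of values shares one power-of-two scale, and each value is stored as a narrow element relative to that scale. The code below is a simplified illustration and is not conformant with the OCP specification; the helper names, the use of signed 8-bit integer elements and the rounding rule are assumptions chosen to keep the example short.

```python
import math

BLOCK_SIZE = 32  # illustrative block size: elements sharing one scale


def quantize_block(values, element_bits=8):
    """Return (shared_exponent, integer_elements) for one block of floats."""
    q_max = 2 ** (element_bits - 1) - 1  # 127 for 8-bit signed elements
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return 0, [0] * len(values)
    # Smallest power-of-two scale that keeps the largest value within range.
    exponent = math.ceil(math.log2(largest / q_max))
    scale = 2.0 ** exponent
    # Round to the nearest integer step and clamp as a safety net.
    elements = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return exponent, elements


def dequantize_block(exponent, elements):
    """Reconstruct approximate floats from the shared exponent and elements."""
    scale = 2.0 ** exponent
    return [e * scale for e in elements]


if __name__ == "__main__":
    block = [math.sin(i) for i in range(BLOCK_SIZE)]
    exponent, elements = quantize_block(block)
    restored = dequantize_block(exponent, elements)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"shared exponent: {exponent}, worst-case error: {worst:.5f}")
```

In this sketch, 32 eight-bit elements plus one shared exponent (assumed here to be stored in eight bits) occupy 264 bits, against 1,024 bits for the same block in FP32, and the reconstruction error of each value is bounded by half the shared scale.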

The opportunity and challenge of advanced process nodes

Elsewhere, for many years the semiconductor industry has benefited from Moore’s Law: the generational uplift in performance that can be generated from the same area of silicon. Foundries such as Intel, Samsung and TSMC have been fundamental to extracting the benefits of this downward logic scaling. Advanced process nodes are one of the keys for general purpose accelerators to boost their compute performance to the level that AI at the edge requires.

SRAM, however, has proved difficult to shrink. With the increasing demands of performance, data locality and low latency from AI models there is actually increased demand for more SRAM in any given processor, especially a domain specific accelerator such as an NPU. The question for the future is, can we really afford to have this one very expensive resource dedicated to a single processor that is only active when its function is needed?

And at the same time as transistor density increases, thermal management becomes an even larger challenge than it is now. Highly optimised, power hungry accelerators exacerbate this challenge, creating workload-specific hotspots within the SoC that are troublesome to relieve.

However, if general purpose accelerators like CPUs and GPUs increase their compute capabilities while maintaining energy efficiency, then an edge SoC based on a limited number of energy efficient, general purpose, scalable accelerators is a promising solution to the thermal management challenges of advanced process nodes. The approach minimises dark silicon, gives system designers the opportunity to spread processing throughout the core rather than creating application-specific hotspots, and keeps integration, system and programming complexity under control.

Next generation technology for AI at the Edge

With these developments in mind, next generation processors based on GPU and RISC-V architectures are well positioned to deliver the high performance, energy-efficient, general purpose acceleration that AI at the Edge requires.

Imagination is a world leader in edge graphics and compute technology. Our GPUs revolutionised the smartphone market and we never stopped breaking new ground, such as producing the first architecture efficient enough for real-time ray tracing on mobile devices. As GPUs and RISC-V CPUs emerge as processors-of-choice for delivering AI at the Edge, our engineers are developing the solutions that our customers and the wider technology ecosystem will need to succeed.

Specific announcements will follow in the coming months. In the meantime, you can book a meeting with our Sales team to get early access to Imagination’s compute roadmap if:

You are a semiconductor company developing AI-capable SoCs
You are an OEM interested in the technologies that are set to transform the experiences you can offer customers
You are a software company developing AI-based applications