At Imagination, we are working to accelerate LLMs on everyday devices. In the first blog of this new two-part series on LLM performance and acceleration, we introduce the key performance metrics: Time to First Token (TTFT) and Inter-Token Latency (ITL). In the next instalment, we will share the work we’re doing on bringing efficient Llama.cpp inference to Imagination GPUs.
If you’ve seen Google’s “AI Overview” or Word predicting your next word, that’s LLMs at work. They’re built on transformer networks, which use attention to focus on the most relevant parts of your input - similar to how you might watch a football match and instinctively follow the player with the ball rather than the other 21 players on the pitch. The amazing thing about LLMs is that by modelling probability, they capture something of human thought processes, giving them tremendous utility in diverse applications.
The challenge is that all this requires heavy computation. LLMs rely on large-scale matrix operations, which are demanding but highly parallel: in other words, perfect for GPUs.
Read "Getting Real About AI Processors" to find out why GPUs are perfect for highly parallel tasks.
That’s why GPUs, including those from Imagination based on the PowerVR architecture, play a key role in making these models fast and efficient, especially on mobile and edge devices where power and performance are critical.
Large Language Models (LLMs) generate text by taking a context window of previous tokens and predicting the next token in the sequence. When a prompt is first submitted, the model must process all tokens in the context window, which can be computationally intensive. Each new token generated by the model is appended to the previous tokens in the context window and fed back in as input: this is what makes the model autoregressive.
The naïve way to proceed with inference would be to process the prompt in its entirety, including the newly generated token, at each step. Inference would then become progressively slower with each new token generated, because every step repeats all of the work done in the steps before it.
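To see why, here is a minimal sketch of that naive loop in C++ (the `model_forward` and `sample` functions are hypothetical stand-ins, not llama.cpp's API): every step re-processes the whole sequence, so each new token costs more than the last.

```cpp
#include <cstdint>
#include <vector>

using Token = int32_t;

// Dummy stand-ins for a real transformer forward pass and sampler; in a real
// framework the cost of the forward pass grows with the number of tokens given.
std::vector<float> model_forward(const std::vector<Token>& tokens) {
    return std::vector<float>(32000, 0.0f);   // fake logits over a vocabulary
}
Token sample(const std::vector<float>& logits) { return 0; }

// Naive autoregressive generation: every step re-processes the whole sequence
// (prompt plus everything generated so far), so step t does work proportional
// to t and each new token takes longer to produce than the last.
std::vector<Token> generate_naive(std::vector<Token> tokens, int n_new) {
    for (int i = 0; i < n_new; ++i) {
        const auto logits = model_forward(tokens);  // recomputes ALL tokens
        tokens.push_back(sample(logits));           // append and go again
    }
    return tokens;
}
```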
To improve efficiency, frameworks often use KV caching, which stores intermediate results from previously processed tokens. This approach avoids redundant computation and significantly accelerates inference, making LLMs practical even on modest hardware. It also keeps execution time approximately constant as new tokens are generated.
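A minimal sketch of the idea, assuming a simplified per-layer cache layout (not llama.cpp's actual data structures): the keys and values computed for past tokens are stored once, so each subsequent step only computes and appends the projections for the single newest token.

```cpp
#include <cstddef>
#include <vector>

// A minimal, hypothetical per-layer KV cache: one row of keys and values per
// processed token. Real implementations (llama.cpp included) pre-allocate and
// often quantise this storage, but the principle is the same.
struct LayerKVCache {
    std::vector<std::vector<float>> keys;    // [n_tokens][n_heads * head_dim]
    std::vector<std::vector<float>> values;  // [n_tokens][n_heads * head_dim]

    // During decode, K and V are computed for the ONE new token and appended;
    // attention then reads the cached rows instead of recomputing them.
    void append(std::vector<float> k, std::vector<float> v) {
        keys.push_back(std::move(k));
        values.push_back(std::move(v));
    }

    std::size_t n_cached_tokens() const { return keys.size(); }
};
```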
Because of KV caching, LLMs typically operate in two distinct modes:

- Prefill: the entire prompt is processed in one pass to populate the KV cache and produce the first output token.
- Decode (generation): output tokens are produced one at a time, with each step processing only the newest token and reusing the cached keys and values.
These modes differ both in how the user experiences them and in how they stress hardware resources, so each should be measured with its own performance metric.
So, when talking about the deployment performance of LLMs, there are two metrics commonly referred to: Time to First Token (TTFT) and Inter-Token Latency (ITL).
The TTFT metric is the time it takes for the LLM to generate the first token of output, by which point it must have processed the entire user input prompt (i.e. completed the prefill stage).
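In practice, TTFT is measured with a wall-clock timer between submitting the prompt and receiving the first streamed token. A minimal sketch, assuming a hypothetical streaming `generate_tokens` callback (a real deployment would hook into llama.cpp or a similar framework):

```cpp
#include <chrono>
#include <cstdio>
#include <functional>
#include <initializer_list>
#include <string>

// Hypothetical stand-in for a streaming LLM API: invokes on_token once per
// generated token. A real deployment would call into llama.cpp or similar.
void generate_tokens(const std::string& prompt,
                     const std::function<void(const std::string&)>& on_token) {
    (void)prompt;
    for (const char* t : {"Hello", ",", " world"}) on_token(t);  // dummy output
}

int main() {
    using clock = std::chrono::steady_clock;
    const auto t_start = clock::now();
    double ttft_ms = -1.0;

    generate_tokens("Building a website can be done in 10 simple steps:",
                    [&](const std::string& /*token*/) {
        if (ttft_ms < 0.0) {
            // TTFT: time from prompt submission to the first streamed token,
            // i.e. the duration of the whole prefill stage.
            ttft_ms = std::chrono::duration<double, std::milli>(clock::now() - t_start).count();
        }
    });

    std::printf("TTFT: %.2f ms\n", ttft_ms);
    return 0;
}
```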
Time to First Token matters most when LLMs are deployed in automotive scenarios or other user-facing applications: humans are used to being “heard” at the speed they talk, and if a digital assistant or application does not respond within the time another human would, the user experience starts to deteriorate.
It would also be quite frustrating (by today’s standards anyway; some of us had to wait for a computer game to load off a tape) if you typed a question into Google and it took tens of seconds for an answer to pop up.
So, both in data centre implementations and at the edge, the challenge for GPU vendors is to provide a rapid first response to the user, even in power-constrained or low-connectivity scenarios.
To put some numbers to this, before an LLM (in this example the Llama-3.2-3B model) can generate the first token for an input query, it must work through a significant number of matrix-matrix multiplications of the sizes shown below. The N parameter is the number of user input tokens: in this case 13, for the user prompt “Building a website can be done in 10 simple steps:”.
| M    | K    | N  |
|------|------|----|
| 1024 | 3072 | 13 |
| 128  | 32   | 13 |
| 3072 | 3072 | 13 |
| 3072 | 8192 | 13 |
| 32   | 128  | 13 |
| 8192 | 3072 | 13 |

Table 1 - Typical GEMM M, K and N dimensions for the Llama-3.2-3B model.
The matrix-matrix multiplication operation in llama.cpp computes:

Cᵀ = A × Bᵀ

This means that C (N×M) is calculated from A (M×K) and B (N×K): B is transposed to K×N, the product A × Bᵀ is M×N, and that result is stored transposed as C. (Transposing a matrix swaps its values over the diagonal compared to the original matrix.)
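As a reference for what this convention means in code, here is a minimal, unoptimised triple-loop sketch (not the kernel llama.cpp actually dispatches): every element of the N×M output is an independent dot product of length K, which is exactly the kind of work that maps well onto a GPU.

```cpp
#include <cstddef>
#include <vector>

// Reference GEMM in the convention described above:
//   A is M x K (row-major), B is N x K (row-major), output C is N x M,
//   so that C^T = A * B^T. Each C[n][m] is a dot product of length K.
void gemm_ref(const std::vector<float>& A,   // size M*K
              const std::vector<float>& B,   // size N*K
              std::vector<float>& C,         // size N*M
              std::size_t M, std::size_t K, std::size_t N) {
    for (std::size_t n = 0; n < N; ++n) {
        for (std::size_t m = 0; m < M; ++m) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[n * K + k];  // both operands read along K
            }
            C[n * M + m] = acc;  // stored as N x M, i.e. the transpose of A*B^T
        }
    }
}
```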
The prefill stage requires matrix-matrix multiplications with some rather large dimensions, which is where the PowerVR GPU can speed things up! The many dot products within these multiplications are independent and require minimal interaction, aligning nicely with the massively parallel nature of a GPU’s SIMT architecture.
Many matrix-matrix multiplications of the sizes above must be performed before the model can generate the first output token, and the time it takes to process them is directly related to how long the user must wait for the model to start generating output.
The second metric when measuring LLM performance is the “Inter-Token Latency”, which is exactly what it says on the tin: the time it takes the model to generate a single new token, or equivalently the time between tokens as the output is produced one token at a time.
This process differs slightly from evaluating the user prompt in that it involves matrix-vector multiplications rather than matrix-matrix multiplications, and the compute intensity of this phase is much reduced thanks to the KV caching technique discussed in the previous section.
The mathematical operation for the generative (or decode) stage is a series of matrix-vector multiplications, where N (of M, K, N) is always equal to 1 and the last token generated is the single vector input to the next set of matrix-vector multiplications for each layer.
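With N = 1 the operation collapses to a matrix-vector multiplication: the input is a single vector of length K derived from the last generated token, and the output is a vector of length M. A minimal reference sketch, following the same layout assumptions as the GEMM example above:

```cpp
#include <cstddef>
#include <vector>

// Reference GEMV for the decode stage: the N = 1 case of the GEMM above.
//   A is M x K (the weight matrix, row-major), x has length K (derived from
//   the last generated token), and the output y has length M.
void gemv_ref(const std::vector<float>& A,   // size M*K
              const std::vector<float>& x,   // size K
              std::vector<float>& y,         // size M
              std::size_t M, std::size_t K) {
    for (std::size_t m = 0; m < M; ++m) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < K; ++k) {
            acc += A[m * K + k] * x[k];   // every weight is read exactly once
        }
        y[m] = acc;
    }
}
```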
Matrix-vector multiplications can also be accelerated on a GPU, but because they are less compute-intensive they are often throttled by memory bandwidth when offloaded, which is why the decode stage of an LLM is sometimes run on a CPU, where memory bandwidth may not be so constrained. It is generally considered harder for a GPU to add value in this decode phase; however, being able to offload the decode stage to the GPU can still be useful when the main CPU of a deployment SoC is heavily loaded.
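A rough back-of-the-envelope calculation illustrates why: taking the largest GEMV from Table 2 below (M = 8192, K = 3072) and assuming F16 weights (our assumption, for illustration), every weight byte fetched is used for only about one floating-point operation, so the kernel is bound by memory bandwidth rather than by compute throughput.

```cpp
#include <cstdio>

int main() {
    // Largest GEMV from Table 2, assuming F16 (2-byte) weights.
    const double M = 8192, K = 3072;

    const double flops        = 2.0 * M * K;   // one multiply + one add per weight
    const double weight_bytes = 2.0 * M * K;   // each F16 weight read exactly once
    const double io_bytes     = 2.0 * (K + M); // input/output vectors (negligible)

    std::printf("FLOPs: %.1f M, bytes moved: %.1f MB, intensity: %.2f FLOP/byte\n",
                flops / 1e6, (weight_bytes + io_bytes) / 1e6,
                flops / (weight_bytes + io_bytes));
    // Prints roughly 1 FLOP per byte: far too low to keep a GPU's ALUs busy,
    // so performance is limited by memory bandwidth.
    return 0;
}
```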
| M    | K    | N |
|------|------|---|
| 128  | 32   | 1 |
| 1024 | 3072 | 1 |
| 3072 | 3072 | 1 |
| 3072 | 8192 | 1 |
| 32   | 128  | 1 |
| 8192 | 3072 | 1 |

Table 2 - Typical GEMV calculations for the Llama-3.2-3B model.
So here ends part one of our two-part blog delving into the world of accelerating LLM inference on edge devices with PowerVR GPUs. We have introduced the concepts of Time to First Token and Inter-Token Latency, and how they apply to the two main stages of LLM compute.
In part two we will look at the code modifications Imagination has made to the Llama.cpp application to support the PowerVR GPU architecture, via both Vulkan and the default OpenCL implementation. We will finish with an analysis of our own optimised OpenCL kernels, specifically tailored to unlock high utilisation on PowerVR GPUs for both matrix-matrix and matrix-vector multiplication operations when using F16 and quantised weight formats.