- 11 February 2016
In my previous post I gave a brief introduction to our target platform (Cocos2d-x on PowerVR GPUs) and a set of profiling rules and tools. In this post I will demonstrate how to work with PVRTune to identify performance bottlenecks.
The first step is to build the game by following the instructions in the FantasyWarrior3D README.md file. After that we can use PVRHub and PVRTune to record the performance analysis file; I’ve uploaded my recording files here. These files were recorded in the following environment:
- Hardware information
- Device name: Onda V989 tablet (Allwinner A80 chip, PowerVR Series6 G6230 GPU)
- Software information
- Android version: 4.4.2
- Driver info: Version Rogue_DDK_Android_RSCompute rogueddk 1.4@3234138 (release) sunxi_android
- PVRTuneDeveloper: v14.111.1 (SDK build 3.5@3530647)
- PVRPerfServerDeveloper: v14.111.1 (SDK build 3.5@3533642)
- PVRTrace recording libraries: v20 (SDK build 3.5@3533642)
Identifying performance bottlenecks
We can use PVRTune to identify bottlenecks in this game. I picked a representative frame (2338) in the recording file (*.pvrtune), and the sections below check that frame against each kind of bottleneck. Bottlenecks usually fall into one of five categories:
- CPU limited
- Vertex limited
- V-Sync limited
- Fragment limited
- Bandwidth limited
CPU limited
A CPU limited application typically shows poor performance or a low frame rate even though graphics core usage is not high. In PVRTune this is easy to spot, since CPU limited applications have a CPU load at or near 100% (a).
Other identifying factors include gaps in the shader load, caused by the PowerVR hardware going to sleep while waiting for CPU operations to complete (b) or the GPU waiting for the next vsync interval.
For this game we captured the following data:
We can see that the CPU load sits at just 12.0%, but there are many large gaps in the Tiler and Renderer timing blocks, so the PowerVR hardware is sleeping while it waits for instructions from the render thread (10612). A low overall CPU load does not rule out a CPU limited case, because here the work is serialized on a single thread. The biggest problem in this game is that the Cocos2d-x engine does not have a separate thread for rendering: in every frame, the render task must wait for the game logic to finish. You can see the resulting large gaps between graphics API calls for every frame in the third row of timing blocks. This means we are in a CPU limited scenario.
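The "gaps" in the Tiler and Renderer timing blocks can also be expressed as a number. The following is a minimal sketch, not a PVRTune feature: it assumes the task start and end times for one frame have been copied by hand from the timeline into a list, and the interval values shown are hypothetical.

```python
def idle_gaps(tasks, frame_start, frame_end):
    """Return the idle (start, end) gaps between tasks inside one frame."""
    gaps = []
    cursor = frame_start
    for start, end in sorted(tasks):
        # Clip each task to the frame boundaries.
        start = max(start, frame_start)
        end = min(end, frame_end)
        if start >= end:
            continue  # task lies entirely outside the frame
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < frame_end:
        gaps.append((cursor, frame_end))
    return gaps


def idle_fraction(tasks, frame_start, frame_end):
    """Fraction of the frame during which no task was running."""
    idle = sum(end - start for start, end in idle_gaps(tasks, frame_start, frame_end))
    return idle / (frame_end - frame_start)


# Hypothetical Renderer tasks (milliseconds) inside a 34 ms frame.
renderer_tasks = [(2.0, 6.0), (15.0, 18.0), (30.0, 33.0)]
print(idle_gaps(renderer_tasks, 0.0, 34.0))
print(idle_fraction(renderer_tasks, 0.0, 34.0))
```

A high idle fraction on both the Tiler and the Renderer, as in this game, points at the GPU being starved of work rather than being overloaded.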
Vertex limited
Vertex limited applications are applications where the bottleneck comes from processing a large number of vertices per frame, from a complex vertex shader, or both. This can be identified by large gaps between Renderer tasks (a) while there is little or no gap between Tiler tasks (b).
Further information can be gained from the Processing load: Vertex and Tiler active counters. If Tiler active is high (c) but Processing load: Vertex is not, then the scene has too many vertices in it and the cost is coming from the tiling process. On the other hand, if Processing load: Vertex is high (d) but Tiler active is not, then the bottleneck is likely to be in the vertex shader.
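The two-counter rule above can be written down as a small decision function. This is only a sketch: the function name and the 70% cut-off for "high" are my own assumptions, since PVRTune does not define a fixed threshold.

```python
def vertex_bottleneck(tiler_active_pct, vertex_load_pct, high=70.0):
    """Apply the Tiler active / Processing load: Vertex rule.

    `high` is a hypothetical cut-off for what counts as a high counter value.
    """
    tiler_high = tiler_active_pct >= high
    vertex_high = vertex_load_pct >= high
    if tiler_high and not vertex_high:
        return "too many vertices (tiling cost)"
    if vertex_high and not tiler_high:
        return "vertex shader cost"
    if tiler_high and vertex_high:
        return "both vertex count and vertex shader"
    return "not vertex limited by these counters"


# Frame averages reported for Fantasy Warrior 3D in this post.
print(vertex_bottleneck(tiler_active_pct=10.0, vertex_load_pct=1.6))
```

The counters only refine the diagnosis; the timeline shape (large Renderer gaps, little or no Tiler gap) still has to match first.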
Here is the data we got for Fantasy Warrior 3D:
There are many gaps in both the Renderer tasks and the Tiler tasks. The average Processing load: Vertex is 1.6% with a peak of 14.4%, and the average Tiler active is 10%. With both frame averages this low, Fantasy Warrior 3D is not vertex limited, although the vertex shader could still be optimized with PVRShaderEditor.
V-Sync limited
Vertical synchronization (V-sync) is a display option that forces an application to synchronize graphical updates with the update rate of the screen. This causes some frames to be slightly delayed and enforces a maximum refresh rate, but reduces screen tearing and can save power. V-sync limited applications are often characterized by intermittent gaps between frames in the graph view, and the frame rate appears to be capped at a set maximum value. If possible, V-sync should be disabled when profiling an application, as it adds noise to the PVRTune output. That noise makes it harder to see where optimization work could be beneficial and whether a completed optimization has been successful.
If we analyze the data for Fantasy Warrior 3D, we see that the gaps between frames are very stable (1-2 ms). In addition, the frame rate is 29.3 FPS for this frame. Since the frame rate from frame to frame does not appear to be capped at a set maximum value, the game is not V-sync limited.
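One way to make "capped at a set maximum value" concrete is to look at per-frame FPS over many frames and check whether nearly all of them sit at the same ceiling. The sketch below is an assumption-laden illustration: the function name, the tolerance, the 90% fraction, and both sample lists are hypothetical (only the 29.3 value comes from the recording).

```python
def looks_vsync_capped(fps_samples, tolerance=0.5, min_fraction=0.9):
    """Return True if most frames sit within `tolerance` FPS of the observed maximum."""
    cap = max(fps_samples)
    near_cap = sum(1 for fps in fps_samples if cap - fps <= tolerance)
    return near_cap / len(fps_samples) >= min_fraction


# Hypothetical capture pinned at a display cap.
capped = [60.0, 59.8, 60.0, 59.9, 60.0, 59.7, 60.0, 60.0, 59.9, 60.0]
# Hypothetical capture whose frame rate drifts from frame to frame.
uncapped = [29.3, 33.1, 26.8, 31.5, 24.9, 35.2, 28.0, 30.4, 27.1, 32.6]

print(looks_vsync_capped(capped))
print(looks_vsync_capped(uncapped))
```

A check like this is only a first filter; the intermittent gaps between frames in the graph view remain the primary evidence.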
Fragment limited
Fragment limited applications are very common and occur in most scenes that have fewer vertices than the number of pixels in the framebuffer. They can be identified by the absence of gaps between Renderer tasks (a), large gaps between Tiler tasks (b), or a high Processing load: Pixel value (c).
For this game we captured the following data:
The Processing load: Pixel is 46.3% and there are always large gaps between Renderer tasks, so Fantasy Warrior 3D is not fragment limited.
Bandwidth limited
Bandwidth limited applications are hard to both visualize and identify, as they may look like other bottlenecks. A program may be bandwidth limited if:
- Timeline shows the application to be fragment limited but the Processing load: Pixel is low.
- Timeline shows the application to be vertex limited but the Processing load: Vertex and Tiler active are low.
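The two rules above combine a timeline shape with counter values, which can be sketched as follows. The idle fractions are assumed to be measured per frame from the Renderer and Tiler timing blocks, and every threshold (`small`, `large`, `low`) and function name is a hypothetical choice of mine, not something PVRTune defines.

```python
def timeline_shape(renderer_idle, tiler_idle, small=0.1, large=0.5):
    """Classify the timeline from the idle fractions of Renderer and Tiler tasks."""
    if renderer_idle <= small and tiler_idle >= large:
        return "fragment"  # Renderer always busy, Tiler waiting
    if tiler_idle <= small and renderer_idle >= large:
        return "vertex"    # Tiler always busy, Renderer waiting
    return "neither"


def maybe_bandwidth_limited(renderer_idle, tiler_idle, pixel_load_pct,
                            vertex_load_pct, tiler_active_pct, low=30.0):
    """Flag frames whose timeline shape is not explained by the load counters."""
    shape = timeline_shape(renderer_idle, tiler_idle)
    if shape == "fragment":
        return pixel_load_pct < low
    if shape == "vertex":
        return vertex_load_pct < low and tiler_active_pct < low
    return False


# Hypothetical frame: timeline looks fragment limited, but pixel load is low.
print(maybe_bandwidth_limited(0.05, 0.7, pixel_load_pct=15.0,
                              vertex_load_pct=5.0, tiler_active_pct=10.0))
# Counters from this post with hypothetical idle fractions (large gaps on both).
print(maybe_bandwidth_limited(0.6, 0.8, pixel_load_pct=46.3,
                              vertex_load_pct=1.6, tiler_active_pct=10.0))
```

A positive result is only a hint to investigate bandwidth further, since other causes can produce the same pattern.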
Other instances of bandwidth limitation may occur. For example, bandwidth in System-on-Chip (SoC) devices is shared among all components of the chip. Non-graphics areas of the chip (the CPU, for example) that use large amounts of bandwidth may still cause application graphics to be bandwidth limited. This is platform specific and, as such, there is no counter to record it. As a rule of thumb, always reduce bandwidth use where possible through correct use of texture compression, mesh optimization, and avoiding unnecessary texture reads.
As concluded in the Vertex limited and Fragment limited sections of this article, the timeline shows this game to be neither vertex limited nor fragment limited. Neither of the two conditions above applies, so this game is not bandwidth limited.
This game is a typical CPU limited case. As discussed in the CPU limited section of this article, moving the OpenGL ES call submission to a dedicated CPU thread will keep the GPU busy and should improve the framerate on many devices. In the next post, I will discuss how the advanced features of PVRTune can be used to isolate the specific causes of performance bottlenecks in Fantasy Warrior 3D.
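To illustrate the idea of a dedicated submission thread, here is a minimal producer/consumer sketch. It is a language-neutral illustration written in Python, not Cocos2d-x code: the command strings stand in for OpenGL ES calls, all names are hypothetical, and a real implementation would live in the engine's C++ renderer and would need the render thread to own the GL context.

```python
import queue
import threading


def run_frames(frame_count):
    """Run game logic on the calling thread and submit draw commands on a render thread."""
    # Bounded queue: game logic can get at most two frames ahead of rendering.
    commands = queue.Queue(maxsize=2)
    submitted = []

    def render_thread():
        while True:
            frame = commands.get()
            if frame is None:  # sentinel: no more frames
                break
            for call in frame:
                submitted.append(call)  # stand-in for issuing an OpenGL ES call

    worker = threading.Thread(target=render_thread)
    worker.start()
    for n in range(frame_count):
        # Game logic for frame n would run here and build its command list.
        commands.put([f"frame{n}:clear", f"frame{n}:draw_scene"])
    commands.put(None)
    worker.join()
    return submitted


print(run_frames(3))
```

Because the logic thread only builds command lists, it can start frame n+1 while the render thread is still submitting frame n, which is what closes the gaps between graphics API calls.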
Please let us know if you have any feedback on the materials published on the blog and leave a comment on what you’d like to see next. Make sure you also follow us on Twitter (@ImaginationTech) for more news and announcements from Imagination.