Flow Control on PowerVR - Optimising Shaders

We’re back again with another excerpt from our new documentation website. Today, we’ll be looking at flow control and branching in shaders. This series has already covered a range of topics from mipmapping to balancing GPU workloads and we’re really glad you’ve been enjoying these little titbits so far. Judging by the traffic stats, you’re taking the time to have a read of the site afterwards, so that’s fantastic to see.

For those of you who are unaware, docs.imgtec.com is packed with plenty of information for graphics developers, both new and more experienced, including:

Performance recommendations for PowerVR GPUs
Tutorials on getting started with Vulkan and OpenGL ES
Explanations of graphics techniques like physically-based rendering
And much more…

Those of you who are developing for PowerVR hardware are obviously going to get the most out of this site but everybody is sure to find something to interest them, so why not take a look?

With that preamble out of the way, let’s take a quick look at another one of our PowerVR performance recommendations: flow control in shaders.

Introduction

So we’ll start with the good news: PowerVR hardware supports flow control in both vertex and fragment shaders by default, i.e. without having to explicitly enable any extensions.

That’s great, but what is flow control?

Well, flow control is simply controlling the execution path in a shader through branching or looping using statements like if, else, for, and so on. This can often lead to multiple branching paths within a shader, which are executed based on some kind of condition. Flow control is a very basic concept when programming for a CPU, but it’s slightly more complicated in shaders because of the highly parallelised nature of GPUs.

An example of this branching can be seen in the fragment shader of one of our PowerVR SDK examples, GaussianBlur.

mediump float imageCoordX = gl_FragCoord.x - 0.5;
mediump float windowWidth = config.x;
mediump float xPosition = imageCoordX / windowWidth;
 
mediump vec3 col = vec3(0.0);
 
if (xPosition < 0.5)
{
     col = texture(sOriginalTexture, vTexCoords[NumGaussianWeightsAndOffsets]).rgb;
}
else if (xPosition > 0.497 && xPosition < 0.503)
{
     col = vec3(1.0);
}
else
{
     col = Weights[0] * texture(sTexture, vTexCoords[0]).rgb +
           Weights[1] * texture(sTexture, vTexCoords[1]).rgb +
           Weights[2] * texture(sTexture, vTexCoords[2]).rgb +
           Weights[3] * texture(sTexture, vTexCoords[3]).rgb +
           Weights[4] * texture(sTexture, vTexCoords[4]).rgb +
           Weights[4] * texture(sTexture, vTexCoords[5]).rgb +
           Weights[3] * texture(sTexture, vTexCoords[6]).rgb +
           Weights[2] * texture(sTexture, vTexCoords[7]).rgb +
           Weights[1] * texture(sTexture, vTexCoords[8]).rgb +
           Weights[0] * texture(sTexture, vTexCoords[9]).rgb;
}
 
oColor = vec4(col, 1.0);

In the example above, the execution path depends on the value of xPosition, which measures the fragment’s horizontal position across the window. Fragments in the left half (xPosition < 0.5) sample the original texture, a narrow band at the centre is drawn white as a divider, and the remaining fragments on the right receive the Gaussian blur. The result of this branching can be seen clearly in the image of this example below.

Screenshot demonstrating how branching is used in a PowerVR SDK example, GaussianBlur

In general, when we’re talking about flow control in shaders, we’re usually referring to one of two things:

Static Flow Control

This is a case where a shader has two or more branching paths in code which are conditionally selected depending on the value of some uniform variable. Uniform variables are the same across all vertices/fragments, so the same shader path is executed across all vertices/fragments in a single draw call.

Static flow control is sometimes used to combine many smaller shaders into one large shader (an uber shader!). The shader that is going to be executed is then conditionally selected during runtime. However, often a better solution is just to use preprocessor directives to generate multiple shaders from the uber shader during compilation. This means you can create many shaders from a single source file.
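As a rough sketch of the preprocessor approach, the fragment shader below only compiles its normal-map code when a define is present. All of the names here (USE_NORMAL_MAP, sAlbedo, sNormalMap, uLightDir, vTexCoord, vNormal) are hypothetical and chosen purely for illustration. The assumption is that the application inserts the #define line after the #version line before compiling, producing one shader variant per combination of defines.

```glsl
#version 300 es
// Hypothetical uber shader. To build the normal-mapped variant, the
// application is assumed to insert "#define USE_NORMAL_MAP 1" right here,
// directly after the #version line, before compiling the source.
precision mediump float;

uniform sampler2D sAlbedo;
#ifdef USE_NORMAL_MAP
uniform sampler2D sNormalMap;   // assumed to store normals in the same space as uLightDir
#endif
uniform vec3 uLightDir;         // assumed normalised, pointing from surface to light

in vec2 vTexCoord;
in vec3 vNormal;

out vec4 oColor;

void main()
{
#ifdef USE_NORMAL_MAP
    // Variant with the define: unpack the normal from [0, 1] to [-1, 1].
    vec3 n = normalize(texture(sNormalMap, vTexCoord).rgb * 2.0 - 1.0);
#else
    // Variant without the define: use the interpolated vertex normal.
    vec3 n = normalize(vNormal);
#endif

    float diffuse = max(dot(n, uLightDir), 0.0);
    oColor = vec4(texture(sAlbedo, vTexCoord).rgb * diffuse, 1.0);
}
```

Each compiled variant contains no branch at all, at the cost of managing more shader objects in the application.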

Dynamic Flow Control

This one’s a bit trickier. Again, a shader with dynamic flow control has multiple branching execution paths, but this time the condition which controls the branching changes on a per-vertex or per-fragment basis, often based on texture data or vertex attributes. This means the shader could potentially have to execute different paths from one vertex or fragment to the next.

So why is this a problem? Well, a graphics core uses a single instruction, multiple data (SIMD) architecture, which means all processors in the core must execute the same instruction at the same time. If a graphics core is executing a group of shader invocations (for example, when a fragment shader is processing a set of fragments), then all of the invocations must follow the same path. If the condition differs within the group, the core has to step through every path that any of its invocations needs, while the invocations that did not take that path sit idle. This means that during branching the processors spend time on instructions whose results they don’t really need. Because the cost depends on how the condition varies within each group, the effect on performance is much less predictable than with static flow control.
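To make this concrete, here is a minimal sketch of a dynamic branch. The mask texture sMask and the varying vTexCoord are hypothetical names, and the work inside each path is arbitrary filler. The point is only that the condition comes from per-fragment data, so neighbouring fragments in the same group may disagree.

```glsl
#version 300 es
precision mediump float;

uniform sampler2D sMask;    // hypothetical per-pixel mask texture
in vec2 vTexCoord;
out vec4 oColor;

void main()
{
    // The condition is read from a texture, so it can differ between
    // fragments that are processed together in one group.
    float mask = texture(sMask, vTexCoord).r;

    vec3 col;
    if (mask > 0.5)
    {
        // Longer path: a few iterations of an arbitrary procedural pattern.
        float acc = 0.0;
        for (int i = 1; i <= 4; ++i)
        {
            acc += sin(vTexCoord.x * float(i) * 10.0) / float(i);
        }
        col = vec3(clamp(0.5 + 0.25 * acc, 0.0, 1.0));
    }
    else
    {
        // Shorter path: a flat colour.
        col = vec3(0.2);
    }

    oColor = vec4(col, 1.0);
}
```

Where the mask is uniform over a large area, each group only pays for one path. Along the edges of the mask, a group that contains both kinds of fragment pays for the long path and the short path together.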

Recommendations for PowerVR GPUs

So now we’ve covered a little about flow control, here are a few of our recommendations:

Avoid using discard in conditional branches

It is usually best to avoid branching to discard when developing for PowerVR devices. Using discard in the fragment shader negates some of the key benefits of PowerVR’s TBDR architecture.

This mainly affects hidden surface removal (HSR), as this operation assumes all of the fragments of an opaque object are going to be drawn, occluding anything behind them. If fragments can potentially be discarded in the fragment shader, the hardware can no longer assume this, meaning the GPU has to wait until the fragment shader has finished before determining which fragments are visible. This undermines the “deferred” part of tile-based deferred rendering (TBDR) and can reduce the performance of an application on PowerVR.

Our advice is to use alpha blending instead.
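A minimal sketch of what that swap might look like in a fragment shader is shown below. The names sTexture and vTexCoord are hypothetical, and the blend state mentioned in the comments is an assumption about how the application sets up the draw call.

```glsl
#version 300 es
precision mediump float;

uniform sampler2D sTexture;   // hypothetical RGBA texture with coverage in alpha
in vec2 vTexCoord;
out vec4 oColor;

void main()
{
    vec4 texel = texture(sTexture, vTexCoord);

    // Alpha-test version that the recommendation advises against:
    //     if (texel.a < 0.5) { discard; }

    // Alpha-blended version: write the alpha value and let the blend stage
    // combine it with the framebuffer. This assumes the application has
    // enabled blending for this draw call, for example glEnable(GL_BLEND)
    // with glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA).
    oColor = texel;
}
```

This is not an exact drop-in replacement. Blended edges are soft rather than hard, and blended geometry generally needs to be drawn after opaque geometry and in back-to-front order to composite correctly.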

Avoid sampling textures in conditional branches (PowerVR Series5 and Series5XT only)

When developing for PowerVR Series5 and Series5XT, avoid branching to a texture read, as using a sampler in a dynamic branch qualifies as a dependent texture read.

A dependent texture read occurs when the coordinates used to sample the texture depend on some calculation in the shader rather than on a varying. In a normal texture read, the hardware can fetch texture data before the fragment shader starts, reducing latency from sampling. In dependent texture reads, the texture coordinates can’t be predicted ahead of time, so texture data can’t be pre-fetched, leading to greater latency and stalls. This can have a really noticeable effect on performance.

From PowerVR Series6 onwards, dependent texture reads are much more efficient, so this isn’t as important for these architectures, but every little performance boost helps when you’re trying to optimise your application.
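One way to follow this advice is to hoist the texture read out of the branch and sample with the unmodified varying, as in the sketch below. The names sTexture, vTexCoord and vHighlight are hypothetical, with vHighlight standing in for whatever per-vertex value drives the condition.

```glsl
#version 300 es
precision mediump float;

uniform sampler2D sTexture;
in vec2 vTexCoord;
in float vHighlight;   // hypothetical per-vertex value driving the condition
out vec4 oColor;

void main()
{
    // Sampled unconditionally, with the varying used directly as the
    // coordinate, so the read does not sit inside a dynamic branch.
    vec3 texel = texture(sTexture, vTexCoord).rgb;

    vec3 col = vec3(0.0);
    if (vHighlight > 0.5)
    {
        // The branch now only selects a value that has already been fetched.
        col = texel;
    }

    oColor = vec4(col, 1.0);
}
```

The trade-off is that the texture is fetched for every fragment, including those that never use the result, so it is worth profiling both forms on the target device.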

Try to use branching to skip unnecessary operations

Finally, it is a good idea to use conditional branching to skip unnecessary operations. This will have the greatest impact on performance when there are a significant number of cases where the condition is met.
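As an illustrative sketch, the shader below skips its specular calculation for fragments that face away from the light. The lighting model, the constants and the names (uLightDir, vNormal, vViewDir) are assumptions made for the example rather than anything taken from the PowerVR SDK.

```glsl
#version 300 es
precision mediump float;

uniform vec3 uLightDir;   // assumed normalised, pointing from surface to light
in vec3 vNormal;
in vec3 vViewDir;         // hypothetical surface-to-camera vector
out vec4 oColor;

void main()
{
    vec3 n = normalize(vNormal);
    float nDotL = dot(n, uLightDir);

    // Ambient term, used on its own when the light cannot reach the surface.
    vec3 col = vec3(0.05);

    if (nDotL > 0.0)
    {
        // Only lit fragments pay for the normalisations and the pow().
        vec3 v = normalize(vViewDir);
        vec3 h = normalize(uLightDir + v);
        float spec = pow(max(dot(n, h), 0.0), 32.0);
        col += vec3(nDotL) + vec3(spec);
    }

    oColor = vec4(col, 1.0);
}
```

Because the unlit side of an object tends to cover a contiguous area of the screen, most groups of fragments should agree on the condition and skip the block together.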

Optimising shaders for OpenGL ES 3.0

If you’re using OpenGL ES 3.0 and want to optimise any branching in your shaders, it might be worth taking a look at the extension GL_EXT_shader_group_vote.

To illustrate how this extension works, consider some basic branching like this:

if (condition)
     result = do_fast_path();
else
     result = do_general_path();

As mentioned before, sets of shader invocations in a graphics core must all execute the same code path. In the example above, if the condition is true for a single invocation in that group then do_fast_path() will be called on that particular invocation. This leaves the rest of the invocations dormant while waiting for do_fast_path() to return. Once do_fast_path() returns a value, the rest of the invocations can call do_general_path().

This is a bit of a pain because the shader is wasting resources by executing both the fast and the general path. Instead, we can modify the above code using the new built-in functions from this extension:

if (allInvocationsEXT(condition))
     result = do_fast_path();
else
     result = do_general_path();

The function allInvocationsEXT() only returns true if the given condition is met across the entire set of invocations. This is really useful because it returns the same value to every invocation in the group, restricting the group to executing either do_fast_path() or do_general_path(), but not both.

GL_EXT_shader_group_vote also provides two other built-in functions which, like allInvocationsEXT(), return the same value across all invocations in the same group.

These are:

anyInvocationEXT(bool value) – This returns true if value is true for at least one of the invocations in the group.
allInvocationsEqualEXT(bool value) – This returns true if value is the same for all invocations in the group.
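Putting this together, a complete fragment shader using the extension might look something like the sketch below. The bodies of do_fast_path() and do_general_path(), the uniform uLightDir and the varying vNormal are all hypothetical. The #version line is also an assumption: it targets an OpenGL ES 3.1 context to be on the safe side, so check which shading language versions your driver exposes the extension for.

```glsl
#version 310 es
// Assumption: an ES 3.1 context. Use the #version that matches your context
// and for which the driver reports GL_EXT_shader_group_vote.
#extension GL_EXT_shader_group_vote : require
precision mediump float;

uniform vec3 uLightDir;   // assumed normalised, pointing from surface to light
in vec3 vNormal;
out vec4 oColor;

const vec3 kAmbient = vec3(0.05);

// Hypothetical fast path: only valid when the fragment is unlit.
vec3 do_fast_path()
{
    return kAmbient;
}

// Hypothetical general path: correct for lit and unlit fragments alike.
vec3 do_general_path(float nDotL)
{
    return kAmbient + vec3(max(nDotL, 0.0));
}

void main()
{
    float nDotL = dot(normalize(vNormal), uLightDir);
    bool unlit = (nDotL <= 0.0);

    vec3 result;
    if (allInvocationsEXT(unlit))
        result = do_fast_path();          // every invocation in the group is unlit
    else
        result = do_general_path(nDotL);  // at least one invocation is lit

    oColor = vec4(result, 1.0);
}
```

Note that the general path has to produce the right answer for every invocation, because invocations for which the condition is true still end up there whenever another invocation in their group votes false.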

And finally…

For more PowerVR performance recommendations, and other useful developer information, take a look at our regularly-updated website at docs.imgtec.com.

Do feel free to leave feedback through our usual forum.