2.6.9. Shader programs

This section contains information about recommended practices for shader programs. It also introduces the concept of costs within programs and how your program structure affects execution speed.

Shader program general recommendations

For best performance when using shaders, observe the following recommendations:

Perform shading language compiler calls first

Ensure you make all calls to the shading language compiler during application startup, before you start to supply geometry or texture data to the driver.

The compiler uses main memory for its internal data. This memory space can be reused for data such as user application data, vertex attributes, textures, and polygon lists, after the compiler has finished.

Use custom shader programs

In general, aim to use a large number of shader programs tailored to the requirements of each surface, rather than fewer general purpose shader programs with optional features that are controlled by uniform values. Specialized shaders generally run faster.

Consider program size

You can use the stand-alone version of the Offline Shader Compiler to check the size of programs. You can also use the compiler to experiment with programs and to see how your changes affect the number of instructions.

Note

The sizes reported by the compiler relate to the native size of instruction words in the hardware. Each instruction word can contain a number of ESSL operations.

Looping and conditional branching

Do not unroll your loops manually. Instead, organize your data in arrays and process these with a for statement where possible. Also, use if statements when doing so is natural. If it is beneficial to do so, the compiler unfolds the if statement to execute both branches and select between the results.
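As an illustration only, the following sketch shows the natural form recommended above next to roughly what an unfolded if statement amounts to. It is written in C rather than ESSL so that it can be compiled and checked on a host machine. The function names, the data, and the arithmetic are hypothetical, and the code that the shader compiler actually generates may differ.

```c
/* Natural form: data in an array, a for loop, and a plain if statement.
   This is the form to write in the shader source. */
static float sum_natural(const float *samples, int count, float threshold)
{
    float sum = 0.0f;
    for (int i = 0; i < count; ++i) {
        if (samples[i] > threshold)
            sum += samples[i] * 2.0f;
        else
            sum += samples[i] + 1.0f;
    }
    return sum;
}

/* Roughly what an unfolded if amounts to: both branches are evaluated,
   then one of the two results is selected. Shown only to explain the
   transformation; it is not something to write by hand. */
static float sum_unfolded(const float *samples, int count, float threshold)
{
    float sum = 0.0f;
    for (int i = 0; i < count; ++i) {
        float if_true  = samples[i] * 2.0f;
        float if_false = samples[i] + 1.0f;
        sum += (samples[i] > threshold) ? if_true : if_false;
    }
    return sum;
}
```

Both functions return the same value for the same input. The first form is the one to keep in source code, leaving the choice of whether to unfold to the compiler.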

Avoid using too many varyings when using ESSL

When programming shaders in ESSL, economize on the number of varyings used in the fragment shader program. Varyings consume memory bandwidth when they are transferred between the geometry processor and memory, and between memory and the pixel processor.

Avoid using too many matrix multiplications

Multiplying a 4x4 matrix by a 4-component vector involves 16 multiplications and 12 additions, so it is an expensive operation.

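If you must multiply a vector by several matrices inside the shader, multiply them onto the vector one at a time, rather than multiplying the matrices together first. Multiplying two 4x4 matrices together involves 64 multiplications and 48 additions, which is four times the cost of one matrix-vector multiplication.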
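The difference between the two orderings can be checked by counting scalar multiplications. The following is a minimal sketch in C, not ESSL, that assumes row-major 4x4 matrices stored as flat arrays of 16 floats. The helper names and the counter are hypothetical and exist only for this illustration. It counts arithmetic operations, not the instruction words that the shader compiler produces.

```c
/* Running count of scalar multiplications, kept only for this illustration. */
static long mul_count = 0;

/* out = m * v, where m is a row-major 4x4 matrix and v has 4 components. */
static void mat4_mul_vec4(const float *m, const float *v, float *out)
{
    for (int r = 0; r < 4; ++r) {
        float sum = 0.0f;
        for (int c = 0; c < 4; ++c) {
            sum += m[r * 4 + c] * v[c];
            ++mul_count;
        }
        out[r] = sum;
    }
}

/* out = a * b, where a, b, and out are row-major 4x4 matrices. */
static void mat4_mul_mat4(const float *a, const float *b, float *out)
{
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += a[r * 4 + k] * b[k * 4 + c];
                ++mul_count;
            }
            out[r * 4 + c] = sum;
        }
    }
}

/* Computes a * (b * v) and returns the number of multiplications used. */
static long count_one_at_a_time(const float *a, const float *b,
                                const float *v, float *out)
{
    float tmp[4];
    mul_count = 0;
    mat4_mul_vec4(b, v, tmp);
    mat4_mul_vec4(a, tmp, out);
    return mul_count;
}

/* Computes (a * b) * v and returns the number of multiplications used. */
static long count_matrices_first(const float *a, const float *b,
                                 const float *v, float *out)
{
    float ab[16];
    mul_count = 0;
    mat4_mul_mat4(a, b, ab);
    mat4_mul_vec4(ab, v, out);
    return mul_count;
}
```

For two matrices, applying them one at a time uses 32 multiplications, while multiplying the matrices together first uses 80, and both orderings produce the same vector.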

Estimating program costs

The way you write your shader ultimately has some impact on the execution speed. Because packing individual operations into the instruction words is a complex combinatorial job, it is not possible to give simple numbers for the cost of any single programming construct. However, it is possible to define a relative cost for program constructs, where the following terms are used:

Free

Operation has little or no impact on program execution speed.

Low

Simple and fast operations that have low impact on execution speed.

Medium

These operations have an intermediate impact on execution speed, and are likely to cost between 2 and 5 times as much as a low-cost operation.

High

These operations have the highest impact on execution speed, and are likely to cost between 5 and 20 times as much as a low-cost operation.
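As a worked example of these multipliers, the following C sketch turns operation counts into a rough cost range, expressed in units of one low-cost operation. The structure and function names are hypothetical. Treating free operations as zero cost and simply adding the costs together are both simplifying assumptions, because the compiler packs several operations into each instruction word. Treat the result as a crude bound, not a prediction.

```c
/* Number of operations in each relative cost category. */
struct op_counts {
    unsigned free_ops;
    unsigned low_ops;
    unsigned medium_ops;
    unsigned high_ops;
};

/* Estimated cost range, in units of one low-cost operation. */
struct cost_range {
    unsigned min_cost;
    unsigned max_cost;
};

/* Applies the multipliers from the definitions above:
   Free = 0 (assumption), Low = 1, Medium = 2 to 5, High = 5 to 20.
   Costs are simply added, which ignores instruction word packing. */
static struct cost_range estimate_cost(struct op_counts n)
{
    struct cost_range r;
    r.min_cost = n.low_ops + 2u * n.medium_ops + 5u * n.high_ops;
    r.max_cost = n.low_ops + 5u * n.medium_ops + 20u * n.high_ops;
    return r;
}
```

For example, a program with 3 free, 4 low, 2 medium, and 1 high-cost operation gives a range of 13 to 34 low-cost units. The width of that range shows why measuring with the compiler is more reliable than estimating by hand.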

Table 2.1 defines the likely relative costs of various program constructs. Consider these costs when developing your applications.

Table 2.1. Relative costs of common shader program operations

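| Operation                    | Example           | Geometry processor | Pixel processor |
|------------------------------|-------------------|--------------------|-----------------|
| swizzle                      | .yx               | Free[a]            | Free[a]         |
| negative                     | -x                | Free               | Free[a]         |
| absolute                     | abs               | Low                | Free[a]         |
| clamp to [0,1]               | clamp(x,0.0,1.0)  | Low                | Free[a]         |
| other clamps                 | clamp(x,-1.0,1.0) | Low                | Low             |
| arithmetic operators         | +, -, *           | Low                | Low[a]          |
| minimum, maximum             | min, max          | Low                | Low[a]          |
| comparison                   |                   | Medium             | Low[a]          |
| access local variable        |                   | Free[a]            | Free[a]         |
| access uniform               |                   | Low[a]             | Low[a]          |
| access varying               |                   | Low[a]             | Medium[a]       |
| divide                       | a/b               | Medium             | Medium          |
| square root                  | sqrt              | Medium             | Medium          |
| reciprocal                   | 1/x               | Medium             | Medium          |
| exponential, logarithm       | exp, log          | Medium             | Medium          |
| trigonometric                | cos, sin          | High               | Medium          |
| power                        | pow               | High               | Medium          |
| array indexing               |                   | Medium             | Medium[a]       |
| vector indexed with variable |                   | High               | Medium          |
| conditional statements       | if, for           | Medium             | Low             |

[a] Operations that the corresponding processor can do on all four components of one vector in one sub-instruction.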


Note

Although Table 2.1 indicates the relative costs of various programming constructs, use the Offline Shader Compiler to obtain a more accurate idea of the likely cost of your programs.

Copyright © 2007-2009 ARM. All rights reserved. ARM DUI 0363D
Non-Confidential