9.6 Mali™ Bifrost and Valhall GPU specific optimizations
Arm recommends some Mali™ Bifrost and Valhall GPU specific optimizations.
Note: Only use these optimizations if you are specifically targeting a Mali Bifrost or Valhall GPU.
- Ensure that the threads all take the same branch direction in
if-statements and loops
- In Mali Bifrost and Valhall GPUs, adjacent threads are arranged together
in groups. The scalar instructions of the threads in a group are executed in
parallel, in lock-step, so the GPU operates on multiple data elements
simultaneously. If your shaders contain branches, such as if statements or
loops, adjacent threads can branch in different directions. The arithmetic unit
cannot execute both sides of a branch at the same time, so it executes the two
paths one after the other and the processing speed is reduced. To avoid this
performance reduction, try to ensure that adjacent threads all branch the same
way.
- The following table shows the number of adjacent threads for each Mali
Bifrost and Valhall GPU.
Table 9-1 Adjacent threads
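The effect of branch direction on groups of adjacent threads can be modeled on the host. The following is a minimal sketch in plain C, not Mali-specific code: the function name `count_divergent_groups` and the `group_size` parameter are hypothetical, and `group_size` must be set to the adjacent-thread count for the GPU you are targeting. It counts how many groups contain threads that branch in different directions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count the groups of adjacent threads in which a branch diverges.
 * taken[i] is the branch outcome for the thread with global ID i.
 * group_size is an assumption supplied by the caller: the number of
 * adjacent threads that execute in lock-step on the target GPU. */
size_t count_divergent_groups(const bool *taken, size_t n, size_t group_size)
{
    assert(group_size > 0);
    size_t divergent = 0;
    for (size_t start = 0; start < n; start += group_size) {
        /* The last group can be shorter than group_size. */
        size_t end = (start + group_size < n) ? start + group_size : n;
        bool first = taken[start];
        for (size_t i = start + 1; i < end; ++i) {
            if (taken[i] != first) {
                ++divergent; /* both branch directions occur in this group */
                break;
            }
        }
    }
    return divergent;
}
```

A condition that alternates between neighboring threads makes every group diverge, whereas a condition that changes only at group boundaries causes no divergence.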
- Avoid excessive register usage
- Every thread has 64 32-bit working registers. A 64-bit variable
uses two adjacent 32-bit registers for its 64-bit data.
- If a thread requires more than 64 registers, the compiler might
start storing register data in memory. This reduces performance and the
available bandwidth. This is especially bad if your shader is already load-store
bound.
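The register budget can be estimated with simple arithmetic. The following sketch in plain C is only a rough model: the helper names are hypothetical, and the real register allocation is decided by the compiler, which can reuse registers when values are no longer live. It counts two registers for every 64-bit value and compares the total against the 64 working registers per thread.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* 64 32-bit working registers are available to every thread. */
enum { WORKING_REGISTERS_PER_THREAD = 64 };

/* Rough estimate of the registers needed for values that are live at
 * the same time. Each 64-bit value occupies two adjacent 32-bit
 * registers. This is an illustrative model, not the compiler's
 * actual allocation algorithm. */
unsigned registers_needed(unsigned live_32bit_values, unsigned live_64bit_values)
{
    /* Guard against unsigned overflow in the estimate. */
    assert(live_64bit_values <= (UINT_MAX - live_32bit_values) / 2u);
    return live_32bit_values + 2u * live_64bit_values;
}

/* True if the estimate exceeds the per-thread register file, in which
 * case the compiler might start storing register data in memory. */
bool may_spill(unsigned live_32bit_values, unsigned live_64bit_values)
{
    return registers_needed(live_32bit_values, live_64bit_values)
           > WORKING_REGISTERS_PER_THREAD;
}
```

For example, 40 live 32-bit values and 13 live 64-bit values give an estimate of 66 registers, which is over the limit.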
- Vectorize 8-bit and 16-bit operations
- For 16-bit operations, use two-component vectors to get full performance. For basic arithmetic operations, the fp16 version is twice as fast as the fp32 version.
- For 8-bit types, such as char, use four-component vectors for best performance.
- Do not vectorize 32-bit operations
- Mali Bifrost and Valhall GPUs use scalars so you are not required to vectorize 32-bit operations. 32-bit scalar and vector arithmetic operations have the same performance.
- Use 128-bit load or store operations
- 128-bit load or store operations make more efficient use of
the internal buses.
- Load and store operations are faster if all threads in a quad
load from the same cache-line
- If all threads in a quad load from the same cache-line, the
arithmetic pipeline only sends one request to the load-store unit to load the
data.
- For example, this version is fast because consecutive threads
load consecutive 128-bit vectors from memory:
global float4 * input_array;
float4 v = input_array[get_global_id(0)];
This second version is slower, because the four threads in a quad
load data from different cache lines.
global float4 * input_array;
float4 v = input_array[4*get_global_id(0)];
Note: One cache line is 512 bits.
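The difference between the two access patterns can be checked with address arithmetic. The following is a host-side sketch in plain C, assuming that `input_array` starts on a cache-line boundary and that a quad holds the four threads with consecutive global IDs starting at a multiple of four. The function name `cache_lines_touched` and its `index_stride` parameter are hypothetical: a stride of 1 models `input_array[get_global_id(0)]` and a stride of 4 models `input_array[4*get_global_id(0)]`.

```c
#include <assert.h>
#include <stddef.h>

enum {
    CACHE_LINE_BYTES = 64, /* one cache line is 512 bits */
    FLOAT4_BYTES = 16,     /* one float4 is 128 bits */
    QUAD_SIZE = 4          /* threads in a quad */
};

/* Count the distinct cache lines touched when each thread of a quad
 * loads the float4 element at index index_stride * gid.
 * Assumes the array is aligned to a cache-line boundary. */
size_t cache_lines_touched(size_t first_gid, size_t index_stride)
{
    assert(first_gid % QUAD_SIZE == 0); /* quad starts at a multiple of 4 */
    size_t lines[QUAD_SIZE];
    size_t distinct = 0;
    for (size_t t = 0; t < QUAD_SIZE; ++t) {
        size_t byte_offset = (first_gid + t) * index_stride * FLOAT4_BYTES;
        size_t line = byte_offset / CACHE_LINE_BYTES;
        int seen = 0;
        for (size_t k = 0; k < distinct; ++k) {
            if (lines[k] == line) {
                seen = 1;
            }
        }
        if (!seen) {
            lines[distinct++] = line;
        }
    }
    return distinct;
}
```

With a stride of 1 the four 16-byte loads fill exactly one 64-byte cache line. With a stride of 4 each load starts a new cache line, so the quad touches four lines.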
- Use 32-bit arithmetic in place of 64-bit if possible
- 64-bit arithmetic operates at half the speed of 32-bit arithmetic.
- Use fine-grained shared virtual memory
- If your system supports it, using the shared virtual memory feature in OpenCL
2.0 provides cache-coherent memory. This reduces the requirement for manually
synchronizing memories and increases performance. See F.8 Shared virtual memory.
- Try to get a good balance of usage of the execution engines and load-store
unit
- If one unit is overused, this can limit the overall performance of the
application the GPU is executing. For example, if the load-store unit is
overused, try computing values rather than loading them. If the execution
engine is overused, try loading values instead of computing them.