9.6 Mali™ Bifrost and Valhall GPU specific optimizations

Arm recommends some Mali™ Bifrost and Valhall GPU specific optimizations.

Note:

Only use these optimizations if you are specifically targeting a Mali Bifrost or Valhall GPU.
Ensure that the threads all take the same branch direction in if-statements and loops
In Mali Bifrost and Valhall GPUs, adjacent threads are grouped together and execute in lock-step. Each scalar instruction is executed in parallel across the threads of a group, so the GPU operates on multiple data elements simultaneously. If your shaders contain branches, such as if statements or loops, adjacent threads can branch in different directions. The arithmetic unit cannot execute both sides of a branch at the same time, so the two paths are executed one after the other and the processing speed is reduced. To avoid this performance reduction, try to ensure that adjacent threads all branch the same way.
The following table shows the number of adjacent threads for each Mali Bifrost and Valhall GPU.

Table 9-1 Adjacent threads

GPU        Adjacent threads
Mali‑G71   4
Mali‑G72   4
Mali‑G51   4
Mali‑G31   4
Mali‑G52   8
Mali‑G76   8
Mali‑G77   16
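The effect of the group width can be explored on the host before writing the kernel. The following is a minimal sketch in plain C, not a Mali or OpenCL API: the function name count_divergent_groups and its parameters are hypothetical. It takes the branch decision of each thread and counts how many groups of adjacent threads contain threads that disagree.

```c
#include <stddef.h>

/* Hypothetical host-side model, not a Mali API.
 * taken[i] is nonzero if thread i takes the branch.
 * Returns the number of groups of `width` adjacent threads in which
 * the threads disagree on the branch direction. `width` must be > 0. */
size_t count_divergent_groups(const int *taken, size_t n, size_t width)
{
    size_t divergent = 0;
    for (size_t base = 0; base < n; base += width) {
        size_t end = base + width < n ? base + width : n;
        int first = taken[base] != 0;
        for (size_t i = base + 1; i < end; ++i) {
            if ((taken[i] != 0) != first) {
                ++divergent;   /* at least two threads go different ways */
                break;
            }
        }
    }
    return divergent;
}
```

For example, a condition that alternates with every thread ID diverges in every group, whereas a condition that changes only every 8 threads is uniform within groups of 4 or 8 threads but diverges within a group of 16.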
Avoid excessive register usage
Every thread has 64 32-bit working registers. A 64-bit variable uses two adjacent 32-bit registers for its 64-bit data.
If a thread requires more than 64 registers, the compiler might start spilling register data to memory. This reduces performance and consumes memory bandwidth, which is especially costly if your shader is already load-store bound.
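The register budget can be estimated roughly by hand. The following sketch in plain C applies only the two rules stated above: 64 working registers per thread, and two registers per 64-bit value. The helper names are hypothetical, and the estimate is an assumption-laden approximation because the actual allocation is decided by the compiler.

```c
/* Per-thread working register budget stated in the text above. */
#define MALI_WORKING_REGISTERS 64u

/* Hypothetical helper: 32-bit registers needed for the values that are
 * live at the same time, counting each 64-bit value as two registers. */
unsigned registers_needed(unsigned live_32bit, unsigned live_64bit)
{
    return live_32bit + 2u * live_64bit;
}

/* Nonzero if the estimate exceeds the budget, so spilling might occur. */
int may_spill(unsigned live_32bit, unsigned live_64bit)
{
    return registers_needed(live_32bit, live_64bit) > MALI_WORKING_REGISTERS;
}
```

With 40 live 32-bit values, 12 live 64-bit values fit exactly in the budget, and a thirteenth exceeds it.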
Vectorize 8-bit and 16-bit operations
For 16-bit operations, use two-component vectors to get full performance. For basic arithmetic operations, the fp16 version is twice as fast as the fp32 version.
For 8-bit types, such as char, use four-component vectors for best performance.
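One way to remember the two vector widths is that each of them fills 32 bits. This reading is an inference from the widths given above, not something the text states. The following sketch in plain C uses a hypothetical helper name and is not part of any Mali or OpenCL API.

```c
/* Hypothetical helper, not part of any Mali or OpenCL API.
 * Returns how many elements of the given width fill 32 bits,
 * or 0 if the width is not 8, 16, or 32 bits. */
unsigned components_per_32_bits(unsigned element_bits)
{
    switch (element_bits) {
    case 8:  return 4;  /* for example char4 */
    case 16: return 2;  /* for example half2 or short2 */
    case 32: return 1;  /* a scalar already fills 32 bits */
    default: return 0;
    }
}
```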
Do not vectorize 32-bit operations
Mali Bifrost and Valhall GPUs use scalar instructions, so you are not required to vectorize 32-bit operations. 32-bit scalar and vector arithmetic operations have the same performance.
Use 128-bit load or store operations
128-bit load or store operations make more efficient use of the internal buses.
Load and store operations are faster if all threads in a quad load from the same cache-line
If all threads in a quad load from the same cache-line, the arithmetic pipeline only sends one request to the load-store unit to load the 512-bit data.
For example, the following code is fast because consecutive threads load consecutive 128-bit vectors from memory:
global float4 * input_array;
float4 v = input_array[get_global_id(0)];

This second version is slower because the four threads with adjacent global IDs load data from different cache lines.

global float4 * input_array;
float4 v = input_array[4*get_global_id(0)];

Note:

One cache line is 512 bits (64 bytes).
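The difference between the two access patterns can be checked with simple address arithmetic. The following is a minimal host-side sketch in plain C, assuming that input_array starts on a cache-line boundary and that a quad is four threads with consecutive global IDs. The function name cache_lines_touched and the constants are hypothetical and are not part of OpenCL.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64u  /* 512 bits, as stated in the note above */
#define FLOAT4_BYTES     16u  /* one 128-bit float4 element */
#define QUAD_THREADS      4u  /* four threads with adjacent global IDs */

/* Hypothetical host-side model: number of distinct cache lines touched
 * when each thread of quad `quad` loads input_array[stride * global_id].
 * Assumes input_array is cache-line aligned and stride >= 1. */
unsigned cache_lines_touched(size_t quad, size_t stride)
{
    unsigned lines = 0;
    size_t prev_line = 0;
    for (size_t t = 0; t < QUAD_THREADS; ++t) {
        size_t gid = quad * QUAD_THREADS + t;
        size_t line = gid * stride * FLOAT4_BYTES / CACHE_LINE_BYTES;
        /* Lines never decrease as t grows, so comparing with the
         * previous thread's line is enough to count distinct lines. */
        if (t == 0 || line != prev_line) {
            ++lines;
            prev_line = line;
        }
    }
    return lines;
}
```

With a stride of 1, which corresponds to the first version, each quad touches one cache line. With a stride of 4, which corresponds to the second version, each quad touches four cache lines.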
Use 32-bit arithmetic in place of 64-bit if possible
64-bit arithmetic operates at half the speed of 32-bit arithmetic.
Use fine-grained shared virtual memory
If your system supports it, using the shared virtual memory feature in OpenCL 2.0 provides cache-coherent memory. This reduces the requirement for manually synchronizing memories and increases performance. See F.8 Shared virtual memory.
Try to get a good balance of usage of the execution engines and load-store units
If one unit is overused, it can limit the overall performance of the application that the GPU is executing. For example, if the load-store unit is overused, try computing values rather than loading them. If the execution engine is overused, try loading values instead of computing them.
Non-Confidential 101574_0301_00_en
Copyright © 2019 Arm Limited or its affiliates. All rights reserved.