9.2 Kernel optimizations

ARM® recommends several kernel optimizations, such as experimenting with the work-group size and shape, minimizing thread divergence, and using a work-group size of 128 or higher.

Experiment with work-group size
If you can, experiment with different work-group sizes to see if any give a performance advantage. Sizes that are a multiple of two are more likely to perform well.
If your kernel has no preference for the work-group size, you can pass NULL as the local work size argument of clEnqueueNDRangeKernel().
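As a sketch, passing NULL for the local work size lets the driver choose a size it considers efficient. The queue and kernel objects are assumed to have been created already, and the 512 by 512 global size is an arbitrary example:

```c
/* Sketch: let the driver pick the local work size by passing NULL.
   `queue` and `kernel` are assumed to be created elsewhere. */
size_t global[2] = {512, 512};
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2,
                                    NULL,    /* global work offset          */
                                    global,  /* global work size            */
                                    NULL,    /* local size: driver chooses  */
                                    0, NULL, NULL);
```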
Use a work-group size of 128 or 256 if possible
The maximum work-group size is typically 256, but this is not possible for all kernels; in those cases the driver suggests another size. A work-group size of 64 is the smallest size that is guaranteed to be available for all kernels.
If possible, use a work-group size of 128 or 256. These make optimal use of the Mali™ GPU hardware. If the maximum work-group size is below 128, your kernel might be too complex.
Experiment with work-group shape
The shape of the work-group can affect the performance of your application. For example, a 32 by 4 work-group might be the optimal size and shape.
Experiment with different shapes and sizes to find the best combination for your application.
Check for synchronization requirements
Some kernels synchronize the work-items within a work-group using barriers. These kernels typically require a specific work-group size.
Where synchronization between work-items is not required, the choice of work-group size depends on the most efficient size for the device.
You can pass NULL to let OpenCL pick an efficient size.
Check for inter-thread communication
Use clGetKernelWorkGroupInfo() to check whether the device can execute a kernel that requires inter-thread communication. If the device cannot execute the kernel, the algorithm must be implemented as a multi-pass algorithm. This involves enqueuing multiple kernels.
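A minimal sketch of this check, assuming the kernel and device objects already exist and that REQUIRED_WG_SIZE is a constant your algorithm's barriers need:

```c
/* Sketch: query the largest work-group size this kernel supports on
   `device`; if it is smaller than the size the algorithm needs for
   its barriers, fall back to a multi-pass implementation. */
size_t max_wg = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg), &max_wg, NULL);
if (max_wg < REQUIRED_WG_SIZE) {
    /* enqueue the multi-pass variant instead */
}
```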
Consider combining multiple kernels
If you have multiple kernels that work in a sequence, consider combining them into a single kernel. If you combine kernels, be careful of dependencies between them.
However, do not combine the kernels if there are widening data dependencies.
For example, consider two kernels, A and B, where kernel B takes an input produced by kernel A:
  • If kernel A is merged with kernel B to form kernel C, the only inputs kernel C can take are constant data plus the data that was previously input to kernel A.
  • Kernel C, while processing work-item n, cannot use the output that kernel A produced for work-item n-1, because there is no guarantee that work-item n-1 has already been executed. This is because the order of execution of work-items is not guaranteed.
Typically this means that the coordinate systems for kernel A and kernel B are the same.
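The dependency rule can be illustrated in plain C, standing in for two enqueued kernels. The stage functions are hypothetical; the point is that fusing is valid only because item i of stage B reads item i of stage A, so the two loops can be collapsed into one:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-item stages: kernel A doubles, kernel B adds one. */
static int stage_a(int x) { return 2 * x; }
static int stage_b(int y) { return y + 1; }

/* Two-pass version: enqueue A over all items, then B. */
static void two_pass(const int *in, int *tmp, int *out, size_t n) {
    for (size_t i = 0; i < n; ++i) tmp[i] = stage_a(in[i]);
    for (size_t i = 0; i < n; ++i) out[i] = stage_b(tmp[i]);
}

/* Fused kernel C: valid because item i of B depends only on item i of A.
   If B read tmp[i - 1], fusing would be incorrect. */
static void fused(const int *in, int *out, size_t n) {
    for (size_t i = 0; i < n; ++i) out[i] = stage_b(stage_a(in[i]));
}
```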

Note

If combining kernels requires a barrier, it is probably better to keep them separate.
Avoid splitting kernels
Avoid splitting kernels. If you are required to split a kernel, split it into as few kernels as possible.

Note

Splitting a kernel can sometimes be beneficial if it enables you to remove a barrier.

Note

Splitting a kernel can be useful if your kernel suffers from register pressure.
Minimize thread divergence
It is beneficial to minimize thread divergence because this improves data locality for caching.
To minimize thread divergence, avoid the following:
  • Variable length loops.
  • Asymmetric conditional blocks.
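One common way to remove an asymmetric conditional is to compute both sides and blend them with a mask, similar to the OpenCL select() built-in. A hedged sketch in plain C, with made-up arithmetic on each path:

```c
#include <assert.h>

/* Divergent form: work-items take different paths depending on data. */
static int with_branch(int x) {
    if (x > 0)
        return x * 3;   /* one path  */
    else
        return -x;      /* other path */
}

/* Branch-free form: evaluate both sides, then blend with a mask.
   -(x > 0) is all-ones when the condition holds, zero otherwise. */
static int branchless(int x) {
    int mask = -(x > 0);
    return (mask & (x * 3)) | (~mask & -x);
}
```

Both forms return the same result; the branch-free form keeps all work-items on the same instruction stream.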
Ensure that the kernels exit at the same time
Branches are computationally cheap on Mali GPUs. This means you can use loops in kernels without any performance impact.
Your kernels can include different code segments, but try to ensure that the kernels exit at the same time.
If they cannot, a workaround is to use a bucket algorithm.
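One reading of a bucket algorithm, sketched in plain C as an assumption rather than a definitive recipe: sort work-item indices into buckets by their expected loop-trip count, so that each bucket can be processed by a separate, uniform kernel launch in which all work-items exit together:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BUCKETS 8
#define MAX_ITEMS   16

/* Partition item indices by their loop-trip count. Each bucket could
   then be enqueued as its own kernel launch with uniform work.
   Assumes trip counts < MAX_BUCKETS and bucket sizes < MAX_ITEMS. */
static size_t bucket_items(const int *trips, size_t n,
                           size_t out[MAX_BUCKETS][MAX_ITEMS],
                           size_t counts[MAX_BUCKETS]) {
    size_t used = 0;
    for (size_t b = 0; b < MAX_BUCKETS; ++b) counts[b] = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t b = (size_t)trips[i];
        out[b][counts[b]++] = i;
        if (b + 1 > used) used = b + 1;
    }
    return used;
}
```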
Make your kernel code as simple as possible
This assists the auto-vectorization process.
Using loops and branches might make auto-vectorization more difficult.
Use vector operations in kernel code
Use vector operations in kernel code to help the compiler to map them to vector instructions.
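As an illustrative sketch, a hypothetical OpenCL kernel in which each work-item processes four floats through the float4 vector type, so the compiler can map the arithmetic to vector instructions:

```c
/* Hypothetical kernel: each work-item scales four consecutive floats. */
__kernel void scale4(__global const float *src,
                     __global float *dst,
                     const float k)
{
    size_t i = get_global_id(0);
    float4 v = vload4(i, src);   /* loads src[4*i .. 4*i+3] */
    vstore4(v * k, i, dst);      /* one vector multiply and store */
}
```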
Use a sufficient number of concurrent threads
Use a sufficient number of concurrent threads to hide the execution latency of instructions.
The number of concurrent threads that the shader core executes depends on the number of active registers your kernel uses. The higher the number of registers, the smaller the number of concurrent threads.
The number of registers used is determined by the compiler based on the complexity of the kernel, and how many live variables the kernel has at one time.
To reduce the number of registers:
  • Try reducing the number of live variables in your kernel.
  • Use a large NDRange, so there are many work-items.
  • Make sure that the variables are large, for example, use vectors.
Experiment with this to find what suits your application. You can use the off-line compiler to produce statistics for your kernels to assist with this.
Check if your kernels are small
If your kernels are small, use data with a single dimension and ensure the work-group size is a power of two.
Non-Confidential | ARM 100614_0300_00_en
Copyright © 2012, 2013, 2015, 2016 ARM. All rights reserved.