Arm recommends several kernel optimizations, such as experimenting with the work-group size and shape, minimizing thread divergence, and using a work-group size of 128 or higher.
clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t)... );
If you are using a barrier, a smaller workgroup size is better.
When you are selecting a workgroup size, consider the memory access pattern of the data.
Finding the best work-group size can be counter-intuitive, so test different options to see which one is fastest.
To ensure the global work size is divisible by 4, add a few more dummy threads.
Alternatively you can let the application processor compute the edges.
You can use a non-uniform workgroup size, but this does not guarantee better performance than the other options.
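The padding option can be sketched on the host side as follows. This is a minimal sketch rather than Arm's implementation: the helper name `padded_global_size` is hypothetical, and it assumes the kernel receives the true item count as an extra argument so that the added work-items can exit early.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round the number of work-items up to the next
 * multiple of the work-group size, so the global work size divides evenly. */
size_t padded_global_size(size_t item_count, size_t work_group_size)
{
    assert(work_group_size > 0);

    size_t remainder = item_count % work_group_size;
    if (remainder == 0) {
        return item_count;              /* already divisible, no padding */
    }
    return item_count + (work_group_size - remainder);
}
```

For example, 1000 items with a work-group size of 64 gives a padded global size of 1024. Inside the kernel, a guard such as `if (get_global_id(0) >= item_count) return;` keeps the 24 extra work-items from reading or writing out of range.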
The driver picks the work-group size it considers best. The driver usually selects a work-group size of 64.
This provides the driver with information at compile time for register use and sizing jobs to fit properly on shader cores.
If you can, experiment with different sizes to see if any give a performance advantage. Sizes that are a multiple of two are more likely to perform better.
If your kernel has no preference for the work-group size, you can pass NULL as the local work size argument when you enqueue the kernel.
The maximum work-group size is typically 256, but this is not possible for all kernels. In that case, the driver suggests another size. A work-group size of 64 is the smallest size guaranteed to be available for all kernels.
If possible, use a work-group size of 128 or 256. These make optimal use of the Mali™ GPU hardware. If the maximum work-group size is below 128, your kernel might be too complex.
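One way to apply this guidance is to query the kernel's maximum work-group size and then choose the largest preferred size that fits. The sketch below assumes `max_work_group_size` holds the value returned for `CL_KERNEL_WORK_GROUP_SIZE`. The function name and the fallback order (256, then 128, then 64, then the maximum itself) are illustrative choices, not driver behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pick the largest of the preferred sizes that does
 * not exceed the maximum reported for this kernel. */
size_t choose_work_group_size(size_t max_work_group_size)
{
    static const size_t preferred[] = { 256, 128, 64 };
    const size_t count = sizeof preferred / sizeof preferred[0];

    assert(max_work_group_size > 0);

    for (size_t i = 0; i < count; ++i) {
        if (preferred[i] <= max_work_group_size) {
            return preferred[i];
        }
    }
    /* Below 64: fall back to whatever maximum was reported. */
    return max_work_group_size;
}
```

A result below 128 is a hint that the kernel might be too complex and is worth simplifying before further tuning.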
The shape of the work-group can affect the performance of your application. For example, a 32 by 4 work-group might be the optimal size and shape.
Experiment with different shapes and sizes to find the best combination for your application.
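To make such an experiment systematic, you can enumerate the two-dimensional shapes that share one total size and time the kernel with each of them. This is a minimal sketch that assumes power-of-two shapes only; `wg_shape` and `enumerate_shapes` are hypothetical names, and the timing step is left to the application.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: one candidate two-dimensional work-group shape. */
typedef struct {
    size_t x;
    size_t y;
} wg_shape;

/* Fill `out` with shapes where x * y == total and x is a power of two.
 * `total` must itself be a power of two. Writes at most `capacity`
 * entries and returns the number written. */
size_t enumerate_shapes(size_t total, wg_shape *out, size_t capacity)
{
    assert(total > 0 && (total & (total - 1)) == 0);

    size_t count = 0;
    for (size_t x = 1; count < capacity; x *= 2) {
        out[count].x = x;
        out[count].y = total / x;
        ++count;
        if (x == total) {
            break;                      /* last shape is total by 1 */
        }
    }
    return count;
}
```

For a total size of 128 this yields eight candidates, from 1 by 128 through 32 by 4 to 128 by 1, each of which can be passed as the local work size for a timed run.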
Some kernels require work-groups for synchronization of the work-items within the work-group with barriers. These typically require a specific work-group size.
In cases where synchronization between work-items is not required, the choice of the size of the work-groups depends on the most efficient size for the device.
You can pass in NULL to enable OpenCL to pick an efficient size.
If you have multiple kernels that work in a sequence, consider combining them into a single kernel. If you combine kernels, be careful of dependencies between them.
However, do not combine the kernels if there are widening data dependencies.
Typically this means that the coordinate systems for kernel A and kernel B are the same.
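The idea of combining kernels that share a coordinate system can be illustrated with plain C loops standing in for kernels. This is a sketch under stated assumptions, not code from the guide: the function names are hypothetical, each loop iteration plays the role of one work-item, and the second stage reads only the element that the first stage wrote at the same index.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for kernel A: element i depends only on in[i]. */
void scale_pass(const float *in, float *tmp, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i) {
        tmp[i] = in[i] * k;
    }
}

/* Stand-in for kernel B: element i depends only on tmp[i]. */
void offset_pass(const float *tmp, float *out, size_t n, float b)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = tmp[i] + b;
    }
}

/* Combined version: same index space, no intermediate buffer. */
void fused_pass(const float *in, float *out, size_t n, float k, float b)
{
    assert(n == 0 || (in != NULL && out != NULL));
    for (size_t i = 0; i < n; ++i) {
        float scaled = in[i] * k;       /* what kernel A would have stored */
        out[i] = scaled + b;            /* what kernel B would have computed */
    }
}
```

The combined version works here because element `i` of the second stage needs only element `i` of the first. If the second stage needed neighboring elements produced by other work-items, the dependency would widen and the two stages would be better left as separate kernels.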
Avoid splitting kernels. If you are required to split a kernel, split it into as few kernels as possible.
Use a sufficient number of concurrent threads to hide the execution latency of instructions.
The number of concurrent threads that the shader core executes depends on the number of active registers your kernel uses. The higher the number of registers, the smaller the number of concurrent threads.
The number of registers used is determined by the compiler based on the complexity of the kernel, and how many live variables the kernel has at one time.
To reduce the number of registers:
Experiment with this to find what suits your application. You can use the off-line compiler to produce statistics for your kernels to assist with this.
When you use cl_arm_thread_limit_hint, the optimal value differs depending on the platform. Tune the value to your platform.
The cl_arm_thread_limit_hint extension is only available on Mali Bifrost GPUs.