9.2 Kernel optimizations

Arm recommends some kernel optimizations such as experimenting with the work-group size and shape, minimizing thread convergence, and using a workgroup size of 128 or higher.

Query the possible workgroup sizes that can be used to execute a kernel on the device

For example:

clGetKernelWorkgroupInfo(kernel, dev, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t)... );
For best performance, use a workgroup size that is between 4 and 64 inclusive, and a multiple of 4

If you are using a barrier, a smaller workgroup size is better.

When you are selecting a workgroup size, consider the memory access pattern of the data.

Finding the best workgroup size can be counter-intuitive, so test different options to see what one is fastest.

If the global work size is not divisible by 4, use padding at the edges or use a non-uniform workgroup size

To ensure the global work size is divisible by 4, add a few more dummy threads.

Alternatively you can let the application processor compute the edges.

You can use a non-uniform workgroup size, but this does not guarantee better performance than the other options.

If you are not sure what workgroup size is best, define local_work_size as NULL

The driver picks the workgroup size it thinks as best. The driver usually selects the work group size as 64.


The performance might not be optimal.
If you want to set the local work size, set the reqd_work_group_size qualifier to kernel functions

This provides the driver with information at compile time for register use and sizing jobs to fit properly on shader cores.

Experiment with work-group size

If you can, experiment with different sizes to see if any give a performance advantage. Sizes that are a multiple of two are more likely to perform better.

If your kernel has no preference for the work-group size, you can pass NULL to the local work size argument of the clEnqueueNDRangeKernel().

Use a work-group size of 128 or 256 if possible

The maximum work-group size is typically 256, but this is not possible for all kernels and the driver suggests another size. A work-group size of 64 is the smallest size guaranteed to be available for all kernels.

If possible, use a work-group size of 128 or 256. These make optimal use of the Mali™ GPU hardware. If the maximum work-group size is below 128, your kernel might be too complex.

Experiment with work-group shape

The shape of the work-group can affect the performance of your application. For example, a 32 by 4 work-group might be the optimal size and shape.

Experiment with different shapes and sizes to find the best combination for your application.

Check for synchronization requirements

Some kernels require work-groups for synchronization of the work-items within the work-group with barriers. These typically require a specific work-group size.

In cases where synchronization between work-items is not required, the choice of the size of the work-groups depends on the most efficient size for the device.

You can pass in NULL to enable OpenCL to pick an efficient size.

Consider combining multiple kernels

If you have multiple kernels that work in a sequence, consider combining them into a single kernel. If you combine kernels, be careful of dependencies between them.

However, do not combine the kernels if there are widening data dependencies.

For example:

  • If there are two kernels, A and B.
  • Kernel B takes an input produced by kernel A.
  • If kernel A is merged with kernel B to form kernel C, you can only input to kernel C constant data, plus data from what was previously input to kernel A.
  • Kernel C cannot use the output from kernel A n-1, because it is not guaranteed that kernel A n-1 has been executed. This is because the order of execution of work-items is not guaranteed.

Typically this means that the coordinate systems for kernel A and kernel B are the same.


If combining kernels requires a barrier, it is probably better to keep them separate.
Avoid splitting kernels

Avoid splitting kernels. If you are required to split a kernel, split it into as few kernels as possible.


  • Splitting a kernel can sometimes be beneficial if it enables you to remove a barrier.
  • Splitting a kernel can be useful if your kernel suffers from register pressure.
Check if your kernels are small
If your kernels are small, use data with a single dimension and ensure the work-group size is a power of two.
Use a sufficient number of concurrent threads

Use a sufficient number of concurrent threads to hide the execution latency of instructions.

The number of concurrent threads that the shader core executes depends on the number of active registers your kernel uses. The higher the number of registers, the smaller the number of concurrent threads.

The number of registers used is determined by the compiler based on the complexity of the kernel, and how many live variables the kernel has at one time.

To reduce the number of registers:

  • Try reducing the number of live variables in your kernel.
  • Use a large NDRange, so there are many work-items.

Experiment with this to find what suits your application. You can use the off-line compiler to produce statistics for your kernels to assist with this.

Optimize the memory access pattern of your application
Use data structures with linear access and high locality. These improve cacheability and therefore performance.
Tune the value of cl_arm_thread_limit_hint to your platform
If you are using the extension cl_arm_thread_limit_hint, the optimal value is different depending on the platform. Tune the value to your platform.


The cl_arm_thread_limit_hint extension is only available on Mali Bifrost GPUs.
Non-ConfidentialPDF file icon PDF version101574_0301_00_en
Copyright © 2019 Arm Limited or its affiliates. All rights reserved.