- Experiment with work-group size
If you can, experiment with different sizes to see if
any give a performance advantage. Sizes that are a multiple of two are more likely to
If your kernel has no preference for the work-group size, you can pass
NULL to the local work size argument of the
- Use a work-group size of 128 or 256 if possible
The maximum work-group size is typically 256, but
this is not possible for all kernels and the driver suggests another size. A work-group
size of 64 is the smallest size guaranteed to be available for all kernels.
If possible, use a work-group size of 128 or 256. These make optimal
use of the Mali™
GPU hardware. If the maximum work-group size is below 128, your kernel might be too
- Experiment with work-group shape
The shape of the work-group can affect the performance of your application. For
example, a 32 by 4 work-group might be the optimal size and shape.
Experiment with different shapes and sizes to find the best combination
for your application.
- Check for synchronization requirements
Some kernels require work-groups for synchronization
of the work-items within the work-group with barriers. These typically require a
specific work-group size.
In cases where synchronization between work-items is not required, the
choice of the size of the work-groups depends on the most efficient size for the
You can pass in
NULL to enable OpenCL
to pick an efficient size.
- Check for inter-thread communication
clGetKernelWorkGroupInfo() to check
if the device can execute a kernel that requires a minimum of inter-thread communication.
If the device cannot execute the kernel, the algorithm must be implemented as a multi-pass
algorithm. This involves enqueuing multiple kernels.
- Consider combining multiple kernels
If you have multiple kernels that work in a sequence,
consider combining them into a single kernel. If you combine kernels, be careful of
dependencies between them.
However, do not combine the kernels if there are widening data
- If there are two kernels, A and B.
- Kernel B takes an input produced by kernel A.
- If kernel A is merged with kernel B to form kernel C, you can only
input to kernel C constant data, plus data from what was previously input to kernel
- Kernel C cannot use the output from kernel A n-1, because it is not guaranteed that kernel A n-1 has been executed. This is because the order of execution of work-items
is not guaranteed.
Typically this means that the coordinate systems for kernel A and
kernel B are the same.
Note If combining kernels requires a barrier, it is probably
better to keep them separate.
- Avoid splitting kernels
Avoid splitting kernels. If you are required to split
a kernel, split it into as few kernels as possible.
Note Splitting a kernel can sometimes be beneficial if it
enables you to remove a barrier.
Note Splitting a kernel can be useful if your kernel suffers
from register pressure.
- Minimize thread divergence
It is beneficial to minimize thread divergence, this
improves data locality for caching.
To minimize thread divergence, avoid the following:
- Variable length loops.
- Asymmetric conditional blocks.
- Ensure that the kernels exit at the same time
Branches are computationally cheap on Mali GPUs. This means
you can use loops in kernels without any performance impact.
Your kernels can include different code segments but try to ensure the
kernels exit at the same time.
A workaround to this is to use a bucket
- Make your kernel code as simple as possible
This assists the auto-vectorization process.
Using loops and branches might make auto-vectorization more
- Use vector operations in kernel code
- Use vector operations in kernel code to help the compiler to map them to
- Use a sufficient number of concurrent threads
Use a sufficient number of concurrent threads to hide
the execution latency of instructions.
The number of concurrent threads that the shader core executes depends
on the number of active registers your kernel uses. The higher the number of registers,
the smaller the number of concurrent threads.
The number of registers used is determined by the compiler based on the
complexity of the kernel, and how many live variables the kernel has at one time.
To reduce the number of registers:
- Try reducing the number of live variables in your kernel.
- Use a large NDRange, so there are many work-items.
- Make sure that the variables are large, for example, use
Experiment with this to find what suits your application. You can use
the off-line compiler to produce statistics for your kernels to assist with
- Check if your kernels are small
- If your kernels are small, use data with a single dimension and ensure
the work-group size is a power of two.