- Use the best processor for the job
GPUs are designed for parallel processing.
Application processors are designed for high-speed serial processing.
All applications contain sections that perform control functions
and others that perform computation. Match each section to the
processor that suits it:
- Control and serial functions
are best performed on an application processor using a traditional language.
- Use OpenCL on Mali™ GPUs for the parallelizable compute functions.
- Compile the kernel once at the start of your application
Ensure that you compile the kernel once at the start
of your application, rather than each time you use it.
This can reduce the fixed overhead significantly.
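The build-once pattern can be sketched in plain C as follows. This is only an illustration: `kernel_handle`, `build_kernel`, `app_init`, and `process_frame` are hypothetical names, and the build step is a stub standing in for the real OpenCL calls (`clCreateProgramWithSource`, `clBuildProgram`, `clCreateKernel`) so that the sketch compiles without an OpenCL SDK.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a compiled kernel handle (cl_kernel in real code). */
typedef struct {
    const char *name;
    int ready;
} kernel_handle;

static int build_count = 0;   /* counts how often the expensive build runs */

/* Hypothetical stub for the expensive build step. In a real application this
 * is where clCreateProgramWithSource, clBuildProgram, and clCreateKernel
 * would be called. */
static kernel_handle build_kernel(const char *name)
{
    kernel_handle k = { name, 1 };
    build_count++;
    return k;
}

static kernel_handle g_kernel;   /* built once, reused for every frame */

/* Called once at application start-up. */
static void app_init(void)
{
    g_kernel = build_kernel("example_kernel");
}

/* Called for every frame: reuses the handle, never rebuilds it. */
static int process_frame(int frame)
{
    assert(g_kernel.ready);   /* app_init() must have run first */
    return frame;
}
```

Calling `app_init()` once and then `process_frame()` for each frame leaves `build_count` at 1, however many frames are processed.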
- Enqueue many work-items
To get maximum use of all your processor or shader cores, you must enqueue many
work-items. For example, in a four shader-core Mali GPU system,
enqueue 1024 or more work-items.
If you can perform your computation with fewer processor or
shader cores, you can save power by enqueuing fewer work-items.
- Process large amounts of data
You must process a relatively large amount of data
to get the benefit of OpenCL. This is because of the fixed overheads
of starting OpenCL tasks. The exact size of a data set where you
start to see benefits depends on the processors you are running
your OpenCL code on.
For example, performing simple image processing on a single 640x480 image is
unlikely to be faster on a GPU, whereas processing a 1920x1080 image is more likely to
be beneficial. Trying to benchmark a GPU with small images is only likely to measure the
start-up time of the driver.
Do not extrapolate these results to estimate the performance
of processing a larger data set. Run the benchmark on a representative
size of data for your application.
- Align data on 128-bit or 16-byte boundaries
Aligning data on 128-bit (16-byte) boundaries
can improve the speed of loading and saving data. If you can, align
data on 64-byte boundaries. This ensures data fits evenly into the
cache on Mali GPUs.
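One way to obtain host memory with this alignment is sketched below. It assumes a C11 toolchain that provides `aligned_alloc`. The names `CACHE_ALIGN`, `round_up`, and `alloc_aligned_buffer` are illustrative and do not come from any Mali or OpenCL API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* 64-byte alignment, as suggested above; this also satisfies 16-byte alignment. */
#define CACHE_ALIGN ((size_t)64)

/* Round size up to a multiple of align (align must be a power of two). */
static size_t round_up(size_t size, size_t align)
{
    return (size + align - 1) & ~(align - 1);
}

/* Allocate a buffer whose start address is 64-byte aligned.
 * C11 aligned_alloc expects the size to be a multiple of the alignment,
 * so the requested size is rounded up first. Release with free(). */
static void *alloc_aligned_buffer(size_t size)
{
    void *p = aligned_alloc(CACHE_ALIGN, round_up(size, CACHE_ALIGN));
    assert(p == NULL || ((uintptr_t)p % CACHE_ALIGN) == 0);
    return p;
}
```

For example, `alloc_aligned_buffer(1000)` requests 1024 bytes and returns either `NULL` or an address that is a multiple of 64.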
- Use the correct data types
Check each variable to see what range it requires.
Using smaller data types has several advantages:
- More operations can be performed
per cycle with smaller variables.
- You can load or store more in a single cycle.
- If you store your data in smaller containers, it
is more cacheable.
If accuracy is not critical, instead of an
int, see if a smaller integer type such as short or char works
in its place.
For example, if you add two relatively small numbers you probably
do not require an
int. However, check in case
an overflow might occur.
Only use float values if you require their additional
range. For example, if you require very small or very large numbers.
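The range check described above can be sketched in host-side C. The helper names `elems_per_128bit` and `add_fits_in_int16` are hypothetical. The 128-bit width is used only as an illustrative unit, borrowed from the alignment advice, and is not a statement about the load width of any particular GPU.

```c
#include <assert.h>
#include <stdint.h>

/* Number of elements of a given size that fit in 128 bits (16 bytes). */
static unsigned elems_per_128bit(unsigned elem_size_bytes)
{
    assert(elem_size_bytes > 0 && elem_size_bytes <= 16);
    return 16u / elem_size_bytes;
}

/* Returns 1 if a + b fits in a 16-bit signed integer, 0 if it would overflow.
 * The sum is computed in 32 bits, so the check itself cannot overflow. */
static int add_fits_in_int16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    return sum >= INT16_MIN && sum <= INT16_MAX;
}
```

Eight 16-bit values occupy the same 128 bits as four 32-bit values, which is where the load, store, and cache advantages come from. The second helper shows the kind of check to make before narrowing: 1000 + 2000 fits in 16 bits, whereas 30000 + 10000 does not.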
- Use the right data types
You can store image and other data as images or as buffers:
- If your algorithm can be vectorized, use buffers.
- If your algorithm requires interpolation or automatic edge
clamping, use images.
- Use asynchronous operations
If possible, use asynchronous operations between
the control threads and OpenCL threads. For example:
- Do not make the application
processor wait for results.
- Ensure that the application processor has other operations
to process before it requires results from the OpenCL thread.
- Ensure that the application processor does not interact
with OpenCL kernels when they are executing.
- Do not merge buffers as an optimization
Merging multiple buffers into a single buffer as
an optimization is unlikely to provide a performance benefit.
For example, if you have two input buffers you can merge them
into a single buffer and use offsets to compute addresses of data.
However, this means that every kernel must perform the offset calculations.
It is better to use two buffers and pass the addresses to
the kernel as a pair of kernel arguments.
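The difference between the two layouts can be illustrated with plain C functions standing in for kernel bodies. This is a sketch, not OpenCL C: `add_merged` and `add_separate` are hypothetical names, and the index `i` plays the role of the work-item ID.

```c
#include <assert.h>
#include <stddef.h>

/* Merged layout: both inputs live in one buffer, so every work-item
 * must add an offset to find the second input. */
static float add_merged(const float *merged, size_t offset_b, size_t i)
{
    assert(merged != NULL);
    return merged[i] + merged[offset_b + i];
}

/* Separate buffers: each input arrives as its own argument,
 * so the body needs no offset arithmetic. */
static float add_separate(const float *a, const float *b, size_t i)
{
    assert(a != NULL && b != NULL);
    return a[i] + b[i];
}
```

Both functions compute the same result. The merged version carries an extra argument and an extra address calculation on every call, which is the per-kernel cost the text warns about.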