9.3 Code optimizations
ARM® recommends several code optimizations, such as vectorizing your code, using built-in functions, and experimenting with your data, to increase algorithm performance.
- Vectorize your code
Mali™ GPUs perform computation with vectors, which enable you to perform
multiple operations per instruction. Vectorizing your code makes the best use
of the Mali GPU hardware, so ensure that you vectorize for maximum
performance. Mali GPUs contain 128-bit wide vector registers; a minimal
sketch of a vectorized kernel follows this item.
- The Midgard compiler can auto-vectorize some scalar code.
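As a minimal sketch (the kernel names, and the assumption that the buffer length is a multiple of four, are mine rather than from the original text), the following shows the same addition as scalar code and as vectorized code that fills a 128-bit register:

    /* Scalar version: one float processed per work-item. */
    __kernel void add_scalar(__global const float *a,
                             __global const float *b,
                             __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = a[i] + b[i];
    }

    /* Vectorized version: four floats per work-item, so each
       addition is one 128-bit vector operation. */
    __kernel void add_vec4(__global const float4 *a,
                           __global const float4 *b,
                           __global float4 *out)
    {
        size_t i = get_global_id(0);
        out[i] = a[i] + b[i];
    }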
- Vectorize incrementally
Vectorize in incremental steps. For example, start processing one pixel
at a time, then two, then four.
- Use vector loads and saves
Use vector loads to load as much data as possible in a single
operation. These enable you to load 128 bits at a time. Do the same for saving data.
For example, if you are loading char values, use the built-in function
vload16() to load 16 bytes at a time.
Do not try to load more than 128 bits in a single load, because doing so can
reduce performance. A sketch using vload16() and vstore16() follows.
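A minimal sketch, assuming a byte image buffer and a hypothetical kernel name:

    /* Brighten an image, 16 bytes per work-item. vload16() reads
       128 bits in one operation and vstore16() writes 128 bits back. */
    __kernel void brighten(__global const uchar *src,
                           __global uchar *dst)
    {
        size_t i = get_global_id(0);
        uchar16 pixels = vload16(i, src);        /* one 128-bit load  */
        pixels = add_sat(pixels, (uchar16)(16)); /* adjust brightness */
        vstore16(pixels, i, dst);                /* one 128-bit store */
    }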
- Perform as many operations per load as possible
Algorithms that perform multiple computations per element of loaded data
typically perform well in OpenCL; a sketch follows this list:
- Try to reuse data already loaded.
- Use as many arithmetic instructions as possible per load.
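As an illustration (the polynomial kernel is a hypothetical example, not from the original text), the following performs several arithmetic instructions on each 128-bit load:

    /* Evaluate a*x^2 + b*x + c on four floats per work-item:
       one load, then several mad() instructions on that data. */
    __kernel void poly(__global const float4 *x,
                       __global float4 *out,
                       float a, float b, float c)
    {
        size_t i = get_global_id(0);
        float4 v = x[i];  /* single 128-bit load */
        out[i] = mad(mad((float4)(a), v, (float4)(b)), v, (float4)(c));
    }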
- Avoid conversions to or from int
Conversions to or from int are relatively expensive, so avoid them if
possible.
- Experiment to see how fast you can get your algorithm to execute
There are many variables that determine how well an application performs.
Some of the interactions between these variables are complex, and it is
difficult to predict how they impact performance.
Experiment with your OpenCL kernels to see how fast they can run; a combined
sketch follows this list:
- Data types
Use the smallest data types possible for your calculation.
For example, if your data does not exceed 16 bits do not use
32-bit types. You can fit eight 16-bit words into a 128-bit wide vector but only
four 32-bit words.
- Load store types
- Try changing the amount of data processed per work-item.
- Data arrangement
- Change the data arrangement to make maximum use of the processor caches.
- Maximize data loaded
- Always load as much data in a single operation as possible. Use
128-bit wide vector loads to load as many data items as possible per load.
- Avoid processing single values
- Avoid writing kernels that operate on single bytes or other small
values. Write kernels that work on vectors.
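As one such experiment (the kernel name and data layout are assumptions for illustration), the following uses 16-bit types so that eight elements, rather than four 32-bit elements, fit in each 128-bit vector:

    /* Apply a gain to 16-bit samples, eight per work-item. */
    __kernel void scale_samples(__global const short *src,
                                __global short *dst,
                                short gain)
    {
        size_t i = get_global_id(0);
        short8 s = vload8(i, src);  /* eight elements in one load */
        vstore8(s * (short8)(gain), i, dst);
    }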
- Use shift instead of a divide
If you are dividing by a power of two, use a shift instead of a divide:
- This only works for powers of two.
- Divide and shift use different methods of rounding negative numbers.
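For example, with a hypothetical variable, assuming the value is known to be non-negative:

    uint value = get_global_id(0);
    uint quotient = value / 8u;   /* may compile to a divide */
    uint shifted  = value >> 3;   /* always a single shift   */
    /* For negative signed values the results differ:
       -9 / 8 == -1 (rounds toward zero) but
       -9 >> 3 == -2 (rounds toward negative infinity). */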
- Use vector loads and saves for scalar data
Use vector load (vload) instructions on arrays of data even if you do not
process the data as vectors. This enables you to load multiple data elements
with a single instruction. A vector load of 128 bits takes the same amount of
time as loading a single character, whereas multiple loads of single
characters are likely to cause cache thrashing, and this reduces performance.
Do the same for saving data. A sketch of this pattern follows.
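A minimal sketch, with hypothetical names: the data is loaded 128 bits at a time, then each element is processed as a scalar:

    /* Count bytes above a threshold, 16 bytes per work-item. */
    __kernel void count_bright(__global const uchar *src,
                               __global uint *counts,
                               uchar threshold)
    {
        size_t i = get_global_id(0);
        uchar16 v = vload16(i, src);   /* one load, 16 elements      */
        uchar tmp[16];
        vstore16(v, 0, tmp);           /* unpack into private memory */
        uint n = 0;
        for (int k = 0; k < 16; k++) { /* scalar, element-wise work  */
            if (tmp[k] > threshold) {
                n++;
            }
        }
        counts[i] = n;
    }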
- Use the precise versions of built-in functions
Use the precise versions of built-in functions. Usually, the
native_ versions of built-in functions provide no extra performance, although
there are a few exceptions.
- Use _sat() functions instead of min() or max()
_sat() functions automatically take the maximum or minimum values if the
values are too high or too low for representation. You are not required to
add additional min() or max() calls; a sketch follows.
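A minimal sketch with a hypothetical kernel name:

    /* Add two byte images; add_sat() clamps results to 255
       without any explicit min() or max() calls. */
    __kernel void add_images(__global const uchar *src_a,
                             __global const uchar *src_b,
                             __global uchar *dst)
    {
        size_t i = get_global_id(0);
        uchar16 a = vload16(i, src_a);
        uchar16 b = vload16(i, src_b);
        vstore16(add_sat(a, b), i, dst);
    }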
- Avoid writing kernels that use many live variables
- Avoid writing kernels that use many live variables. Using too many live
variables can impact performance and limit the maximum work-group size.
- Do not calculate constants in kernels
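One common way to follow this advice (a sketch under my own naming, not from the original text) is to compute shared constants once on the host and pass them in as kernel arguments:

    /* Avoid: every work-item computes the same reciprocal. */
    __kernel void normalize_slow(__global float4 *data, float range)
    {
        size_t i = get_global_id(0);
        data[i] = data[i] * (1.0f / range); /* recomputed per work-item */
    }

    /* Prefer: the host computes 1.0f / range once and passes it in. */
    __kernel void normalize_fast(__global float4 *data, float inv_range)
    {
        size_t i = get_global_id(0);
        data[i] = data[i] * inv_range;
    }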
- Use the offline compiler to produce statistics
- Use the mali_clcc offline
compiler built from the DDK tree to produce statistics for your kernels and check the
ratio between arithmetic instructions and loads.
- Use the built-in functions
Many of the built-in functions are implemented as fast hardware
instructions. You can use these for high-performance code; a sketch follows.
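As an illustration (the luminance kernel and its weights are my example, not from the original text), the following uses the built-ins dot() and clamp(), which typically map to fast hardware instructions:

    /* Convert RGBA pixels to clamped luminance. */
    __kernel void luminance(__global const float4 *rgba,
                            __global float *luma)
    {
        size_t i = get_global_id(0);
        float y = dot(rgba[i], (float4)(0.2126f, 0.7152f, 0.0722f, 0.0f));
        luma[i] = clamp(y, 0.0f, 1.0f);
    }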