Non-Confidential | 101574_0301_00_en
Arm® recommends some code optimizations such as using built-in functions or experimenting with your data to increase algorithm performance.
To load as much data as possible in a single operation, use vector loads. These enable you to load 128 bits at a time. Do the same for saving data.
For example, if you are loading char values, use the built-in function vload16() to load 16 bytes at a time.
Do not try to load more than 128 bits in a single load. This can reduce performance.
Operations that perform multiple computations per element of data loaded are typically well suited to OpenCL, so perform as many operations per load as possible.
Conversions between float and int are relatively expensive, so avoid them if possible.
There are many variables that determine how well an application performs. Some of the interactions between variables can be very complex and it is difficult to predict how they impact performance.
Experiment with your OpenCL kernels to see how fast they can run:
Use the smallest data types possible for your calculation.
For example, if your data does not exceed 16 bits, do not use 32-bit types.
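One way to see why smaller types help is to count how many elements fit in a single 128-bit load. The following is a minimal sketch in plain C, not code from this guide; the helper name elements_per_128_bit_load is hypothetical and exists only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many elements of the given size (in bytes)
   fit in one 128-bit (16-byte) vector load. */
static size_t elements_per_128_bit_load(size_t element_size_bytes)
{
    return 16 / element_size_bytes;
}
```

With 16-bit data, elements_per_128_bit_load(sizeof(int16_t)) gives 8 elements per load, whereas 32-bit data gives only 4, so halving the type width doubles the data moved by each load.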
If you are dividing by a power of two, use a shift instead of a divide.
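As a small illustration of the shift tip, the sketch below divides by 8 both ways. It is written in plain C, but the same expressions are valid in an OpenCL C kernel. The function names are hypothetical. The equivalence shown here is for unsigned values; for negative signed values a shift and a divide can round differently, so check the result before substituting one for the other.

```c
#include <stdint.h>

/* Divide by 8 using a divide. */
static uint32_t div_by_8(uint32_t x)
{
    return x / 8u;
}

/* Same result using a shift: 8 == 2^3, so shift right by 3 bits. */
static uint32_t shift_by_3(uint32_t x)
{
    return x >> 3;
}
```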
Use vector load VLOAD instructions on arrays of data even if you do not process the data as vectors. This enables you to load multiple data elements with a single instruction. A vector load of 128 bits takes the same amount of time as loading a single character. Multiple loads of single characters are likely to cause cache thrashing and this reduces performance. Do the same for saving data.
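The access pattern described above can be sketched as follows. This is a plain C analogy, not kernel code: memcpy() of 16 bytes stands in for the vector load, and the comment shows the OpenCL C built-in call that would take its place in a kernel. The function name sum_bytes_in_blocks and the requirement that n is a multiple of 16 are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum n bytes, fetching them 16 at a time.
   Assumption for this sketch: n is a multiple of 16. */
static uint32_t sum_bytes_in_blocks(const uint8_t *src, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i += 16) {
        uint8_t block[16];
        /* One 128-bit fetch. In an OpenCL C kernel this would be:
               uchar16 block = vload16(i / 16, src);               */
        memcpy(block, src + i, sizeof block);
        /* The elements are still processed one at a time, as scalars. */
        for (size_t j = 0; j < 16; ++j) {
            total += block[j];
        }
    }
    return total;
}
```

The point of the pattern is that the data is fetched in 128-bit units even though the arithmetic stays scalar.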
The _sat() functions automatically take the maximum or minimum value if a result is too high or too low for representation. You are not required to add additional min() or max() code.
If values used by a kernel are only known at runtime, calculate them in the host application and pass them as arguments to the kernel. For example, height-1.
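To make the _sat() behavior concrete, the sketch below emulates in plain C what the OpenCL C built-ins add_sat() and sub_sat() return for 8-bit unsigned values. The helper names add_sat_u8 and sub_sat_u8 are hypothetical. In a kernel you would call the built-ins directly, and the explicit clamping shown here is exactly the code you no longer have to write.

```c
#include <stdint.h>

/* Emulates add_sat() for 8-bit unsigned values:
   results above 255 clamp to 255 instead of wrapping around. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
}

/* Emulates sub_sat() for 8-bit unsigned values:
   results below 0 clamp to 0 instead of wrapping around. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}
```

For example, add_sat_u8(200, 100) yields 255 rather than the wrapped value 44, and sub_sat_u8(10, 20) yields 0.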
Use the offline compiler to produce statistics for your kernels and check the ratio between arithmetic instructions and loads.
Many of the built-in functions are implemented as fast hardware instructions. Use them for high performance.