9.3 Code optimizations

Arm® recommends some code optimizations such as using built-in functions or experimenting with your data to increase algorithm performance.

Use vector loads and saves

To load as much data as possible in a single operation, use vector loads. These enable you to load 128 bits at a time. Do the same for saving data.

For example, if you are loading char values, use the built-in function vload16() to load 16 bytes at a time.

Do not try to load more than 128 bits in a single load. This can reduce performance.
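
As a rough worked example of the 128-bit rule, the following plain C sketch (not kernel code) computes how many elements of a given size fit in one 128-bit load. The helper name `elements_per_128bit_load` and the macro `MAX_LOAD_BYTES` are hypothetical and exist only for illustration.

```c
#include <stddef.h>

/* Width of the widest recommended load, in bytes (128 bits). */
#define MAX_LOAD_BYTES 16

/*
 * Hypothetical helper: number of elements of elem_size bytes that fit in
 * one 128-bit load. Returns 0 for a zero size or a size that does not fit.
 */
size_t elements_per_128bit_load(size_t elem_size)
{
    if (elem_size == 0 || elem_size > MAX_LOAD_BYTES) {
        return 0;
    }
    return MAX_LOAD_BYTES / elem_size;
}
```

Under this sketch, 1-byte char data gives 16, which corresponds to vload16(), 2-byte short data gives 8, which corresponds to vload8(), and 4-byte int or float data gives 4, which corresponds to vload4().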

Perform as many operations per load as possible

Operations that perform multiple computations per element of data loaded are typically well suited to OpenCL:

  • Try to reuse data already loaded.
  • Use as many arithmetic instructions as possible per load.
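
One way to picture data reuse is a sliding three-element sum. The sketch below is plain C standing in for a kernel body, and the function name `sum3_reuse` is hypothetical. It keeps the two most recently loaded values in local variables so that each output needs only one new load.

```c
#include <stddef.h>

/*
 * Hypothetical example: out[i] = in[i] + in[i + 1] + in[i + 2]
 * for i in [0, n - 2). The values a and b are carried over from the
 * previous iteration, so each output costs one new load instead of three.
 * The caller must provide room for n - 2 outputs.
 */
void sum3_reuse(const int *in, int *out, size_t n)
{
    if (n < 3) {
        return;
    }
    int a = in[0];
    int b = in[1];
    for (size_t i = 0; i + 2 < n; ++i) {
        int c = in[i + 2];      /* the only new load for this output */
        out[i] = a + b + c;
        a = b;                  /* reuse data that is already loaded */
        b = c;
    }
}
```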

Avoid conversions to or from float and int

Conversions to or from float and int are relatively expensive, so avoid them if possible.

Experiment to see how fast you can get your algorithm to execute

There are many variables that determine how well an application performs. Some of the interactions between variables can be very complex and it is difficult to predict how they impact performance.

Experiment with your OpenCL kernels to see how fast they can run:

Data types

Use the smallest data types possible for your calculation.

For example, if your data does not exceed 16 bits, do not use 32-bit types.

Load and store types

Try changing the amount of data processed per work-item.

Data arrangement

Change the data arrangement to make maximum use of the processor caches.

Maximize data loaded

Always load as much data in a single operation as possible. Use 128-bit wide vector loads to load as many data items as possible per load.

Use shift instead of a divide

If you are dividing by a power of two, use a shift instead of a divide. Be aware of the following:

  • This only applies to integers.
  • This only works for powers of two.
  • Divide and shift use different methods of rounding negative numbers.
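
The rounding difference can be seen in a small plain C sketch. The function names `div_pow2` and `shift_pow2` are hypothetical. The sketch assumes that the compiler implements a right shift of a negative signed value as an arithmetic shift, which is the behavior that OpenCL C specifies and that mainstream C compilers provide.

```c
#include <stdint.h>

/* Divide by 2^k with the division operator: C rounds toward zero. */
int32_t div_pow2(int32_t x, unsigned k)
{
    return x / (int32_t)(INT32_C(1) << k);
}

/*
 * Divide by 2^k with a right shift. For negative x this assumes an
 * arithmetic shift, which rounds toward negative infinity.
 */
int32_t shift_pow2(int32_t x, unsigned k)
{
    return x >> k;
}
```

For non-negative inputs the two functions agree, for example 20 divided by 4 is 5 either way. For -7 divided by 4, the division gives -1 but the shift gives -2, so only substitute a shift where that difference is acceptable or the values are known to be non-negative.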

Use vector loads and saves for scalar data

Use vector load (VLOAD) instructions on arrays of data even if you do not process the data as vectors. This enables you to load multiple data elements with a single instruction. A vector load of 128 bits takes the same amount of time as loading a single character. Multiple loads of single characters are likely to cause cache thrashing, and this reduces performance. Do the same for saving data.
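
The access pattern can be modeled in plain C, as a sketch only. Here a 16-byte memcpy() stands in for a single vload16() in a kernel, and the values are then handled one at a time as scalars. The function name `count_above` is hypothetical, and the sketch assumes that the array length is a multiple of 16.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical example: count the bytes greater than threshold.
 * Each iteration fetches a 16-byte block in one step (standing in for
 * one vload16()) and then processes the 16 values as scalars.
 * Assumes n is a multiple of 16; any remainder is ignored.
 */
size_t count_above(const unsigned char *data, size_t n, unsigned char threshold)
{
    size_t count = 0;
    for (size_t i = 0; i + 16 <= n; i += 16) {
        unsigned char block[16];
        memcpy(block, data + i, sizeof block);  /* one 128-bit block load */
        for (size_t j = 0; j < 16; ++j) {       /* scalar processing */
            if (block[j] > threshold) {
                ++count;
            }
        }
    }
    return count;
}
```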

Use _sat() functions instead of min() or max()

_sat() functions automatically clamp a result to the maximum or minimum representable value of the type if it is too high or too low for representation. You are not required to add additional min() or max() code.
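
To illustrate the equivalence, the sketch below models in plain C what the OpenCL built-in add_sat() does for uchar operands, next to the widen-then-min() pattern that it replaces. The names `add_sat_u8`, `add_then_min_u8`, and `min_u` are hypothetical stand-ins and are not OpenCL functions.

```c
#include <stdint.h>

/* Plain C model of add_sat() on 8-bit unsigned values: clamps at 255. */
uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
}

/* Helper for the explicit pattern below. */
static unsigned min_u(unsigned x, unsigned y)
{
    return x < y ? x : y;
}

/* The pattern that a _sat() call replaces: widen, add, then clamp. */
uint8_t add_then_min_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)min_u((unsigned)a + (unsigned)b, 255u);
}
```

Both functions return 255 for 200 + 100, whereas a plain 8-bit add would wrap around. In a kernel, a single add_sat() call gives the clamped result without the widening and the extra min().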

Avoid writing kernels that use many live variables

Using too many live variables can affect performance and limit the number of concurrently executing threads per core.

Do not calculate constants in kernels

  • Use defines for constants.
  • If the values are only known at runtime, calculate them in the host application and pass them as arguments to the kernel.

    For example, height-1.
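
A minimal sketch of the height-1 case, in plain C: the host computes the value once and hands it over with the other arguments, so the per-work-item code never repeats the subtraction. The `kernel_args` struct and both function names are hypothetical. In a real host application the value would typically be passed with clSetKernelArg().

```c
/* Hypothetical bundle of values that the host passes to the kernel. */
typedef struct {
    int width;
    int height;
    int max_row;    /* height - 1, computed once on the host */
} kernel_args;

/* Host side: compute the derived constant once, before the launch. */
kernel_args make_kernel_args(int width, int height)
{
    kernel_args args = { width, height, height - 1 };
    return args;
}

/* Stand-in for per-work-item code: uses max_row directly. */
int clamp_row(int y, const kernel_args *args)
{
    if (y < 0) {
        return 0;
    }
    return y > args->max_row ? args->max_row : y;
}
```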

Use the offline compiler to produce statistics

Use the offline compiler to produce statistics for your kernels and check the ratio between arithmetic instructions and loads.

Use the built-in functions

Many of the built-in functions are implemented as fast hardware instructions. Use them for high performance.

Use the cache carefully

  • The amount of cache space available per thread is low, so you must use it with care.
  • Use the minimum data size possible.
  • Use data access patterns to maximize spatial locality.
  • Use data access patterns to maximize temporal locality.

Use large sequential reads and writes

General-purpose computations on a GPU can make very heavy use of external memory. Using large sequential reads and writes significantly improves memory performance.
Non-Confidential 101574_0302_00_en
Copyright © 2019 Arm Limited or its affiliates. All rights reserved.