9.3 Code optimizations

ARM® recommends some code optimizations such as vectorizing your code, using built-in functions, or experimenting with your data to increase algorithm performance.

Vectorize your code
Mali™ GPUs perform computation with vectors. These enable you to perform multiple operations per instruction.
Vectorizing your code makes the best use of the Mali GPU hardware, so vectorize your code for maximum performance.
Mali GPUs contain 128-bit wide vector registers.


Note
  • The Midgard compiler can auto-vectorize some scalar code.
Vectorize incrementally
Vectorize in incremental steps. For example, start by processing one pixel at a time, then two, then four.
Use vector loads and stores
Use vector loads to load as much data as possible in a single operation. These enable you to load 128 bits at a time. Do the same when storing data.
For example, if you are loading char values, use the built-in function vload16() to load 16 bytes at a time.
Do not try to load more than 128 bits in a single load. This can reduce performance.
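As a rough illustration of this access pattern, the following plain C sketch moves 16 bytes (128 bits) per step instead of one byte at a time. It is a host-side stand-in, not Mali kernel code: in an OpenCL C kernel the two memcpy() calls would correspond to vload16() and vstore16() on char data. The function name add_bias_16 and its parameters are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: adds `bias` to every byte of `src` and writes the
 * result to `dst`, moving 16 bytes (128 bits) per load and per store.
 * Assumes `n` is a multiple of 16. The additions wrap modulo 256. */
void add_bias_16(const unsigned char *src, unsigned char *dst,
                 size_t n, unsigned char bias)
{
    for (size_t i = 0; i < n; i += 16) {
        unsigned char v[16];

        memcpy(v, src + i, sizeof v);   /* one 128-bit load, like vload16() */
        for (int lane = 0; lane < 16; ++lane)
            v[lane] = (unsigned char)(v[lane] + bias);
        memcpy(dst + i, v, sizeof v);   /* one 128-bit store, like vstore16() */
    }
}
```

Each iteration performs exactly one load and one store of 128 bits, which matches the limit described above.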
Perform as many operations per load as possible
Algorithms that perform several computations on each element of loaded data are typically well suited to OpenCL:
  • Try to reuse data already loaded.
  • Use as many arithmetic instructions as possible per load.
Avoid conversions between float and int
Conversions between float and int are relatively expensive, so avoid them if possible.
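One possible way to avoid such conversions is to keep a calculation in integer arithmetic. The sketch below is an assumption-laden example, not a technique taken from this guide: it scales an 8-bit value by 3/4 once through float and once with integer operations only. The function names are hypothetical, and the code is plain C so that it can be compiled on the host.

```c
/* Scales an 8-bit value by 3/4 through float. This needs an int-to-float
 * conversion on the way in and a float-to-int conversion on the way out. */
unsigned char scale_via_float(unsigned char x)
{
    return (unsigned char)((float)x * 0.75f);
}

/* Computes the same result with integer arithmetic only:
 * multiply by 3, then shift right by 2 (divide by 4, rounding down). */
unsigned char scale_via_int(unsigned char x)
{
    return (unsigned char)((x * 3) >> 2);
}
```

For every input from 0 to 255 the two functions return the same value, because the inputs are non-negative and both versions round down.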
Experiment to see how fast you can get your algorithm to execute
There are many variables that determine how well an application performs. Some of the interactions between variables can be very complex and it is difficult to predict how they impact performance.
Experiment with your OpenCL kernels to see how fast they can run:
Data types
Use the smallest data types possible for your calculation.
For example, if your data does not exceed 16 bits, do not use 32-bit types. You can fit eight 16-bit values into a 128-bit wide vector but only four 32-bit values.
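The arithmetic behind these numbers can be sketched as follows. The helper name lanes_per_vector is hypothetical, and the 128-bit register width is the figure given in this section.

```c
#include <stddef.h>
#include <stdint.h>

/* Width of a Mali vector register in bits, as stated in this section. */
#define VECTOR_REGISTER_BITS 128

/* Hypothetical helper: number of elements of `element_bytes` bytes each
 * that fit into one 128-bit vector. */
unsigned lanes_per_vector(size_t element_bytes)
{
    return (unsigned)(VECTOR_REGISTER_BITS / (element_bytes * 8));
}
```

For example, lanes_per_vector(sizeof(int16_t)) gives 8 and lanes_per_vector(sizeof(int32_t)) gives 4, so halving the element size doubles the number of values handled per vector operation.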
Load store types
Try changing the amount of data processed per work-item.
Data arrangement
Change the data arrangement to make maximum use of the processor caches.
Maximize data loaded
Always load as much data in a single operation as possible. Use 128-bit wide vector loads to load as many data items as possible per load.
Avoid processing single values
Avoid writing kernels that operate on single bytes or other small values. Write kernels that work on vectors.
Use shift instead of a divide
If you are dividing by a power of two, use a shift instead of a divide.


Note
  • This only works for powers of two.
  • Divide and shift round negative numbers differently. Integer division rounds toward zero, whereas an arithmetic right shift rounds toward negative infinity.
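The rounding difference for negative numbers can be seen in a small example. The sketch below is plain C with hypothetical function names. Note that plain C leaves the right shift of a negative value implementation-defined, so the code assumes the usual arithmetic shift; OpenCL C defines a right shift of a signed value to fill the vacated bits with the sign bit.

```c
/* Divide by 4: integer division truncates toward zero. */
int div4(int x)
{
    return x / 4;
}

/* Shift right by 2. Assumes an arithmetic shift for negative x, which
 * rounds toward negative infinity. */
int shr2(int x)
{
    return x >> 2;
}
```

For non-negative inputs the two functions agree, for example both return 5 for 20. For -7, div4() returns -1 and shr2() returns -2, so only replace a divide with a shift when the inputs are non-negative or the different rounding is acceptable.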
Use vector loads and stores for scalar data
Use vector load VLOAD instructions on arrays of data even if you do not process the data as vectors. This enables you to load multiple data elements with a single instruction. A vector load of 128 bits takes the same amount of time as loading a single character. Multiple loads of single characters are likely to cause cache thrashing, and this reduces performance. Do the same when storing data.
Use the precise versions of built-in functions
Use the precise versions of built-in functions.
Usually, the half_ or native_ versions of built-in functions provide no extra performance. The following functions are exceptions:
  • native_sin().
  • native_cos().
  • native_tan().
  • native_divide().
  • native_exp().
  • native_sqrt().
  • half_sqrt().
Use _sat() functions instead of min() or max()
_sat() functions automatically clamp a result to the maximum or minimum value of the type if the result is too high or too low to represent. You do not need to add extra min() or max() code.
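To show what the saturating behavior replaces, the following plain C sketch adds two 8-bit values with a hand-written clamp and, for comparison, with ordinary wrapping arithmetic. Both function names are hypothetical. In an OpenCL C kernel, the built-in add_sat() on uchar operands gives the same result as add_clamped() without the explicit comparison.

```c
/* Hand-written clamp: the extra code that a saturating built-in makes
 * unnecessary. Adds two 8-bit values and clamps the sum at 255. */
unsigned char add_clamped(unsigned char a, unsigned char b)
{
    unsigned int sum = (unsigned int)a + b;
    return (unsigned char)(sum > 255u ? 255u : sum);
}

/* Ordinary 8-bit addition for comparison: the result wraps modulo 256. */
unsigned char add_wrapping(unsigned char a, unsigned char b)
{
    return (unsigned char)(a + b);
}
```

With inputs 200 and 100, add_clamped() returns 255 while add_wrapping() returns 44.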
Avoid writing kernels that use many live variables
Avoid writing kernels that use many live variables. Using too many live variables can reduce performance and limit the maximum work-group size.
Do not calculate constants in kernels
  • Use defines for constants.
  • If the values are only known at runtime, calculate them in the host application and pass them as arguments to the kernel.
    For example, height-1.
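A minimal sketch of the height-1 case follows, written as plain C. The kernel_args structure and both functions are hypothetical stand-ins: in a real application the host would pass the precomputed value to the kernel with clSetKernelArg(), and clamp_row() represents code that every work-item would run.

```c
/* Hypothetical argument block: values the host computes once and then
 * passes to the kernel. */
typedef struct {
    int width;
    int height;
    int last_row;   /* height - 1, computed once on the host */
} kernel_args;

/* Host side: compute the constant once, before the kernel is enqueued. */
kernel_args make_kernel_args(int width, int height)
{
    kernel_args args = { width, height, height - 1 };
    return args;
}

/* Stand-in for per-work-item code: clamps a row index to the image using
 * the precomputed last_row instead of recomputing height - 1 each time. */
int clamp_row(int y, const kernel_args *args)
{
    if (y < 0)
        return 0;
    return y > args->last_row ? args->last_row : y;
}
```

For a 640 by 480 image, last_row is 479, and every work-item reuses that value rather than performing the subtraction itself.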
Use the offline compiler to produce statistics
Use the mali_clcc offline compiler built from the DDK tree to produce statistics for your kernels and check the ratio between arithmetic instructions and loads.
Use the built-in functions
Many of the built-in functions are implemented as fast hardware instructions. You can use some of these for high performance.
Related reference
Chapter 10 The kernel auto-vectorizer and unroller
Appendix B OpenCL Built-in Functions
B.3 half_ and native_ math functions
Non-Confidential ARM 100614_0300_00_en
Copyright © 2012, 2013, 2015, 2016 ARM. All rights reserved.