9.1 General optimizations

ARM® recommends general optimizations such as processing large amounts of data, using the correct data types, and compiling the kernels once.

Use the best processor for the job
GPUs are designed for parallel processing.
Application processors are designed for high-speed serial computations.
All applications contain sections that perform control functions and others that perform computation.
  • Control and serial functions are best performed on an application processor using a traditional language.
  • Use OpenCL on Mali™ GPUs for the parallelizable compute functions.
Compile the kernel once at the start of your application
Ensure that you compile the kernel once at the start of your application. This can reduce the fixed overhead significantly.
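The pattern can be sketched in plain C as follows. This is only an illustration: `build_kernel_stub`, `kernel_t`, `app_t`, `app_init`, and `app_process_frame` are hypothetical names, and the stub stands in for the expensive build step that a real application performs with calls such as `clCreateProgramWithSource`, `clBuildProgram`, and `clCreateKernel`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placeholder for a compiled kernel (a cl_kernel in real code). */
typedef struct {
    int handle;
} kernel_t;

static int g_build_count = 0;   /* how many times the expensive build ran */

/* Hypothetical stand-in for the expensive build step. */
static kernel_t build_kernel_stub(const char *source)
{
    assert(source != NULL);
    g_build_count++;
    kernel_t k = { 1 };
    return k;
}

int build_count(void)
{
    return g_build_count;
}

/* Application state that owns the compiled kernel for its whole lifetime. */
typedef struct {
    kernel_t kernel;
    int ready;
} app_t;

/* Called once at application start: this is the only place that builds. */
void app_init(app_t *app, const char *source)
{
    app->kernel = build_kernel_stub(source);
    app->ready = 1;
}

/* Called for every frame or data set: reuses the kernel, never rebuilds. */
int app_process_frame(const app_t *app)
{
    assert(app->ready);
    return app->kernel.handle;
}
```

Calling `app_init` once and `app_process_frame` many times leaves `build_count()` at 1, so the build cost is paid a single time regardless of how much data is processed.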
Enqueue many work-items
To get maximum use of all your processor or shader cores, you must enqueue many work-items. For example, in a four shader-core Mali GPU system, enqueue 1024 or more work-items.
If you can perform your computation with fewer processor or shader cores, you can save power by enqueuing fewer work-items.
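As a rough illustration of sizing the work, the following C helpers compute a global work size for a one-work-item-per-pixel mapping. The helper names and the work-group size of 64 used in the usage note are assumptions made for this sketch, not values taken from this guide.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of `multiple` (for example, a work-group size). */
size_t round_up(size_t n, size_t multiple)
{
    assert(multiple != 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

/* One work-item per pixel: global work size for a width x height image,
   padded so that it divides evenly into work-groups of group_size. */
size_t global_work_items(size_t width, size_t height, size_t group_size)
{
    return round_up(width * height, group_size);
}
```

With an assumed work-group size of 64, `global_work_items(1920, 1080, 64)` gives 2073600 work-items, far above the 1024 suggested for a four shader-core system. When padding adds work-items beyond the real data size, the kernel must check its index against the real size before touching memory.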
Process large amounts of data
You must process a relatively large amount of data to get the benefit of OpenCL. This is because of the fixed overheads of starting OpenCL tasks. The exact size of a data set where you start to see benefits depends on the processors you are running your OpenCL code on.
For example, performing simple image processing on a single 640x480 image is unlikely to be faster on a GPU, whereas processing a 1920x1080 image is more likely to be beneficial. Trying to benchmark a GPU with small images is only likely to measure the start-up time of the driver.
Do not extrapolate these results to estimate the performance of processing a larger data set. Run the benchmark on a representative size of data for your application.
Align data on 128-bit (16-byte) boundaries
Aligning data on 128-bit (16-byte) boundaries can improve the speed of loads and stores. If you can, align data on 64-byte boundaries so that it fits evenly into the cache on Mali GPUs.
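On the host side, one way to obtain such alignment is the C11 `aligned_alloc` function. The sketch below is a minimal example under that assumption; `alloc_aligned64` and `is_aligned` are hypothetical helper names, and how the memory is then shared with the OpenCL runtime is outside its scope.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_BYTES 64u

/* Allocate at least `bytes` bytes on a 64-byte boundary.
   C11 aligned_alloc expects the size to be a multiple of the alignment,
   so the request is rounded up. Release the memory with free(). */
void *alloc_aligned64(size_t bytes)
{
    size_t padded = (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
    if (padded == 0) {
        padded = CACHE_LINE_BYTES;
    }
    return aligned_alloc(CACHE_LINE_BYTES, padded);
}

/* Return nonzero if p lies on a multiple of `alignment` bytes. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t)p % alignment) == 0;
}
```

A pointer that is 64-byte aligned is also 16-byte aligned, so a single allocation of this kind satisfies both recommendations.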
Use the correct data types
Check each variable to see what range it requires.
Using smaller data types has several advantages:
  • More operations can be performed per cycle with smaller variables.
  • You can load or store more in a single cycle.
  • If you store your data in smaller containers, it is more cacheable.
If the full range of an int is not required, see if a short, ushort, or char works in its place.
For example, if you add two relatively small numbers you probably do not require an int. However, check in case an overflow might occur.
Only use float values if you require their additional range, for example, if you require very small or very large numbers.
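The effect of the element size, and the overflow check mentioned above, can be illustrated in plain C. The 128-bit width is used here only as an illustrative unit, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of elements of elem_bytes bytes that fit in 128 bits (16 bytes). */
size_t lanes_per_128bit(size_t elem_bytes)
{
    assert(elem_bytes > 0 && elem_bytes <= 16);
    return 16 / elem_bytes;
}

/* Average two 8-bit pixels. The sum can reach 510, which overflows 8 bits
   but fits in 16 bits, so a 16-bit intermediate is the narrowest safe type.
   Plain C promotes the operands to int anyway; the explicit uint16_t
   documents the range the intermediate actually needs. */
uint8_t average_u8(uint8_t a, uint8_t b)
{
    uint16_t sum = (uint16_t)((uint16_t)a + (uint16_t)b);
    return (uint8_t)(sum / 2);
}
```

Four 32-bit values fit in 128 bits, compared with eight 16-bit or sixteen 8-bit values, which is where the gain in operations per cycle and in cache usage comes from. The averaging function shows the kind of range check to make before narrowing a type: the inputs fit in 8 bits, but the intermediate sum does not.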
Use the right type of memory object
You can store image and other data as images or as buffers:
  • If your algorithm can be vectorized, use buffers.
  • If your algorithm requires interpolation or automatic edge clamping, use images.
Use asynchronous operations
If possible, use asynchronous operations between the control threads and OpenCL threads. For example:
  • Do not make the application processor wait for results.
  • Ensure that the application processor has other operations to process before it requires results from the OpenCL thread.
  • Ensure that the application processor does not interact with OpenCL kernels when they are executing.
Do not merge buffers as an optimization
Merging multiple buffers into a single buffer as an optimization is unlikely to provide a performance benefit.
For example, if you have two input buffers, you can merge them into a single buffer and use offsets to compute the addresses of data. However, this means that every kernel must perform the offset calculations.
It is better to use two buffers and pass the addresses to the kernel as a pair of kernel arguments.
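The difference can be sketched with plain C functions standing in for the body of a kernel, where `gid` plays the role of the work-item index that an OpenCL C kernel obtains from `get_global_id(0)`. The function names and the element-wise addition are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Merged layout: both inputs live in one buffer, so every work-item
   has to add offset_b to locate its second operand. */
void add_merged_item(const float *merged, size_t offset_b,
                     float *out, size_t gid)
{
    out[gid] = merged[gid] + merged[offset_b + gid];
}

/* Separate buffers: each input arrives as its own argument and is
   indexed directly, with no extra offset arithmetic. */
void add_separate_item(const float *a, const float *b,
                       float *out, size_t gid)
{
    assert(a != NULL && b != NULL && out != NULL);
    out[gid] = a[gid] + b[gid];
}
```

Both versions produce the same result, but the merged version carries an extra argument and an extra addition in every work-item, without reducing the amount of data that is read.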
Related reference
Chapter 6 Converting Existing Code to OpenCL
Non-Confidential ARM 100614_0300_00_en
Copyright © 2012, 2013, 2015, 2016 ARM. All rights reserved.