9.1 General optimizations

Arm® recommends general optimizations such as processing large amounts of data, using the correct data types, and compiling the kernels once.

Use the best processor for the job

GPUs are designed for parallel processing.

Application processors are designed for high-speed serial computations.

All applications contain sections that perform control functions and others that perform computation. For optimal performance, use the best processor for each task:

  • Control and serial functions are best performed on an application processor using a traditional language.
  • Use OpenCL on Mali™ GPUs for the parallelizable compute functions.
Compile the kernel once at the start of your application
Ensure that you compile the kernel once at the start of your application. This can reduce the fixed overhead significantly.
Enqueue many work-items

To get maximum use of all your processor or shader cores, you must enqueue many work-items. For example, in a four shader-core Mali GPU system, enqueue 1024 or more work-items.

Process large amounts of data

You must process a relatively large amount of data to get the benefit of OpenCL. This is because of the fixed overheads of starting OpenCL tasks. The exact size of a data set where you start to see benefits depends on the processors you are running your OpenCL code on.

For example, performing simple image processing on a single 640x480 image is unlikely to be faster on a GPU, whereas processing a 1920x1080 image is more likely to be beneficial. Trying to benchmark a GPU with small images is only likely to measure the start-up time of the driver.

Do not extrapolate these results to estimate the performance of processing a larger data set. Run the benchmark on a representative size of data for your application.

Align data on 128-bit (16-byte) boundaries
Align data on 128-bit (16-byte) boundaries. Doing so can improve the speed of loading and saving data. If you can, align data on 64-byte boundaries instead. This ensures that data fits evenly into the cache on Mali GPUs.
Use the correct data types

Check each variable to see what range it requires.

Using smaller data types has several advantages:

  • More operations can be performed per cycle with smaller variables.
  • You can load or store more in a single cycle.
  • If you store your data in smaller containers, more of it fits in the cache.

If accuracy is not critical, check whether a short, ushort, or char can be used instead of an int.

For example, if you add two relatively small numbers you probably do not require an int. However, check that the result cannot overflow the smaller type.

Only use float values if you require their additional range, for example, to represent very small or very large numbers.

Choose between images and buffers

You can store image and other data as images or as buffers:

  • If your algorithm can be vectorized, use buffers.
  • If your algorithm requires interpolation or automatic edge clamping, use images.
Do not merge buffers as an optimization

Merging multiple buffers into a single buffer as an optimization is unlikely to provide a performance benefit.

For example, if you have two input buffers you can merge them into a single buffer and use offsets to compute addresses of data. However, this means that every kernel must perform the offset calculations.

It is better to use two buffers and pass the addresses to the kernel as a pair of kernel arguments.

Use asynchronous operations

If possible, use asynchronous operations between the control threads and OpenCL threads. For example:

  • Do not make the application processor wait for results.
  • Ensure that the application processor has other operations to process before it requires results from the OpenCL thread.
  • Ensure that the application processor does not interact with OpenCL kernels when they are executing.
Avoid application processor and GPU interactions in the middle of processing
Enqueue all the kernels first, and call clFinish() at the end if possible.
Call clFlush() after one or more clEnqueueNDRangeKernel() calls, and call clFinish() before checking the final result.
Avoid blocking calls in the submission thread
Avoid clFinish(), clWaitForEvents(), or any other blocking calls in the submission thread.
If you want to check results while computations are in progress, wait for an asynchronous callback instead.
Try double buffering, if you are using blocking operations in your submission thread.
Batch kernel submissions
From version r17p0 onwards, the OpenCL driver batches kernels that are flushed together for submission to the hardware. Batching kernels can significantly reduce the runtime overheads and cache maintenance costs. For example, this reduction is useful when the application accesses, in separate kernels, multiple sub-buffers created from a buffer that was imported using clImportMemoryARM.
The application should flush kernels in groups that are as large as possible. However, when the GPU is idle, reaching optimal performance requires the application to flush an initial batch of kernels early, so that GPU execution overlaps with the queuing of further kernels.
Non-Confidential | 101574_0301_00_en
Copyright © 2019 Arm Limited or its affiliates. All rights reserved.