Arm® recommends general optimizations such as processing a large amount of data, using the correct data types, and compiling the kernels once.
GPUs are designed for parallel processing.
Application processors are designed for high-speed serial computations.
All applications contain sections that perform control functions and others that perform computation. For optimal performance, use the best processor for each task: the application processor for control and serial sections, and the GPU for parallel computation.
To get maximum use of all your processor or shader cores, you must enqueue many work-items. For example, in a four shader-core Mali GPU system, enqueue 1024 or more work-items.
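As a rough illustration of sizing the enqueue, the following C sketch rounds a work-item count up to a whole number of work-groups and checks it against a target. The helper names are hypothetical, and the figure of 256 work-items per shader core is only an assumption inferred from the example above (1024 work-items on four cores), not a documented threshold.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of local_size, so the global work
 * size divides evenly into work-groups. */
size_t round_up_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    return ((n + local_size - 1) / local_size) * local_size;
}

/* Hypothetical heuristic: returns 1 if global_size provides at least
 * min_items_per_core work-items for every shader core, otherwise 0.
 * The per-core minimum is an assumption chosen by the caller. */
int fills_gpu(size_t global_size, size_t shader_cores,
              size_t min_items_per_core)
{
    return global_size >= shader_cores * min_items_per_core;
}
```

If the global size is padded beyond the real data size in this way, the kernel needs a bounds check so that the extra work-items do nothing.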
You must process a relatively large amount of data to get the benefit of OpenCL. This is because of the fixed overheads of starting OpenCL tasks. The exact size of a data set where you start to see benefits depends on the processors you are running your OpenCL code on.
For example, performing simple image processing on a single 640x480 image is unlikely to be faster on a GPU, whereas processing a 1920x1080 image is more likely to be beneficial. Trying to benchmark a GPU with small images is only likely to measure the start-up time of the driver.
Do not extrapolate these results to estimate the performance of processing a larger data set. Run the benchmark on a representative size of data for your application.
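To see why extrapolating from a small benchmark is misleading, consider a simple cost model in which every task pays a fixed start-up overhead plus a per-item cost. The sketch below is only a model: the function names are hypothetical, and any overhead or per-item values passed in are illustrative assumptions, not measurements of a real driver.

```c
#include <assert.h>

/* Modelled run time: a fixed start-up overhead plus a cost that grows
 * linearly with the number of items processed. */
double modelled_time_us(double overhead_us, double per_item_us,
                        double items)
{
    assert(items >= 0.0);
    return overhead_us + per_item_us * items;
}

/* Naive extrapolation: scale a time measured on a small data set
 * linearly to a larger one. This also scales the fixed overhead,
 * so it overestimates whenever the overhead is non-zero. */
double naive_extrapolation_us(double small_time_us, double small_items,
                              double large_items)
{
    assert(small_items > 0.0);
    return small_time_us * (large_items / small_items);
}
```

With a large assumed overhead, the time modelled for a 640x480 image (307200 pixels) is dominated by start-up cost, so scaling it to a 1920x1080 image (2073600 pixels) predicts a much longer run than the model itself gives for the larger image.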
Check each variable to see what range it requires.
Using smaller data types has several advantages:
If accuracy is not critical, instead of an int, see if a char works in its place.
For example, if you add two relatively small numbers you probably do not require an int. However, check in case an overflow occurs.
Only use float values if you require their additional range. For example, if you require very small or very large numbers.
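The range check described above can be sketched in plain C. The helper names are hypothetical, and the example assumes the values are signed and that an 8-bit char is the candidate replacement for int.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Returns 1 if a + b still fits in a signed 8-bit value, 0 if storing
 * the sum in a signed char would overflow. Both inputs must already be
 * in signed char range; the sum is computed in int so the check itself
 * cannot overflow. */
int sum_fits_in_schar(int a, int b)
{
    assert(a >= SCHAR_MIN && a <= SCHAR_MAX);
    assert(b >= SCHAR_MIN && b <= SCHAR_MAX);
    int sum = a + b;
    return sum >= SCHAR_MIN && sum <= SCHAR_MAX;
}

/* Bytes needed to store n elements of the given element size. */
size_t bytes_for(size_t n, size_t elem_size)
{
    return n * elem_size;
}
```

For instance, 50 + 60 fits in a signed char but 100 + 100 does not, so the second case needs a wider type. Comparing bytes_for(n, sizeof(char)) with bytes_for(n, sizeof(int)) shows how much memory traffic the narrower type saves on a given platform.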
You can store image and other data as images or as buffers:
Merging multiple buffers into a single buffer as an optimization is unlikely to provide a performance benefit.
For example, if you have two input buffers you can merge them into a single buffer and use offsets to compute addresses of data. However, this means that every kernel must perform the offset calculations.
It is better to use two buffers and pass the addresses to the kernel as a pair of kernel arguments.
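The difference between the two layouts can be modelled in plain C. This is not OpenCL C: each function stands in for the body of a kernel executed by one work-item, gid plays the role of the global ID, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Merged layout: both inputs live in one buffer, so every work-item
 * must add the offset of the second input to its index. */
float add_merged(const float *merged, size_t offset_b, size_t gid)
{
    assert(merged != NULL);
    return merged[gid] + merged[offset_b + gid];
}

/* Separate layout: each input is passed as its own argument, so the
 * work-item indexes both directly with no offset arithmetic. */
float add_separate(const float *a, const float *b, size_t gid)
{
    assert(a != NULL && b != NULL);
    return a[gid] + b[gid];
}
```

Both versions compute the same result. The separate layout simply drops the per-work-item offset addition and the extra argument needed to carry the offset.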
If possible, use asynchronous operations between the control threads and OpenCL threads. For example:
clFinish() at the end if possible.
Call clFlush() after one or more clEnqueueNDRangeKernel() calls, and call clFinish() before checking the final result.
Avoid clWaitForEvents() or any other blocking calls in the submission thread.
clImportMemoryARM in separate kernels.