8.1 The optimization process for OpenCL applications

To optimize your application, first identify its most computationally intensive parts. In an OpenCL application, that means identifying the kernels that take the most time.

To identify the most computationally intensive kernels, you must individually measure the time taken by each kernel:
Measure individual kernels
Go through your kernels one at a time and:
  1. Measure the time it takes for several runs.
  2. Average the results.


Measure the run time of each kernel individually to get accurate measurements.
Before you start timing, do a dummy run of the kernel to ensure that the memory is allocated. Keep this dummy run outside of your timing loop.
The allocation of some buffers is, in certain cases, delayed until the first time they are used. This can make the first kernel run slower than subsequent runs.
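The measurement steps above can be sketched as a small host-side helper. This is a minimal sketch, not a definitive implementation: `run_kernel_fn`, `average_kernel_time_ns`, and `simulated_kernel` are hypothetical names introduced for illustration. In a real host program the callback might enqueue the kernel, wait for it to finish, and read the event timestamps with `clGetEventProfilingInfo`. Here a simulated callback stands in so that the sketch does not depend on an OpenCL installation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: runs one kernel to completion and returns
 * the elapsed time in nanoseconds. */
typedef uint64_t (*run_kernel_fn)(void *user_data);

/* Returns the mean time, in nanoseconds, of `runs` timed executions.
 * One untimed dummy run is done first so that any delayed buffer
 * allocation happens outside the timing loop. */
double average_kernel_time_ns(run_kernel_fn run_once, void *user_data,
                              size_t runs)
{
    assert(run_once != NULL);
    assert(runs > 0);

    (void)run_once(user_data);          /* dummy run, result discarded */

    uint64_t total_ns = 0;
    for (size_t i = 0; i < runs; ++i) {
        total_ns += run_once(user_data);
    }
    return (double)total_ns / (double)runs;
}

/* Stand-in for a real kernel launch (an assumption for illustration):
 * the first call is slow, as if buffers were allocated on first use.
 * `user_data` points to an int that counts the calls. */
uint64_t simulated_kernel(void *user_data)
{
    int *calls = (int *)user_data;
    *calls += 1;
    return (*calls == 1) ? 5000u : 1000u;
}
```

Because the slow first call is absorbed by the dummy run, the average reflects only the steady-state runs.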
Select the kernels that take the most time
Select the kernels that have the longest run time and optimize these. Optimizing other kernels has little impact on overall performance.
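One way to make this selection is to sort the averaged timings and look at each kernel's share of the total. The following is a sketch under stated assumptions: the `kernel_timing` structure and the `rank_kernels` and `share_of_total` helpers are hypothetical, and the timings are expected to come from your own measurements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record of one kernel's averaged run time. */
struct kernel_timing {
    const char *name;   /* kernel name, for reporting only */
    double avg_ns;      /* average run time in nanoseconds */
};

/* qsort comparator: longest average run time first. */
static int by_time_desc(const void *a, const void *b)
{
    const struct kernel_timing *x = a;
    const struct kernel_timing *y = b;
    if (x->avg_ns < y->avg_ns) return 1;
    if (x->avg_ns > y->avg_ns) return -1;
    return 0;
}

/* Sorts the timings in place so that the slowest kernel is at index 0. */
void rank_kernels(struct kernel_timing *timings, size_t count)
{
    assert(timings != NULL);
    qsort(timings, count, sizeof timings[0], by_time_desc);
}

/* Returns the fraction (0.0 to 1.0) of the summed kernel time that is
 * spent in the kernel at `index`. */
double share_of_total(const struct kernel_timing *timings, size_t count,
                      size_t index)
{
    assert(timings != NULL && index < count);
    double total = 0.0;
    for (size_t i = 0; i < count; ++i) {
        total += timings[i].avg_ns;
    }
    assert(total > 0.0);
    return timings[index].avg_ns / total;
}
```

A kernel that accounts for only a few percent of the total is unlikely to repay optimization effort, whereas the kernels at the top of the ranking are the ones to analyze next.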
Analyze the kernels
Analyze the kernels to see if they contain computationally expensive operations:
  • Count the reads and writes in the kernel. For high performance, do as many computations per memory access as possible.
  • For Mali™ GPUs, you can use the Off-line Shader Compiler to check the balance between the different pipelines.
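The ratio of computations to memory accesses can be estimated by hand from the kernel source. The helper below is a sketch only: `ops_per_access` is a hypothetical name, and the counts passed to it are assumed to be manual counts per work-item, not values reported by any tool.

```c
#include <assert.h>

/* Returns the number of arithmetic operations per memory access for one
 * work-item, given hand-counted totals from the kernel source. */
double ops_per_access(unsigned arithmetic_ops, unsigned loads,
                      unsigned stores)
{
    unsigned accesses = loads + stores;
    assert(accesses > 0);
    return (double)arithmetic_ops / (double)accesses;
}
```

For example, a kernel that computes `c[i] = a[i] + b[i]` performs one addition for two loads and one store, a ratio of about 0.33. A kernel with such a low ratio is likely to be limited by memory accesses rather than by arithmetic.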
Measure individual parts of the kernel
If you cannot determine the compute-intensive part of the kernel by analysis, you can isolate it by measuring different parts of the kernel individually.
You can do this by removing different code blocks and measuring the performance difference each time.
The section of code that takes the most time is the most intensive. Consider how this code can be rewritten.
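The removal method can be summarized as a comparison of run times. The following sketch assumes that you have measured the full kernel once and then measured it again with each code block removed in turn. The function name `most_expensive_block` is hypothetical. Treat the result as an estimate, because removing a block can also allow the compiler to discard other code that depended on it.

```c
#include <assert.h>
#include <stddef.h>

/* Given the run time of the full kernel and the run times measured with
 * each code block removed in turn, returns the index of the block whose
 * removal saved the most time. */
size_t most_expensive_block(double full_ns, const double *without_ns,
                            size_t count)
{
    assert(without_ns != NULL && count > 0);

    size_t best = 0;
    double best_saving = full_ns - without_ns[0];
    for (size_t i = 1; i < count; ++i) {
        double saving = full_ns - without_ns[i];
        if (saving > best_saving) {
            best_saving = saving;
            best = i;
        }
    }
    return best;
}
```

If the full kernel takes 1000 ns and removing blocks 0, 1, and 2 gives 900 ns, 400 ns, and 850 ns respectively, block 1 accounts for the largest saving and is the first candidate for rewriting.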
Non-Confidential ARM 100614_0300_00_en
Copyright © 2012, 2013, 2015, 2016 ARM. All rights reserved.