8.1 The optimization process for OpenCL applications
To optimize your application, you must first identify the most computationally intensive parts of your application. In an OpenCL application that means identifying the kernels that take the most time.
To identify the most computationally intensive kernels, you
must individually measure the time taken by each kernel:
- Measure individual kernels
Go through your kernels one at a time and:
- Measure the time it takes for several runs.
- Average the results.
It is important that you measure the run times of the individual
kernels to get accurate measurements.
Before you start timing, do a dummy run of the kernel to ensure that
the memory is allocated. Keep this dummy run outside of your timing loop.
In some cases the allocation of buffers is delayed until the first
time they are used, which can make the first kernel run slower
than subsequent runs.
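The measurement procedure above can be sketched as a small host-side timing helper. This is a minimal sketch, not a definitive implementation: `run_kernel_fn`, `average_run_ms` and `count_run` are hypothetical names, and the callback is assumed to run one kernel to completion (in an OpenCL host program, typically by enqueuing the kernel with `clEnqueueNDRangeKernel` and then blocking with `clFinish`). It uses wall-clock time from C11 `timespec_get`, which is adequate for a sketch but is not a monotonic clock.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: runs ONE kernel to completion.
 * Assumption: in an OpenCL host program the callback enqueues the kernel
 * and then waits for it to finish, so the whole run is inside the timing. */
typedef void (*run_kernel_fn)(void *user_data);

/* Current wall-clock time in milliseconds (C11 timespec_get). */
static double now_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1000.0 + (double)ts.tv_nsec / 1.0e6;
}

/* Returns the average time, in milliseconds, of `runs` timed runs. */
double average_run_ms(run_kernel_fn run, void *user_data, int runs)
{
    assert(run != NULL && runs > 0);

    /* Dummy run, deliberately outside the timing loop, so that any
     * delayed buffer allocation does not distort the measurements. */
    run(user_data);

    double total_ms = 0.0;
    for (int i = 0; i < runs; ++i) {
        double start = now_ms();
        run(user_data);
        total_ms += now_ms() - start;
    }
    return total_ms / runs;
}

/* Stand-in workload for illustration only: counts how often it was run. */
void count_run(void *user_data)
{
    ++*(int *)user_data;
}
```

Calling `average_run_ms(count_run, &calls, 5)` runs the callback six times: one untimed dummy run followed by five timed runs. Replace `count_run` with a callback that launches the kernel you want to measure.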
- Select the kernels that take the most time
- Select the kernels that have the longest run times
and optimize these. Optimizing any other kernels has little impact
on overall performance.
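Once you have an average time for each kernel, the selection step amounts to sorting the measurements and looking at each kernel's share of the total. The following is a minimal sketch of that bookkeeping; the `kernel_time` record and the function names are hypothetical, and the values you put in `avg_ms` are assumed to come from your own measurements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record: one entry per kernel, filled in from measurements. */
typedef struct {
    const char *name;   /* kernel name, for reporting */
    double avg_ms;      /* measured average run time in milliseconds */
} kernel_time;

/* qsort comparator: longest average time first. */
static int by_time_desc(const void *a, const void *b)
{
    const kernel_time *ka = a;
    const kernel_time *kb = b;
    return (ka->avg_ms < kb->avg_ms) - (ka->avg_ms > kb->avg_ms);
}

/* Sorts the measurements so the slowest kernel is at index 0. */
void sort_by_time_desc(kernel_time *k, size_t n)
{
    qsort(k, n, sizeof *k, by_time_desc);
}

/* Fraction (0.0 to 1.0) of the summed kernel time spent in kernel i. */
double time_share(const kernel_time *k, size_t n, size_t i)
{
    assert(k != NULL && i < n);
    double total = 0.0;
    for (size_t j = 0; j < n; ++j)
        total += k[j].avg_ms;
    return total > 0.0 ? k[i].avg_ms / total : 0.0;
}
```

For example, with made-up measurements of 1.0 ms, 6.0 ms and 1.0 ms, the 6.0 ms kernel sorts to the front and accounts for 0.75 of the total. That is the kernel to optimize first.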
- Analyze the kernels
Analyze the kernels to see if they contain computationally expensive
operations:
- Check how many reads and writes there are in the kernel. For high performance,
do as many computations per memory access as possible.
- For Mali™ GPUs, you can use the Off-line Shader Compiler
to check the balance between the different pipelines.
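As a rough illustration of "computations per memory access", you can count the arithmetic operations and the loads and stores that one work-item performs and take their ratio. The helper below is a hypothetical sketch. The counts are assumed to be made by hand from the kernel source, so they are only an estimate of what the compiled kernel actually does.

```c
#include <assert.h>

/* Ratio of arithmetic operations to memory accesses for one work-item.
 * Assumption: all three counts are hand-counted from the kernel source. */
double ops_per_access(unsigned arith_ops, unsigned loads, unsigned stores)
{
    unsigned accesses = loads + stores;
    assert(accesses > 0);
    return (double)arith_ops / (double)accesses;
}
```

For example, a kernel whose body is `c[i] = a[i] + b[i]` performs 1 operation for 2 loads and 1 store, a ratio of about 0.33. A three-tap filter such as `out[i] = 0.25f*in[i-1] + 0.5f*in[i] + 0.25f*in[i+1]` performs 5 operations for 3 loads and 1 store, a ratio of 1.25. A higher ratio means more computation per memory access.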
- Measure individual parts of the kernel
If you cannot determine the compute-intensive part
of the kernel by analysis, you can isolate it by measuring different
parts of the kernel individually.
You can do this by removing different code blocks and measuring
the performance difference each time.
The section of code that takes the most time is the most intensive. Consider
how this code can be rewritten.
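The removal experiment produces one timing for the complete kernel and one timing for each variant with a code block removed. The sketch below shows one way to turn those numbers into an answer. The function name and the array layout are hypothetical, and the timings are assumed to be averages that you measured yourself. Treat the result as an approximation: removing a block can also let the compiler drop other code that depended on it, so the savings do not always add up exactly.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the code block whose removal saved the most time.
 *   full_ms       average time of the complete kernel
 *   without_ms[i] average time of the kernel with block i removed
 *   n             number of blocks that were tried */
size_t most_expensive_block(double full_ms, const double *without_ms, size_t n)
{
    assert(without_ms != NULL && n > 0);

    size_t best = 0;
    double best_saving = full_ms - without_ms[0];
    for (size_t i = 1; i < n; ++i) {
        double saving = full_ms - without_ms[i];
        if (saving > best_saving) {
            best_saving = saving;
            best = i;
        }
    }
    return best;
}
```

For example, if the full kernel takes 10.0 ms and the variants without blocks 0, 1 and 2 take 9.0 ms, 4.0 ms and 8.5 ms, the function returns 1, because removing block 1 saved 6.0 ms.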