7.3.2 Locate and remove device optimizations
Some optimizations written for other compute devices have no effect on Mali™ GPUs, or can even reduce performance. Before you retune the OpenCL code for Mali GPUs, remove all device-specific optimizations to create a device-independent reference implementation.
Locate and remove the following optimizations:
- Use of local or private memory
Mali GPUs use caches instead of local memories.
The OpenCL local and private memories are mapped into main memory.
There is therefore no performance advantage in using local or private
memories in OpenCL code for Mali GPUs.
You can use local or private memories as temporary storage,
but memory copies to or from the memories are an expensive operation.
Using local or private memories can reduce performance in OpenCL
on Mali GPUs.
Do not use local or private memories as a cache because this
can reduce performance. The processors already contain hardware
caches that perform the same job without the overhead of expensive
copy operations.
Some code copies data into a local or private memory, processes
it, then writes it out again. This code wastes both performance
and power by performing these copies.
- Barriers
Data transfers to or from local or private memories
are typically synchronized with barriers. If you remove copy operations
to or from these memories, also remove the associated barriers.
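As a rough illustration of removing a local memory copy together with its barrier, the sketch below shows a kernel in its reference form, with the device-specific original kept as a comment. The kernel name `scale`, its parameters, the tile size, and the helper `run_scale` are assumptions made for this example. The shim macros and the `get_global_id()` stub exist only so that the sketch compiles as plain C on a host; real OpenCL C provides them and the shim section would be dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Host-only shims (hypothetical) so the kernel body compiles as plain C.
   In real OpenCL C, delete this section: the address space qualifiers and
   get_global_id() are built in. */
#define kernel
#define global
static size_t current_id; /* index of the emulated work-item */
static size_t get_global_id(unsigned dim) { (void)dim; return current_id; }

/* Device-specific original, shown for comparison only (not compiled):
 *
 *   local float tile[64];
 *   tile[lid] = in[gid];              // copy into local memory
 *   barrier(CLK_LOCAL_MEM_FENCE);     // synchronize the work-group
 *   out[gid] = tile[lid] * factor;    // process from local memory
 *
 * Reference version: read global memory directly and rely on the hardware
 * caches. The copy and its associated barrier are both removed. */
kernel void scale(global const float *in, global float *out, float factor)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * factor;
}

/* Host-only emulation of a one-dimensional launch over n work-items. */
static void run_scale(const float *in, float *out, float factor, size_t n)
{
    assert(in != NULL && out != NULL);
    for (current_id = 0; current_id < n; ++current_id)
        scale(in, out, factor);
}
```

The reference kernel produces the same results as the staged version, so it can serve as the baseline when you later retune for Mali GPUs.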
- Cache size optimizations
Some code optimizes reads and writes to ensure data
fits into cache lines. This is a very useful optimization for both
increasing performance and reducing power consumption. However,
the code is likely to be optimized for cache line sizes that are
different than those used by Mali GPUs.
If the code is optimized for the wrong cache line size, there
might be unnecessary cache flushes and this can decrease performance.
Mali GPUs have a 64-byte cache line size.
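When locating cache size optimizations, it can help to keep the line size in one named constant instead of scattering literal values through the code. The following is a minimal sketch under that assumption. The names `MALI_CACHE_LINE_BYTES`, `pad_to_cache_line`, and `elements_per_line` are hypothetical helpers for illustration, not part of OpenCL or any Mali API.

```c
#include <assert.h>
#include <stddef.h>

/* Cache line size stated in the text for Mali GPUs. */
#define MALI_CACHE_LINE_BYTES 64u

/* Hypothetical helper: round a row size in bytes up to the next multiple of
   the cache line size. If the buffer itself starts on a 64-byte boundary,
   every padded row then also starts on a line boundary. */
static size_t pad_to_cache_line(size_t row_bytes)
{
    return (row_bytes + MALI_CACHE_LINE_BYTES - 1u)
           / MALI_CACHE_LINE_BYTES * MALI_CACHE_LINE_BYTES;
}

/* Hypothetical helper: how many elements of a given size fit in one line. */
static size_t elements_per_line(size_t element_bytes)
{
    assert(element_bytes > 0 && element_bytes <= MALI_CACHE_LINE_BYTES);
    return MALI_CACHE_LINE_BYTES / element_bytes;
}
```

For example, a 100-byte row pads to 128 bytes, and one line holds 16 values of type `float`. Code tuned for a different line size typically hard-codes other values in the same places, which makes those places easy to find and replace.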
- Use of scalars
Some GPUs work with scalars, whereas Mali GPUs use scalars
and 128-bit vectors. Vectors process multiple elements simultaneously, enabling higher
performance.
- Modifications for memory bank conflicts
Some GPUs include per-warp memory banks. If the
code includes optimizations to avoid conflicts in these memory banks,
remove them.
- Optimizations for divergent threads, warps, or wavefronts
Some GPU architectures group work-items together
into what are called warps or wavefronts. All the work-items in
a warp must proceed in lock-step together in these architectures
and this means branches can perform badly.
Threads on Mali GPUs are independent and can diverge without
any performance impact. If your code contains optimizations or workarounds
for divergent threads in warps or wavefronts, remove them.
Mali GPUs do not use warps or wavefronts.
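One common divergence workaround is to replace a branch with arithmetic that evaluates both paths for every work-item. The sketch below shows, as an assumption about what such code might look like, the workaround in a comment and the plain branch that replaces it in the reference implementation. The kernel name `clamp_or_square`, its parameters, and the helper `run_clamp_or_square` are hypothetical. The shim macros and the `get_global_id()` stub only let the sketch compile as plain C on a host and are not needed in real OpenCL C.

```c
#include <assert.h>
#include <stddef.h>

/* Host-only shims (hypothetical) so the kernel body compiles as plain C.
   In real OpenCL C, delete this section. */
#define kernel
#define global
static size_t current_id; /* index of the emulated work-item */
static size_t get_global_id(unsigned dim) { (void)dim; return current_id; }

/* Divergence workaround, shown for comparison only (not compiled):
 *
 *   float is_big = (float)(x > limit);   // 1.0f or 0.0f
 *   out[gid] = is_big * limit + (1.0f - is_big) * (x * x);
 *
 * Both paths are always computed so that work-items never branch apart.
 * Reference version: use an ordinary branch, so each work-item computes
 * only the path it needs. */
kernel void clamp_or_square(global const float *in, global float *out,
                            float limit)
{
    size_t gid = get_global_id(0);
    float x = in[gid];
    if (x > limit)
        out[gid] = limit;  /* values above the limit are clamped */
    else
        out[gid] = x * x;  /* all other values are squared */
}

/* Host-only emulation of a one-dimensional launch over n work-items. */
static void run_clamp_or_square(const float *in, float *out, float limit,
                                size_t n)
{
    assert(in != NULL && out != NULL);
    for (current_id = 0; current_id < n; ++current_id)
        clamp_or_square(in, out, limit);
}
```

The branching form is also easier to read, which helps when the reference implementation is later used as the starting point for Mali-specific tuning.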