7.3.2 Locate and remove device optimizations

Some optimizations written for other compute devices have no effect on Mali™ GPUs, or can reduce performance. To retune the OpenCL code for Mali GPUs, first remove all device-specific optimizations to create a device-independent reference implementation.

These optimizations are the following:
Use of local or private memory
Mali GPUs use caches instead of local memories. The OpenCL local and private memories are mapped into main memory. There is therefore no performance advantage in using local or private memories in OpenCL code for Mali GPUs.
You can use local or private memories as temporary storage, but memory copies to or from these memories are expensive operations, so using them can reduce performance in OpenCL on Mali GPUs.
Do not use local or private memories as a cache because this can reduce performance. The processors already contain hardware caches that perform the same job without the overhead of expensive copy operations.
Some code copies data into a local or private memory, processes it, then writes it out again. On Mali GPUs, these copies waste both time and power.
Data transfers to or from local or private memories are typically synchronized with barriers. If you remove copy operations to or from these memories, also remove the associated barriers.
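As a rough illustration of the copy-process-write pattern described above, the following sketch models both forms in plain C so that they can be compiled and compared on a host. It is not taken from any real kernel: the names `scale_with_staging`, `scale_direct`, and `TILE` are hypothetical, and the `tile` array only stands in for what would be a `__local` buffer in OpenCL C, with `barrier(CLK_LOCAL_MEM_FENCE)` calls between the phases.

```c
#include <stddef.h>

#define TILE 64  /* hypothetical work-group size, chosen for illustration */

/* Pattern tuned for a device with fast local memory:
   stage a tile, process it, then copy it back out. */
void scale_with_staging(const float *in, float *out, size_t n, float k)
{
    float tile[TILE];  /* stands in for a __local buffer */

    for (size_t base = 0; base < n; base += TILE) {
        size_t count = (n - base < TILE) ? (n - base) : TILE;

        for (size_t i = 0; i < count; ++i)  /* copy in */
            tile[i] = in[base + i];
        /* OpenCL C would need barrier(CLK_LOCAL_MEM_FENCE) here */

        for (size_t i = 0; i < count; ++i)  /* process */
            tile[i] *= k;
        /* ...and another barrier here */

        for (size_t i = 0; i < count; ++i)  /* copy out */
            out[base + i] = tile[i];
    }
}

/* Reference form: read, compute, and write directly.
   No staging buffer, no copies, and therefore no barriers. */
void scale_direct(const float *in, float *out, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * k;
}
```

Both functions produce the same output. The second is the kind of reference implementation this step aims for: the staging buffer, the two copy loops, and the associated barriers are all gone.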
Cache size optimizations
Some code optimizes reads and writes to ensure data fits into cache lines. This is a very useful optimization for both increasing performance and reducing power consumption. However, the code is likely to be optimized for cache line sizes that are different from those used by Mali GPUs.
If the code is optimized for the wrong cache line size, this can cause unnecessary cache flushes and decrease performance.


Note: Mali GPUs have a 64-byte cache line size.
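When locating cache size optimizations, it can help to replace hard-coded sizes with values derived from one named constant. The following is a minimal sketch in plain C, assuming the 64-byte line size stated above. The macro and function names are hypothetical and not part of any OpenCL or Mali API.

```c
#include <stddef.h>

#define MALI_CACHE_LINE_BYTES 64  /* line size stated in the text above */

/* Number of elements of the given size that fit in one cache line.
   elem_size must be non-zero. */
size_t elems_per_cache_line(size_t elem_size)
{
    return MALI_CACHE_LINE_BYTES / elem_size;
}

/* Round a byte count up to a whole number of cache lines. */
size_t round_up_to_cache_line(size_t bytes)
{
    return (bytes + MALI_CACHE_LINE_BYTES - 1)
           / MALI_CACHE_LINE_BYTES * MALI_CACHE_LINE_BYTES;
}
```

For example, sixteen 4-byte values fill one line, and a 100-byte row rounds up to 128 bytes. Block sizes or row strides in the original code that assume a different line size are candidates for removal or recalculation.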
Use of scalars
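Some GPUs work only with scalars, whereas Mali GPUs can use both scalars and 128-bit vectors. Vector operations process multiple elements simultaneously, enabling higher data throughput. Code that was written as scalar-only for another GPU therefore does not use the full width available.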
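To make the difference in shape concrete, the sketch below models a scalar loop and a four-wide form of the same addition in plain C. The type `vec4f` is a hypothetical stand-in for the OpenCL C `float4` type (4 x 32 bits = 128 bits), and the inner lane loops only model what `vload4`, a single `float4` addition, and `vstore4` would do in a kernel. A host compiler is not guaranteed to vectorize this code, so treat it as an illustration of structure rather than a performance claim.

```c
#include <stddef.h>

/* Hypothetical stand-in for OpenCL C float4: 4 x 32-bit lanes = 128 bits. */
typedef struct { float s[4]; } vec4f;

/* Scalar form: one element per step. */
void add_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* Vector-shaped form: four elements per step, scalar loop for the tail. */
void add_vec4(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        vec4f va, vb, vr;

        for (int lane = 0; lane < 4; ++lane) {  /* models vload4 */
            va.s[lane] = a[i + lane];
            vb.s[lane] = b[i + lane];
        }
        for (int lane = 0; lane < 4; ++lane)    /* models one float4 add */
            vr.s[lane] = va.s[lane] + vb.s[lane];
        for (int lane = 0; lane < 4; ++lane)    /* models vstore4 */
            out[i + lane] = vr.s[lane];
    }

    for (; i < n; ++i)                          /* leftover elements */
        out[i] = a[i] + b[i];
}
```

Both functions give identical results for any length, including lengths that are not a multiple of four.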
Modifications for memory bank conflicts
Some GPUs include per-warp memory banks. If the code includes optimizations to avoid conflicts in these memory banks, remove them.
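One widely used way of avoiding bank conflicts is to pad each row of a tile by one element so that consecutive rows start in different banks. The sketch below, in plain C with hypothetical names (`TILE_DIM`, `padded_index`, `plain_index`), shows what such a modification tends to look like and the unpadded indexing that would replace it. It assumes a square tile, and the real code being ported may use a different scheme.

```c
#include <stddef.h>

#define TILE_DIM 16  /* hypothetical tile width, chosen for illustration */

/* Indexing with one padding column per row, as written to stagger
   rows across memory banks on another device. */
size_t padded_index(size_t row, size_t col)
{
    return row * (TILE_DIM + 1) + col;
}

/* Plain indexing for the reference implementation: no padding. */
size_t plain_index(size_t row, size_t col)
{
    return row * TILE_DIM + col;
}

/* Storage needed for a full tile under each scheme, in elements. */
size_t padded_tile_elems(void) { return (size_t)TILE_DIM * (TILE_DIM + 1); }
size_t plain_tile_elems(void)  { return (size_t)TILE_DIM * TILE_DIM; }
```

Signs of this kind of optimization include array dimensions of the form `N + 1` and index arithmetic that skips a column. Removing the padding restores contiguous rows and reduces the storage used.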
Optimizations for divergent threads, warps, or wavefronts
Some GPU architectures group work-items together into units called warps or wavefronts. On these architectures, all the work-items in a warp must proceed in lock-step, which means that branches can perform badly.
Threads on Mali GPUs are independent and can diverge without any performance impact. If your code contains optimizations or workarounds for divergent threads in warps or wavefronts, remove them.


Note: Mali GPUs do not use warps or wavefronts.
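A typical workaround for divergence is to compute both sides of a branch and blend the results arithmetically, so that every work-item executes the same instructions. The following plain C sketch shows such a workaround next to the straightforward branch it replaces. The function names and the computation itself are hypothetical, chosen only to illustrate the pattern.

```c
/* Workaround form: evaluate both paths, then blend with a 0/1 mask
   so that no branch is taken. */
float select_branchless(float x)
{
    float mask = (float)(x > 0.0f);  /* 1.0f when x > 0, else 0.0f */
    float if_positive = x * 2.0f;
    float otherwise   = -x;

    return mask * if_positive + (1.0f - mask) * otherwise;
}

/* Reference form: a plain branch that evaluates only the path needed. */
float select_branching(float x)
{
    if (x > 0.0f)
        return x * 2.0f;
    return -x;
}
```

For finite inputs both functions return equal values, but the branchless form always does the work of both paths. For the reference implementation, the plain branch is the simpler starting point. Whether it is actually faster in a given kernel is best confirmed by measurement.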
Non-Confidential ARM 100614_0300_00_en
Copyright © 2012, 2013, 2015, 2016 ARM. All rights reserved.