4.4.4 About the Mali GPU pipelines
Mali GPUs contain three types of processing pipeline:

The Arithmetic pipeline.

The Load/Store pipeline.

The Texture pipeline.
The pipelines all run in parallel. Your shaders typically use all three types
of pipeline.
The Mali Offline Shader Compiler reports the number of cycles that a shader uses in each
pipeline. Because the pipelines run in parallel, the pipeline with the highest cycle count
limits the performance of the shader. Target your optimizations at that pipeline.
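As a minimal sketch of this selection step, the following C helper takes three cycle counts and reports which pipeline is the bottleneck. The `PipelineCycles` struct, the `bottleneck_pipeline` function, and the numbers in the usage note are hypothetical. They do not reproduce the actual output format of the Mali Offline Shader Compiler.

```c
/* Hypothetical container for per-pipeline cycle counts.
 * The values are assumed to be copied by hand from the compiler report. */
typedef struct {
    double arithmetic;
    double load_store;
    double texture;
} PipelineCycles;

typedef enum { PIPE_ARITHMETIC, PIPE_LOAD_STORE, PIPE_TEXTURE } Pipeline;

/* Return the pipeline with the highest cycle count.
 * Ties resolve in the order Arithmetic, Load/Store, Texture. */
Pipeline bottleneck_pipeline(PipelineCycles c)
{
    Pipeline worst = PIPE_ARITHMETIC;
    double cycles = c.arithmetic;
    if (c.load_store > cycles) { worst = PIPE_LOAD_STORE; cycles = c.load_store; }
    if (c.texture > cycles)    { worst = PIPE_TEXTURE; }
    return worst;
}
```

For example, with illustrative counts of 12 Arithmetic, 5 Load/Store, and 3 Texture cycles, `bottleneck_pipeline` returns `PIPE_ARITHMETIC`, so the Arithmetic pipeline techniques are the ones to try first.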
The Arithmetic pipeline
All arithmetic operations consume cycles in the Arithmetic pipeline.
You can reduce Arithmetic pipeline usage in the following ways:

Avoid using complex arithmetic such as:

For integer operands, use operations such as shifts and bitwise masks to
compute divisions, modulo, and multiplications by powers of two.

Use transpose instead of inverse for orthogonal matrices.

To avoid computing the transpose, switch the order of operands in a
matrix-vector or matrix-matrix multiplication if one of the matrices is
transposed. For example:
Transpose(A) * Vector == Vector * A.
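The integer and transpose techniques above can be sketched in C as follows. This is an illustration on the CPU rather than shader code, and all function names are hypothetical. The shift and mask identities assume unsigned operands and power-of-two constants, and the matrices are assumed to be 3x3 and stored row-major in a flat array.

```c
#include <stdint.h>

/* Power-of-two arithmetic on unsigned integers using shifts and masks. */
static inline uint32_t div_by_8(uint32_t x) { return x >> 3; }  /* x / 8 */
static inline uint32_t mod_by_8(uint32_t x) { return x & 7u; }  /* x % 8 */
static inline uint32_t mul_by_8(uint32_t x) { return x << 3; }  /* x * 8 */

/* Column-vector product: out = M * v, with M stored row-major. */
void mat3_mul_vec(const float m[9], const float v[3], float out[3])
{
    for (int r = 0; r < 3; ++r)
        out[r] = m[r * 3 + 0] * v[0] + m[r * 3 + 1] * v[1] + m[r * 3 + 2] * v[2];
}

/* Row-vector product: out = v * M. This equals Transpose(M) * v,
 * so no transposed copy of M is built. */
void vec_mul_mat3(const float v[3], const float m[9], float out[3])
{
    for (int c = 0; c < 3; ++c)
        out[c] = v[0] * m[0 * 3 + c] + v[1] * m[1 * 3 + c] + v[2] * m[2 * 3 + c];
}
```

Calling `vec_mul_mat3(v, a, out)` gives the same result as first building the transpose of `a` and then calling `mat3_mul_vec`, without the extra work of the transpose.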
You can also reduce load on the Arithmetic pipeline by moving load to the other
pipelines:

Pass matrices as uniforms instead of computing them. This uses the
Load/Store pipeline.

Use a texture to store a set of precomputed values that represent a function
such as sine or cosine. This moves the load to the Texture pipeline.
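The precomputed-table idea can be sketched on the CPU as follows. The table holds sine sampled at nine evenly spaced points over a quarter period, as if baked offline. On the GPU this data would be stored in a small texture. The table size, the names, and the interpolation scheme are assumptions for illustration. The interpolation is similar in spirit to linear texture filtering but does not model texel centers exactly.

```c
/* Sine sampled at 9 evenly spaced points over [0, pi/2], baked offline. */
#define SINE_LUT_SIZE 9
static const float sine_lut[SINE_LUT_SIZE] = {
    0.000000f, 0.195090f, 0.382683f, 0.555570f, 0.707107f,
    0.831470f, 0.923880f, 0.980785f, 1.000000f
};

/* Approximate sin(u * pi/2) for u in [0, 1]. Inputs outside the range
 * are clamped, and values between entries are linearly interpolated. */
float sine_lookup(float u)
{
    if (u <= 0.0f) return sine_lut[0];
    if (u >= 1.0f) return sine_lut[SINE_LUT_SIZE - 1];
    float pos = u * (float)(SINE_LUT_SIZE - 1);
    int i = (int)pos;                 /* index of the entry at or below pos */
    float frac = pos - (float)i;      /* position between entry i and i + 1 */
    return sine_lut[i] + frac * (sine_lut[i + 1] - sine_lut[i]);
}
```

The lookup replaces the evaluation of the function with one fetch and one interpolation. A larger table improves accuracy at the cost of more texture memory.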
The Load/Store pipeline
The Load/Store pipeline is used for reading uniforms, writing varyings, and accessing buffers in shaders, such as Uniform Buffer Objects or Shader Storage Buffer Objects.
If your application is Load/Store pipeline bound, try the following
techniques:

Use a texture instead of a buffer object to read data in the
shader.

Compute data using arithmetic operations instead of loading it. This moves the
load to the Arithmetic pipeline.

Compress or reduce uniforms and varyings.
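One possible way to compress uniforms and varyings is to pack several low-precision values into a single 32-bit word. The C sketch below shows the idea on the CPU. It follows the same layout as GLSL pack built-ins such as `packUnorm4x8`, where those are available. The helper names are hypothetical, and the sketch assumes that 8 bits per component is enough precision for your data.

```c
#include <stdint.h>

/* Clamp to [0, 1] and quantize to 8 bits with rounding. */
static uint32_t to_unorm8(float x)
{
    if (x < 0.0f) x = 0.0f;
    if (x > 1.0f) x = 1.0f;
    return (uint32_t)(x * 255.0f + 0.5f);
}

/* Pack four normalized floats into one 32-bit word, x in the low byte. */
uint32_t pack_unorm4x8(float x, float y, float z, float w)
{
    return to_unorm8(x) | (to_unorm8(y) << 8) |
           (to_unorm8(z) << 16) | (to_unorm8(w) << 24);
}

/* Recover one component (0 = x, 1 = y, 2 = z, 3 = w) as a float in [0, 1]. */
float unpack_unorm8(uint32_t packed, unsigned component)
{
    return (float)((packed >> (8u * component)) & 0xFFu) / 255.0f;
}
```

Packing stores four values in the space of one, but unpacking costs arithmetic operations, so it is only worthwhile when the shader is Load/Store pipeline bound.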
The Texture pipeline
Texture accesses use cycles in the Texture pipeline and use memory bandwidth. Using large textures can be detrimental because cache misses are more likely and this can cause multiple threads to stall while waiting for data.
To improve the performance of the Texture pipeline try the following:
 Use mipmaps
 Mipmaps increase the cache hit rate because the GPU selects
the mipmap level with the most suitable resolution based on the variation of
texture coordinates.
 Use texture compression
 Texture compression also reduces memory bandwidth and increases the
cache hit rate. Each compressed block contains more than one texel, so fetching
one block brings several neighboring texels into the cache.
 Avoid trilinear or anisotropic filtering
 Trilinear and anisotropic filtering increase the number of operations required to fetch
texels. Avoid using these techniques unless you absolutely require them.
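To illustrate how mipmap selection relates to the variation of texture coordinates, the following C sketch picks a level from the number of texels that the coordinates advance per pixel. It is a simplified integer approximation with hypothetical function names. Real hardware computes a fractional level of detail from coordinate derivatives, and this sketch does not describe how Mali GPUs implement it.

```c
/* Simplified mip level choice: each level halves the resolution, so the
 * level is floor(log2(texels_per_pixel)), clamped to max_level. */
unsigned mip_level_for_footprint(unsigned texels_per_pixel, unsigned max_level)
{
    unsigned level = 0;
    while (texels_per_pixel > 1u && level < max_level) {
        texels_per_pixel >>= 1;
        ++level;
    }
    return level;
}

/* Width or height of a given mip level, never smaller than one texel. */
unsigned mip_dimension(unsigned base_dimension, unsigned level)
{
    if (level >= 32u) return 1u;      /* avoid an out-of-range shift */
    unsigned d = base_dimension >> level;
    return d > 0u ? d : 1u;
}
```

For example, a texture that is 1024 texels wide and advances 8 texels per pixel is sampled from level 3, which is 128 texels wide. Adjacent pixels then read adjacent texels, which is what makes cache hits more likely.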