ARM® Mali™ GPU OpenGL ES Application Optimization Guide

Version: 2.0

Table of Contents

About this book
Intended audience
Using this book
Additional reading
Feedback on this product
Feedback on content
1. Introduction
1.1. About optimization
1.2. How to use this guide
1.3. The Mali GPU hardware
1.3.1. Tile based rendering
1.3.2. Mali GPU hardware components
1.3.3. The vertex processor
1.3.4. The fragment processors
1.3.5. L2 cache controller
1.4. The graphics pipeline
1.4.1. OpenGL ES Graphics pipeline overview
1.4.2. Initial processing
1.4.3. Per-vertex operations
1.4.4. Rasterization and fragment shading
1.4.5. Blending and framebuffer operations
1.5. Differences between desktop systems and mobile devices
1.6. Differences between mobile renderers
1.6.1. Differences with other mobile GPUs
1.6.2. Differences with software renderers
2. Optimization Checklist
2.1. About the optimization checklist
2.2. The checklist
2.2.1. Check the display settings
2.2.2. Use direct rendering if possible
2.2.3. Use the correct tools with the correct settings
2.2.4. Remove debugging information
2.2.5. Avoid infinite command lists
2.2.6. Avoid calls that stall the graphics pipeline
2.2.7. Do not compile shaders every frame
2.2.8. Use VSYNC
2.2.9. Use graphics assets that are appropriate for your platform
2.2.10. Do not use 24-bit textures
2.2.11. Use mipmapping
2.2.12. Use texture compression
2.2.13. Reduce memory bandwidth usage
2.2.14. Use Vertex Buffer Objects
2.2.15. Ensure your application is not application-processor bound
2.3. Checklist for porting desktop applications to mobile devices
2.4. Check system settings
2.5. Final release checklist
3. The Optimization Process
3.1. The steps in the optimization process
3.1.1. About the optimization process
3.1.2. Take measurements
3.1.3. Locate the bottleneck
3.1.4. Determine the optimization
3.1.5. Apply the optimization
3.1.6. Verify the optimization
3.1.7. Repeat the optimization process
3.2. General optimization advice
3.2.1. Experiment with different approaches
3.2.2. Use frame time instead of FPS for comparisons
3.2.3. Set a computation budget and measure against it
3.2.4. Bottlenecks move between processors
4. Taking Measurements and Locating Bottlenecks
4.1. About taking measurements and locating bottlenecks
4.2. Procedure for taking measurements and locating bottlenecks
4.3. Taking measurements
4.3.1. Take accurate measurements
4.3.2. First Mali GPU counter measurements
4.4. Analyzing graphs
4.5. Locating bottlenecks with DS-5 Streamline
4.5.1. About DS-5 Streamline
4.5.2. GPU counters in DS-5 Streamline
4.5.3. Analyzing graphs in DS-5 Streamline
4.5.4. DS-5 Streamline displaying high fragment processor usage
4.5.5. Zoomed DS-5 Streamline display
4.5.6. DS-5 Streamline displaying list of functions
4.6. Locating bottlenecks with other tools
4.6.1. Taking measurements without analysis tools
4.6.2. Measurements from other Mali GPU tools
4.6.3. Information from debugging tools
4.6.4. Locating problem areas with comparisons
4.6.5. Techniques for locating problem areas with comparisons
4.7. Isolating specific problem areas
4.7.1. Application is application-processor bound
4.7.2. Application is vertex processing bound
4.7.3. Application is fragment processing bound
4.7.4. Determining if memory bandwidth is the problem
4.8. List of optimizations
4.8.1. Application processing optimizations list
4.8.2. API optimizations list
4.8.3. Vertex processing optimizations list
4.8.4. Fragment processing optimizations list
4.8.5. Bandwidth optimizations list
4.8.6. Miscellaneous optimizations list
5. Optimization Workflows
5.1. About optimization workflows
5.2. Measuring the application
5.2.1. Take measurements on real hardware
5.2.2. Taking measurements with DS-5 Streamline
5.2.3. Determining the problem area
5.3. Application-processor bound problems
5.3.1. About application-processor bound problems
5.3.2. Check if the problem is application bound or API bound
5.3.3. Application bound
5.3.4. API bound
5.3.5. Check for too many draw calls
5.3.6. Check usage of VBOs
5.3.7. Check for Pipeline stalls
5.3.8. Check for too many state changes
5.3.9. Other application-processor bound problems
5.4. Vertex-processor bound problems
5.4.1. Check vertex shader time
5.4.2. Check for too many vertices
5.4.3. Check for high PLBU time
5.4.4. Check for culled primitives
5.4.5. Check utilization of VBOs
5.4.6. Other vertex-processor bound problems
5.5. Fragment-processor bound problems
5.5.1. Check for fragment-processor bound problems
5.5.2. Check for fragment shader bound problems
5.6. Bandwidth bound problems
5.6.1. Measure texture cache hit to miss ratio
5.6.2. Check for blitting
5.6.3. Measuring maximum bandwidth
5.6.4. Compare application bandwidth to the maximum bandwidth available
5.6.5. Fragment processor bandwidth bound
5.6.6. Vertex processor bandwidth bound
6. Application Processor Optimizations
6.1. Align data
6.2. Optimize loops
6.3. Use vector instructions
6.4. Use fast data structures
6.5. Consider alternative algorithms and data structures
6.6. Use multiprocessing
7. API Level Optimizations
7.1. Minimize draw calls
7.1.1. About minimizing draw calls
7.1.2. Limitations on combined draw calls
7.1.3. Combining textures in a texture atlas
7.1.4. Combining multiple texture atlases together
7.1.5. Combining text textures in a font atlas
7.2. Minimize state changes
7.3. Ensure the graphics pipeline is kept running
7.3.1. The graphics pipeline
7.3.2. Avoiding calls that stall the graphics pipeline
8. Vertex Processing Optimizations
8.1. Reduce the number of vertices
8.2. Use culling
8.3. Use normal maps to simulate fine geometry
8.4. Use level of detail
9. Fragment Processing Optimizations
9.1. Fragment processor optimizations
9.1.1. Reduce texture bandwidth
9.1.2. Avoid overdraw
9.1.3. Other fragment-processor bound problems
9.2. Fragment shader optimizations
9.2.1. Shorten the shader
9.2.2. Simplify the shader
9.2.3. Reduce the number of branches
9.2.4. Other fragment shader problems
10. Bandwidth Optimizations
10.1. About reducing bandwidth
10.2. Optimize textures
10.2.1. Ensure textures are not too large
10.2.2. Use a texture resolution that fits the object on screen
10.2.3. Use low bit depth textures where possible
10.2.4. Use lower resolution textures if the texture does not contain sharp detail
10.2.5. Textures and lighting maps do not have to be the same size
10.2.6. Reduce the number of textures
10.3. Use mipmapping
10.4. Use texture compression
10.4.1. About texture compression
10.4.2. Suitability of textures for texture compression
10.4.3. Using ETC1 with transparency
10.5. Only use trilinear filtering if necessary
10.6. Reduce bandwidth by avoiding overdraw
10.7. Reduce drawing surfaces with culling
10.8. Reduce bandwidth by utilizing level of detail
11. Miscellaneous Optimizations
11.1. Use approximations
11.1.1. General methods of approximation
11.1.2. Technique specific methods of approximation
11.2. Check the display settings
11.2.1. About display settings
11.2.2. Data conversions caused by incorrect settings
11.2.3. Configuring display settings to avoid conversions
11.2.4. Ensure your application has the correct drawing surface
11.3. Use VSYNC
11.3.1. About VSYNC
11.3.2. Using VSYNC
11.3.3. Potential issues with VSYNC
11.3.4. Triple buffering
11.4. Make use of under-used resources
11.4.1. Use spare resources to increase image quality
11.4.2. Use spare resources to save power
11.4.3. Move operations from the fragment processor to the vertex processor
11.4.4. Move operations from the vertex processor to the fragment processor
11.4.5. Move operations from the application processor to the vertex processor
A. Mali GPU Performance Counters
A.1. Vertex processor performance counters
A.2. Fragment processor performance counters

List of Figures

1.1. Mali-400 MP GPU
1.2. OpenGL ES graphics pipeline flow
3.1. Optimization process steps
3.2. Frame time and FPS
3.3. Frame rate limitations of different system elements
3.4. Ideal application equally limited
4.1. DS-5 Streamline
4.2. DS-5 Streamline counters
4.3. High fragment processor usage
4.4. Zoomed display of high fragment processor usage
4.5. DS-5 function list
5.1. Workflow overview
5.2. Top level workflow
5.3. DS-5 Streamline
5.4. High application processor time workflow
5.5. High vertex processor time workflow
5.6. High fragment processor time workflow
5.7. High fragment shader time workflow
5.8. Bandwidth bound workflow
7.1. API calls with small data payload
7.2. API calls with large data payload
7.3. Texture atlas for sign
7.4. Sign in game
7.5. Texture atlas with multiple signs
7.6. Font atlas
7.7. Graphics pipeline flow with stall
8.1. Section of a world without culling
8.2. Section of a world with culling
8.3. Floor with normal map
8.4. Ceiling with normal map
8.5. Wireframe asteroids with levels of detail
8.6. Asteroids with levels of detail
10.1. Image with mipmap levels
10.2. Mali GPU Texture Compression Tool
10.3. Textures combined from texture atlas to create texture with transparency
10.4. Separate textures combined to create texture with transparency
11.1. Depth of field
11.2. Scene with reflections
11.3. Image display steps
11.4. Screen updates and frame completes
11.5. Screen updates and frame completes with VSYNC
11.6. Screen updates and frame completes with VSYNC reducing frame rate
11.7. Screen updates with triple buffering and VSYNC
11.8. Animated plant

Proprietary Notice

Words and logos marked with ® or ™ are registered trademarks or trademarks of ARM in the EU and other countries, except as otherwise stated below in this proprietary notice. Other brands and names mentioned herein may be the trademarks of their respective owners.

Neither the whole nor any part of the information contained in, or the product described in, this document may be adapted or reproduced in any material form except with the prior written permission of the copyright holder.

The product described in this document is subject to continuous developments and improvements. All particulars of the product and its use contained in this document are given by ARM in good faith. However, all warranties implied or expressed, including but not limited to implied warranties of merchantability, or fitness for purpose, are excluded.

This document is intended only to assist the reader in the use of the product. ARM shall not be liable for any loss or damage arising from the use of any information in this document, or any error or omission in such information, or any incorrect use of the product.

Where the term ARM is used it means “ARM or any of its subsidiaries as appropriate”.

Confidentiality Status

This document is Non-Confidential. The right to use, copy and disclose this document may be subject to license restrictions in accordance with the terms of the agreement entered into by ARM and the party that ARM delivered this document to.

Product Status

The information in this document is final, that is, for a developed product.

Revision History
Revision A    30 March 2011    First release
Revision B    14 May 2013      Second release
Copyright © 2011, 2013 ARM. All rights reserved.
ARM DUI 0555B