ARM® Mali™ GPU OpenGL ES Application Optimization Guide

Version: 3.0


Table of Contents

Preface
About this book
Intended audience
Using this book
Glossary
Conventions
Additional reading
Feedback
Feedback on this product
Feedback on content
1. Introduction
1.1. About optimization
1.2. The Mali GPU hardware
1.2.1. About the Mali GPU families
1.2.2. Utgard architecture hardware
1.2.3. Midgard architecture hardware
1.3. The graphics pipeline
1.3.1. OpenGL ES Graphics pipeline overview
1.3.2. Initial processing
1.3.3. Per-vertex operations
1.3.4. Rasterization and fragment shading
1.3.5. Blending and framebuffer operations
1.4. Differences between desktop systems and mobile devices
1.5. Differences between mobile renderers
1.5.1. Differences with other mobile GPUs
1.5.2. Differences with software renderers
1.6. How to use this guide
2. Optimization Checklist
2.1. About the optimization checklist
2.2. The checklist
2.2.1. Check the display settings
2.2.2. Use direct rendering if possible
2.2.3. Use the correct tools with the correct settings
2.2.4. Remove debugging information
2.2.5. Avoid infinite command lists
2.2.6. Avoid calls that stall the graphics pipeline
2.2.7. Do not compile shaders every frame
2.2.8. Use VSYNC
2.2.9. Use graphics assets that are appropriate for your platform
2.2.10. Do not use 24-bit textures
2.2.11. Use mipmapping
2.2.12. Use texture compression
2.2.13. Reduce memory bandwidth usage
2.2.14. Use Vertex Buffer Objects
2.2.15. Ensure your application is not application-processor bound
2.3. Checklist for porting desktop applications to mobile devices
2.4. Check system settings
2.5. Final release checklist
3. The Optimization Process
3.1. The steps in the optimization process
3.1.1. About the optimization process
3.1.2. Take measurements
3.1.3. Locate the bottleneck
3.1.4. Determine the optimization
3.1.5. Apply the optimization
3.1.6. Verify the optimization
3.1.7. Repeat the optimization process
3.2. General optimization advice
3.2.1. Experiment with different approaches
3.2.2. Use frame time instead of FPS for comparisons
3.2.3. Set a computation budget and measure against it
3.2.4. Bottlenecks move between processors
4. Taking Measurements and Locating Bottlenecks
4.1. About taking measurements and locating bottlenecks
4.2. Procedure for taking measurements and locating bottlenecks
4.3. Taking measurements
4.4. Analyzing graphs
4.5. Locating bottlenecks with DS-5 Streamline
4.5.1. About DS-5 Streamline
4.5.2. GPU counters in DS-5 Streamline
4.5.3. Analyzing graphs in DS-5 Streamline
4.5.4. DS-5 Streamline displaying high fragment processing usage
4.5.5. Zoomed DS-5 Streamline display
4.5.6. DS-5 Streamline displaying list of functions
4.6. Locating bottlenecks with other tools
4.6.1. Taking measurements without analysis tools
4.6.2. Measurements from other Mali GPU tools
4.6.3. Information from debugging tools
4.6.4. Locating problem areas with comparisons
4.6.5. Techniques for locating problem areas with comparisons
4.7. Isolating specific problem areas
4.7.1. Application is application-processor bound
4.7.2. Application is vertex processing bound
4.7.3. Application is fragment processing bound
4.7.4. Determining if memory bandwidth is the problem
4.8. List of optimizations
4.8.1. Application processing optimizations list
4.8.2. API optimizations list
4.8.3. Vertex processing optimizations list
4.8.4. Fragment processing optimizations list
4.8.5. Bandwidth optimizations list
4.8.6. Miscellaneous optimizations list
5. Optimization Workflows
5.1. About optimization workflows
5.1.1. The optimization workflow procedure
5.1.2. Measuring the application
5.1.3. Take measurements on real hardware
5.1.4. Taking measurements with DS-5 Streamline
5.1.5. Determining the problem area
5.2. The initial optimization workflow
5.2.1. Take initial measurements
5.2.2. Determine the problem area
6. Application-Processor Optimization Workflow
6.1. About application-processor bound problems
6.2. Check if the problem is application bound or API bound
6.3. Application bound
6.4. API bound
6.5. Check for too many draw calls
6.6. Check usage of VBOs
6.7. Check for pipeline stalls
6.8. Check for too many state changes
6.9. Other application-processor bound problems
7. Utgard Optimization Workflows
7.1. Utgard architecture vertex processing bound problems
7.1.1. Check vertex shader time
7.1.2. Check for too many vertices
7.1.3. Check for high PLBU time
7.1.4. Check for culled primitives
7.1.5. Check utilization of VBOs
7.1.6. Other vertex processing bound problems
7.2. Utgard architecture fragment-processing bound problems
7.2.1. Check for fragment-processing bound problems
7.2.2. Check for fragment shader bound problems
7.3. Utgard architecture bandwidth bound problems
7.3.1. Measure texture cache hit to miss ratio
7.3.2. Check for blitting
7.3.3. Measuring maximum bandwidth
7.3.4. Compare application bandwidth to the maximum bandwidth available
7.3.5. Fragment processing bandwidth bound
7.3.6. Vertex processing bandwidth bound
8. Midgard Optimization Workflows
8.1. Counters to measure on Midgard architecture Mali GPUs
8.2. Midgard architecture vertex processing bound problems
8.2.1. Check if the application is vertex shader bound
8.2.2. Check for too many vertices
8.2.3. Other vertex processing bound problems
8.3. Midgard architecture fragment-processing bound problems
8.3.1. Check for fragment data processing bound problems
8.3.2. Check for fragment shader bound problems
8.4. Midgard architecture bandwidth bound problems
8.4.1. Measure texture cache hit to miss ratio
8.4.2. Check for blitting
8.4.3. Measuring maximum bandwidth
8.4.4. Compare application bandwidth to the maximum bandwidth available
8.4.5. Midgard fragment processing bandwidth bound
8.4.6. Midgard vertex processing bandwidth bound
9. Application Processor Optimizations
9.1. Align data
9.2. Optimize loops
9.3. Use vector instructions
9.4. Use fast data structures
9.5. Consider alternative algorithms and data structures
9.6. Use multiprocessing
10. API Level Optimizations
10.1. Minimize draw calls
10.1.1. About minimizing draw calls
10.1.2. Limitations on combined draw calls
10.1.3. Combining textures in a texture atlas
10.1.4. Combining multiple texture atlases together
10.1.5. Combining text textures in a font atlas
10.2. Minimize state changes
10.3. Ensure the graphics pipeline is kept running
10.3.1. The graphics pipeline
10.3.2. Avoiding calls that stall the graphics pipeline
11. Vertex Processing Optimizations
11.1. Reduce the number of vertices
11.2. Use culling
11.3. Use normal maps to simulate fine geometry
11.4. Use level of detail
12. Fragment Processing Optimizations
12.1. Fragment processing optimizations
12.1.1. Reduce texture bandwidth
12.1.2. Avoid overdraw
12.1.3. Other fragment processing bound problems
12.2. Fragment shader optimizations
12.2.1. Simplify the shader
12.2.2. Reduce the number of branches
12.2.3. Other fragment shader problems
13. Bandwidth Optimizations
13.1. About reducing bandwidth
13.2. Optimize textures
13.2.1. Ensure textures are not too large
13.2.2. Use a texture resolution that fits the object on screen
13.2.3. Use low bit depth textures where possible
13.2.4. Use lower resolution textures if the texture does not contain sharp detail
13.2.5. Textures and lighting maps do not have to be the same size
13.2.6. Reduce the number of textures
13.3. Use mipmapping
13.4. Use texture compression
13.4.1. About texture compression
13.4.2. Suitability of textures for texture compression
13.4.3. Using ETC1 with transparency
13.5. Only use trilinear filtering if necessary
13.6. Reduce bandwidth by avoiding overdraw
13.7. Reduce drawing surfaces with culling
13.8. Reduce bandwidth by utilizing level of detail
14. Miscellaneous Optimizations
14.1. Use approximations
14.1.1. General methods of approximation
14.1.2. Technique specific methods of approximation
14.2. Check the display settings
14.2.1. About display settings
14.2.2. Data conversions caused by incorrect settings
14.2.3. Configuring display settings to avoid conversions
14.2.4. Ensure your application has the correct drawing surface
14.3. Use VSYNC
14.3.1. About VSYNC
14.3.2. Using VSYNC
14.3.3. Potential issues with VSYNC
14.3.4. Triple buffering
14.4. Make use of under-used resources
14.4.1. Use spare resources to increase image quality
14.4.2. Use spare resources to save power
14.4.3. Move operations from the fragment processing stage to the vertex processing stage
14.4.4. Move operations from the vertex processing stage to the fragment processing stage
14.4.5. Move operations from the application processor to the vertex processing stage
A. Utgard Architecture Performance Counters
A.1. Vertex processor performance counters
A.2. Fragment processor performance counters
B. Midgard Architecture Performance Counters

List of Figures

1.1. Mali-400 MP GPU
1.2. Mali-T600 Series GPU
1.3. OpenGL ES graphics pipeline flow
3.1. Optimization process steps
3.2. Frame time and FPS
3.3. Frame rate limitations of different system elements
3.4. Ideal application equally limited
4.1. DS-5 Streamline
4.2. DS-5 Streamline counters
4.3. High fragment processing usage
4.4. Zoomed display of high fragment processing usage
4.5. DS-5 Streamline function list
5.1. DS-5 Streamline
5.2. Workflow overview
5.3. Top level workflow
6.1. High application processor time workflow
7.1. High vertex processing time workflow
7.2. High fragment processing time workflow
7.3. High fragment shader time workflow
7.4. Bandwidth bound workflow
8.1. Midgard high vertex processing time workflow
8.2. Midgard high fragment processing time workflow
8.3. Midgard high fragment shader time workflow
8.4. Midgard bandwidth bound workflow
10.1. API calls with small data payload
10.2. API calls with large data payload
10.3. Texture atlas for sign
10.4. Sign in game
10.5. Texture atlas with multiple signs
10.6. Font atlas
10.7. Graphics pipeline flow with stall
11.1. Section of a world without culling
11.2. Section of a world with culling
11.3. Floor with normal map
11.4. Ceiling with normal map
11.5. Wireframe asteroids with levels of detail
11.6. Asteroids with levels of detail
13.1. Image with mipmap levels
13.2. Mali GPU Texture Compression Tool
13.3. Textures combined from texture atlas to create texture with transparency
13.4. Separate textures combined to create texture with transparency
14.1. Depth of field
14.2. Scene with reflections
14.3. Image display steps
14.4. Screen updates and frame completes
14.5. Screen updates and frame completes with VSYNC
14.6. Screen updates and frame completes with VSYNC reducing frame rate
14.7. Screen updates with triple buffering and VSYNC
14.8. Animated plant

Proprietary Notice

Words and logos marked with ® or ™ are registered trademarks or trademarks of ARM® in the EU and other countries, except as otherwise stated below in this proprietary notice. Other brands and names mentioned herein may be the trademarks of their respective owners.

Neither the whole nor any part of the information contained in, or the product described in, this document may be adapted or reproduced in any material form except with the prior written permission of the copyright holder.

The product described in this document is subject to continuous developments and improvements. All particulars of the product and its use contained in this document are given by ARM in good faith. However, all warranties implied or expressed, including but not limited to implied warranties of merchantability, or fitness for purpose, are excluded.

This document is intended only to assist the reader in the use of the product. ARM shall not be liable for any loss or damage arising from the use of any information in this document, or any error or omission in such information, or any incorrect use of the product.

Where the term ARM is used it means “ARM or any of its subsidiaries as appropriate”.

Confidentiality Status

This document is Non-Confidential. The right to use, copy and disclose this document may be subject to license restrictions in accordance with the terms of the agreement entered into by ARM and the party that ARM delivered this document to.

Product Status

The information in this document is final, that is, for a developed product.

Revision History
Revision A    30 March 2011      First release
Revision B    14 May 2013        Second release
Revision C    28 October 2013    Third release. Adds support for Midgard architecture Mali GPUs
Copyright © 2011, 2013 ARM. All rights reserved.
ARM DUI 0555C
Non-Confidential
ID102813