ARM® Mali™-T600 Series GPU OpenCL Developer Guide

Version 2.0

Table of Contents

About this book
Product revision status
Intended audience
Using this book
Typographical conventions
Additional reading
Feedback on this product
Feedback on content
1. Introduction
1.1. About GPU compute
1.2. About OpenCL
1.3. About the Mali-T600 Series GPU Linux OpenCL driver
1.4. About the Mali OpenCL SDK
2. Parallel Processing Concepts
2.1. Types of parallelism
2.2. Concurrency
2.3. Limitations of parallel processing
2.4. Embarrassingly parallel applications
2.5. Mixing different types of parallelism
3. OpenCL Concepts
3.1. About OpenCL
3.2. OpenCL applications
3.3. OpenCL execution model
3.4. OpenCL data processing
3.4.1. Work-items and the NDRange
3.4.2. OpenCL work-groups
3.4.3. Identifiers in OpenCL
3.5. The OpenCL memory model
3.6. The Mali GPU memory model
3.7. OpenCL concepts summary
4. Developing an OpenCL Application
4.1. Software and hardware required for OpenCL development
4.2. Development stages
5. Execution Stages of an OpenCL Application
5.1. About the execution stages
5.1.1. Platform setup
5.1.2. Runtime setup
5.2. Finding the available compute devices
5.3. Initializing and creating OpenCL contexts
5.4. Creating a command queue
5.5. Creating OpenCL program objects
5.6. Building a program executable
5.7. Creating kernel and memory objects
5.7.1. Creating kernel objects
5.7.2. Creating memory objects
5.8. Executing the kernel
5.8.1. Determining the data dimensions
5.8.2. Determining the optimal global work size
5.8.3. Determining the local work-group size
5.8.4. Enqueuing kernel execution
5.8.5. Executing kernels
5.9. Reading the results
5.10. Cleaning up
6. Converting Existing Code to OpenCL
6.1. Profile your application
6.2. Analyzing code for parallelization
6.2.1. About analyzing code for parallelization
6.2.2. Look for data parallel operations
6.2.3. Look for operations with few dependencies
6.2.4. Analyze loops
6.3. Parallel processing techniques in OpenCL
6.3.1. Use the global ID instead of the loop counter
6.3.2. Compute values in a loop with a formula instead of using counters
6.3.3. Compute values per frame
6.3.4. Perform computations with dependencies in multiple-passes
6.3.5. Pre-compute values to remove dependencies
6.3.6. Use software pipelining
6.3.7. Use task parallelism
6.4. Using parallel processing with non-parallelizable code
6.5. Dividing data for OpenCL
6.5.1. About dividing data for OpenCL
6.5.2. Use concurrent data structures
6.5.3. Data division examples
7. Retuning Existing OpenCL Code for Mali GPUs
7.1. About retuning existing OpenCL code for Mali GPUs
7.2. Differences between desktop based architectures and Mali GPUs
7.2.1. About desktop based GPU architectures
7.2.2. About the architecture of the Mali-T600 Series GPUs
7.2.3. Programming a Mali-T600 Series GPU
7.3. Procedure for retuning existing OpenCL code for Mali GPUs
7.3.1. Analyze code
7.3.2. Locate and remove device optimizations
7.3.3. Optimizing your OpenCL code for Mali GPUs
8. Optimizing OpenCL for Mali GPUs
8.1. The optimization process for OpenCL applications
8.1.1. Measure individual kernels
8.1.2. Select the kernels that take the most time
8.1.3. Analyze the kernels
8.1.4. Measure individual parts of the kernel
8.2. Load balancing between the application processor and the Mali GPU
8.3. Sharing memory between I/O devices and OpenCL
9. OpenCL Optimizations List
9.1. General optimizations
9.2. Memory optimizations
9.2.1. About memory optimizations
9.2.2. Use CL_MEM_ALLOC_HOST_PTR to avoid copying memory
9.2.3. Do not allocate memory buffers created with malloc() for OpenCL applications
9.2.4. Do not create buffers with CL_MEM_USE_HOST_PTR if possible
9.3. Kernel optimizations
9.4. Code optimizations
9.5. Execution optimizations
9.6. Reducing the effect of serial computations
10. The Mali OpenCL SDK
A. OpenCL Data Types
B. OpenCL Built-in Functions
B.1. Work-item functions
B.2. Math functions
B.3. half_ and native_ math functions
B.4. Integer functions
B.5. Common functions
B.6. Geometric functions
B.7. Relational functions
B.8. Vector data load and store functions
B.9. Synchronization
B.10. Asynchronous copy functions
B.11. Atomic functions
B.12. Miscellaneous vector functions
B.13. Image read and write functions
C. OpenCL Extensions

Proprietary Notice

Words and logos marked with ® or ™ are registered trademarks or trademarks of ARM® in the EU and other countries, except as otherwise stated below in this proprietary notice. Other brands and names mentioned herein may be the trademarks of their respective owners.

Neither the whole nor any part of the information contained in, or the product described in, this document may be adapted or reproduced in any material form except with the prior written permission of the copyright holder.

The product described in this document is subject to continuous developments and improvements. All particulars of the product and its use contained in this document are given by ARM in good faith. However, all warranties implied or expressed, including but not limited to implied warranties of merchantability, or fitness for purpose, are excluded.

This document is intended only to assist the reader in the use of the product. ARM shall not be liable for any loss or damage arising from the use of any information in this document, or any error or omission in such information, or any incorrect use of the product.

Where the term ARM is used it means “ARM or any of its subsidiaries as appropriate”.

Confidentiality Status

This document is Non-Confidential. The right to use, copy and disclose this document may be subject to license restrictions in accordance with the terms of the agreement entered into by ARM and the party that ARM delivered this document to.

Product Status

The information in this document is final, that is, for a developed product.

Revision History
Revision A    12 July 2012        First release
Revision D    07 November 2012    Second release
Revision E    27 February 2013    Third release
Revision F    03 December 2013    Fourth release
Copyright © 2012-2013 ARM. All rights reserved.
DUI0538F