ARM® Mali™ GPU OpenCL Developer Guide

Version 3.3

Table of Contents

About this book
Product revision status
Intended audience
Using this book
Additional reading
Feedback on this product
Feedback on content
1 Introduction
1.1 About ARM® Mali™ GPUs
1.2 About OpenCL
1.3 About the Mali GPU OpenCL driver and support
2 Parallel Processing Concepts
2.1 About parallel processing
2.2 Types of parallelism
2.2.1 Data parallelism
2.2.2 Task parallelism
2.2.3 Pipelines
2.3 Mixing different types of parallelism
2.4 Embarrassingly parallel applications
2.5 Limitations of parallel processing and Amdahl's law
2.6 Concurrency
3 OpenCL Concepts
3.1 Using OpenCL
3.2 OpenCL applications
3.3 OpenCL execution model
3.4 OpenCL data processing
3.5 OpenCL work-groups
3.6 OpenCL identifiers
3.7 The OpenCL memory model
3.7.1 OpenCL memory model overview
3.7.2 Memory types in OpenCL
3.8 The Mali™ GPU OpenCL memory model
3.9 OpenCL concepts summary
4 Developing an OpenCL Application
4.1 Software and hardware requirements for Mali GPU OpenCL development
4.2 Development stages for OpenCL
5 Execution Stages of an OpenCL Application
5.1 About the execution stages
5.1.1 Platform setup
5.1.2 Runtime setup
5.2 Finding the available compute devices
5.3 Initializing and creating OpenCL contexts
5.4 Creating a command queue
5.5 Creating OpenCL program objects
5.6 Building a program executable
5.7 Creating kernel and memory objects
5.7.1 Creating kernel objects
5.7.2 Creating memory objects
5.8 Executing the kernel
5.8.1 Determining the data dimensions
5.8.2 Determining the optimal global work size
5.8.3 Determining the local work-group size
5.8.4 Enqueuing kernel execution
5.8.5 Executing kernels
5.9 Reading the results
5.10 Cleaning up unused objects
6 Converting Existing Code to OpenCL
6.1 Profiling your application
6.2 Analyzing code for parallelization
6.2.1 About analyzing code for parallelization
6.2.2 Finding data parallel operations
6.2.3 Finding operations with few dependencies
6.2.4 Analyze loops
6.3 Parallel processing techniques in OpenCL
6.3.1 Use the global ID instead of the loop counter
6.3.2 Compute values in a loop with a formula instead of using counters
6.3.3 Compute values per frame
6.3.4 Perform computations with dependencies in multiple-passes
6.3.5 Pre-compute values to remove dependencies
6.3.6 Use software pipelining
6.3.7 Use task parallelism
6.4 Using parallel processing with non-parallelizable code
6.5 Dividing data for OpenCL
6.5.1 About dividing data for OpenCL
6.5.2 Use concurrent data structures
6.5.3 Data division examples
7 Retuning Existing OpenCL Code
7.1 About retuning existing OpenCL code for Mali GPUs
7.1.1 Converting the existing OpenCL code for Mali™ GPUs
7.2 Differences between desktop-based architectures and Mali GPUs
7.2.1 About desktop-based GPU architectures
7.2.2 About Mali GPU architectures
7.2.3 Programming OpenCL for Mali GPUs
7.3 Procedure for retuning existing OpenCL code for Mali GPUs
7.3.1 Analyze code
7.3.2 Locate and remove device optimizations
7.3.3 Optimize your OpenCL code for Mali GPUs
8 Optimizing OpenCL for Mali GPUs
8.1 The optimization process for OpenCL applications
8.2 Load balancing between control threads and OpenCL threads
8.2.1 Do not use clFinish() for synchronization
8.2.2 Do not use any of the clEnqueueMap() operations with a blocking call
8.3 Memory allocation
8.3.1 About memory allocation
8.3.2 Use CL_MEM_ALLOC_HOST_PTR to avoid copying memory
8.3.3 Do not create buffers with CL_MEM_USE_HOST_PTR if possible
8.3.4 Do not allocate memory buffers created with malloc() for OpenCL applications
8.3.5 Sharing memory between I/O devices and OpenCL
8.3.6 Sharing memory in an I/O coherent system
9 OpenCL Optimizations List
9.1 General optimizations
9.2 Kernel optimizations
9.3 Code optimizations
9.4 Execution optimizations
9.5 Reducing the effect of serial computations
9.6 Mali™ Bifrost GPU specific optimizations
10 The kernel auto-vectorizer and unroller
10.1 About the kernel auto-vectorizer and unroller
10.2 Kernel auto-vectorizer options
10.2.1 Kernel auto-vectorizer command and parameters
10.2.2 Kernel auto-vectorizer command examples
10.3 Kernel unroller options
10.3.1 Kernel unroller command and parameters
10.3.2 Kernel unroller command examples
10.4 The dimension interchange transformation
A OpenCL Data Types
A.1 About OpenCL data types
A.2 OpenCL data type lists
A.2.1 Built-in scalar data types
A.2.2 Built-in vector data types
A.2.3 Other built-in data types
A.2.4 Reserved data types
B OpenCL Built-in Functions
B.1 Work-item functions
B.2 Math functions
B.3 half_ and native_ math functions
B.4 Integer functions
B.5 Common functions
B.6 Geometric functions
B.7 Relational functions
B.8 Vector data load and store functions
B.9 Synchronization
B.10 Asynchronous copy functions
B.11 Atomic functions
B.12 Miscellaneous vector functions
B.13 Image read and write functions
C OpenCL Extensions
C.1 OpenCL extensions supported by the Mali™ GPU OpenCL driver
D Using OpenCL Extensions
D.1 Inter-operation with EGL
D.2 The cl_arm_printf extension
D.2.1 About the cl_arm_printf extension
D.2.2 cl_arm_printf example
E OpenCL 1.2
E.1 OpenCL 1.2 compiler options
E.2 OpenCL 1.2 compiler parameters
E.3 OpenCL 1.2 functions
E.4 Functions deprecated in OpenCL 1.2
E.5 OpenCL 1.2 extensions
E.5.1 The cl_arm_shared_virtual_memory extension
F Revisions
F.1 Revisions

Release Information

Document History
Issue     Date                Confidentiality    Change
A         12 July 2012        Confidential       First release
D         07 November 2012    Confidential       Second release
E         27 February 2013    Non-Confidential   Third release
F         03 December 2013    Non-Confidential   Fourth release
G         13 May 2015         Non-Confidential   First release for r6p0
H         10 August 2015      Non-Confidential   First release for r7p0
I         01 October 2015     Non-Confidential   First release for r8p0
0900-00   03 December 2015    Non-Confidential   First release for r9p0
1000-00   21 January 2016     Non-Confidential   First release for r10p0
1100-00   24 March 2016       Non-Confidential   First release for r11p0
0300-00   21 April 2016       Non-Confidential   Changed to version 3.0
0301-00   13 May 2016         Non-Confidential   First release of version 3.1
0302-00   12 July 2016        Non-Confidential   First release of version 3.2
0303-00   22 February 2017    Non-Confidential   First release of version 3.3

Non-Confidential Proprietary Notice

This document is protected by copyright and other related rights and the practice or implementation of the information contained in this document may be protected by one or more patents or pending patent applications. No part of this document may be reproduced in any form by any means without the express prior written permission of ARM. No license, express or implied, by estoppel or otherwise to any intellectual property rights is granted by this document unless specifically stated.

Your access to the information in this document is conditional upon your acceptance that you will not use or permit others to use the information for the purposes of determining whether implementations infringe any third party patents.

THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, ARM makes no representation with respect to, and has undertaken no analysis to identify or understand the scope and content of, third party patents, copyrights, trade secrets, or other rights.

This document may include technical inaccuracies or typographical errors.


This document consists solely of commercial items. You shall be responsible for ensuring that any use, duplication or disclosure of this document complies fully with any relevant export laws and regulations to assure that this document or any portion thereof is not exported, directly or indirectly, in violation of such export laws. Use of the word “partner” in reference to ARM’s customers is not intended to create or refer to any partnership relationship with any other company. ARM may make changes to this document at any time and without notice.

If any of the provisions contained in these terms conflict with any of the provisions of any signed written agreement covering this document with ARM, then the signed written agreement prevails over and supersedes the conflicting provisions of these terms. This document may be translated into other languages for convenience, and you agree that if there is any conflict between the English version of this document and any translation, the terms of the English version of the Agreement shall prevail.

Words and logos marked with ® or ™ are registered trademarks or trademarks of ARM Limited or its affiliates in the EU and/or elsewhere. All rights reserved. Other brands and names mentioned in this document may be the trademarks of their respective owners. Please follow ARM’s trademark usage guidelines at

Copyright © 2012, 2013, 2015–2017, ARM Limited or its affiliates. All rights reserved.

ARM Limited. Company 02557590 registered in England.

110 Fulbourn Road, Cambridge, England CB1 9NJ.


Confidentiality Status

This document is Non-Confidential. The right to use, copy and disclose this document may be subject to license restrictions in accordance with the terms of the agreement entered into by ARM and the party that ARM delivered this document to.

Unrestricted Access is an ARM internal classification.

Product Status

The information in this document is Final, that is for a developed product.

Web Address

ARM 100614_0303_00_en