Arm® Mali™ Bifrost and Valhall OpenCL Developer Guide

Version 3.2

Table of Contents

About this book
Product revision status
Intended audience
Using this book
Additional reading
Feedback on this product
Feedback on content
1 Introduction
1.1 About Arm® Mali™ GPUs
1.2 About OpenCL
1.3 About the Mali™ GPU OpenCL driver and support
2 Parallel processing concepts
2.1 About parallel processing
2.2 Types of parallelism
2.2.1 Data parallelism
2.2.2 Task parallelism
2.2.3 Pipelines
2.3 Mixing different types of parallelism
2.4 Embarrassingly parallel applications
2.5 Limitations of parallel processing and Amdahl's law
2.6 Concurrency
3 OpenCL concepts
3.1 Using OpenCL
3.2 OpenCL applications
3.3 OpenCL execution model
3.4 OpenCL data processing
3.5 OpenCL work-groups
3.6 OpenCL identifiers
3.7 OpenCL memory model
3.7.1 OpenCL memory model overview
3.7.2 Memory types in OpenCL
3.8 Mali™ GPU OpenCL memory model
3.9 OpenCL concepts summary
4 Developing an OpenCL application
4.1 Software and hardware requirements for Mali™ GPU OpenCL development
4.2 Development stages for OpenCL
5 Execution stages of an OpenCL application
5.1 About the execution stages
5.1.1 Platform setup
5.1.2 Runtime setup
5.2 Finding the available compute devices
5.3 Initializing and creating OpenCL contexts
5.4 Creating a command queue
5.5 Creating OpenCL program objects
5.6 Building a program executable
5.7 Creating kernel and memory objects
5.7.1 Creating kernel objects
5.7.2 Creating memory objects
5.8 Executing the kernel
5.8.1 Determining the data dimensions
5.8.2 Determining the optimal global work size
5.8.3 Determining the local work-group size
5.8.4 Enqueuing kernel execution
5.8.5 Executing kernels
5.9 Reading the results
5.10 Cleaning up unused objects
6 Converting existing code to OpenCL
6.1 Profiling your application
6.2 Analyzing code for parallelization
6.2.1 About analyzing code for parallelization
6.2.2 Finding data parallel operations
6.2.3 Finding operations with few dependencies
6.2.4 Analyze loops
6.3 Parallel processing techniques in OpenCL
6.3.1 Use the global ID instead of the loop counter
6.3.2 Compute values in a loop with a formula instead of using counters
6.3.3 Compute values per frame
6.3.4 Perform computations with dependencies in multiple-passes
6.3.5 Pre-compute values to remove dependencies
6.3.6 Use software pipelining
6.3.7 Use task parallelism
6.4 Using parallel processing with non-parallelizable code
6.5 Dividing data for OpenCL
6.5.1 About dividing data for OpenCL
6.5.2 Use concurrent data structures
6.5.3 Data division examples
7 Retuning existing OpenCL code
7.1 About retuning existing OpenCL code for Mali™ GPUs
7.2 Differences between desktop-based architectures and Mali™ GPUs
7.2.1 About desktop-based GPU architectures
7.2.2 About Mali™ GPU architectures
7.2.3 Programming OpenCL for Mali™ GPUs
7.3 Retuning existing OpenCL code for Mali™ GPUs
7.3.1 Analyze code
7.3.2 Locate and remove device optimizations
7.3.3 Optimize your OpenCL code for Mali™ GPUs
8 Optimizing OpenCL for Mali™ GPUs
8.1 The optimization process for OpenCL applications
8.2 Load balancing between control threads and OpenCL threads
8.2.1 Do not use clFinish() for synchronization
8.2.2 Do not use any of the clEnqueueMap() operations with a blocking call
8.3 Optimizing memory allocation
8.3.1 About memory allocation
8.3.2 Use CL_MEM_ALLOC_HOST_PTR to avoid copying memory
8.3.3 Do not create buffers with CL_MEM_USE_HOST_PTR if possible
8.3.4 Do not allocate memory buffers created with malloc() for OpenCL applications
8.3.5 Sharing memory between I/O devices and OpenCL
8.3.6 Sharing memory in a fully coherent system
8.3.7 Sharing memory in an I/O coherent system
9 OpenCL optimizations list
9.1 General optimizations
9.2 Kernel optimizations
9.3 Code optimizations
9.4 Execution optimizations
9.5 Reducing the effect of serial computations
9.6 Mali™ Bifrost and Valhall GPU specific optimizations
10 Kernel auto-vectorizer and unroller
10.1 About the kernel auto-vectorizer and unroller
10.2 Kernel auto-vectorizer options
10.2.1 Kernel auto-vectorizer command and parameters
10.2.2 Kernel auto-vectorizer command examples
10.3 Kernel unroller options
10.3.1 Kernel unroller command and parameters
10.3.2 Kernel unroller command examples
10.4 The dimension interchange transformation
A OpenCL data types
A.1 About OpenCL data types
A.2 OpenCL data type lists
A.2.1 Built-in scalar data types
A.2.2 Built-in vector data types
A.2.3 Other built-in data types
A.2.4 Reserved data types
B OpenCL built-in functions
B.1 Work-item functions
B.2 Math functions
B.3 half_ and native_ math functions
B.4 Integer functions
B.5 Common functions
B.6 Geometric functions
B.7 Relational functions
B.8 Vector data load and store functions
B.9 Synchronization functions
B.10 Asynchronous copy functions
B.11 Atomic functions
B.12 Miscellaneous vector functions
B.13 Image read and write functions
C OpenCL extensions
C.1 OpenCL extensions supported by the Mali™ GPU OpenCL driver
D Using OpenCL extensions
D.1 Inter-operation with EGL
D.1.1 EGL images
D.1.2 ANDROID_image_native_buffer
D.1.3 EGL_EXT_image_dma_buf_import
D.2 The cl_arm_printf extension
D.2.1 About the cl_arm_printf extension
D.2.2 cl_arm_printf example
D.3 The cl_arm_import_memory extensions
D.4 The cl_arm_job_slot_selection extension
E OpenCL 1.2
E.1 OpenCL 1.2 compiler options
E.2 OpenCL 1.2 compiler parameters
E.3 OpenCL 1.2 functions
E.4 Functions deprecated in OpenCL 1.2
F OpenCL 2.0
F.1 About OpenCL 2.0
F.2 OpenCL 2.0 functions
F.2.1 OpenCL 2.0 API functions
F.2.2 OpenCL 2.0 built-in functions
F.3 OpenCL 2.0 compiler options
F.4 Program scope variables
F.5 Functions deprecated in OpenCL 2.0
F.6 OpenCL 2.0 extensions
F.7 OpenCL 2.0 optimizations
F.8 Shared virtual memory
F.9 OpenCL 2.0 pipes and device execution
G OpenCL 2.1
G.1 About OpenCL 2.1
G.2 OpenCL 2.1 functions
G.2.1 OpenCL 2.1 API functions
G.2.2 OpenCL 2.1 built-in functions
G.3 Intermediate language programs
G.4 Device and host timer functions
G.5 Queue priority hints
H Revisions
H.1 Revisions

Release Information

Document History
Issue    Date              Confidentiality   Change
0100-00  15 February 2019  Non-Confidential  First release of version 1.0
0200-00  12 April 2019     Non-Confidential  First release of version 2.0
0300-00  28 June 2019      Non-Confidential  First release of version 3.0
0301-00  01 August 2019    Non-Confidential  First release of version 3.1
0302-00  27 November 2019  Non-Confidential  First release of version 3.2

Non-Confidential Proprietary Notice

This document is protected by copyright and other related rights and the practice or implementation of the information contained in this document may be protected by one or more patents or pending patent applications. No part of this document may be reproduced in any form by any means without the express prior written permission of Arm. No license, express or implied, by estoppel or otherwise to any intellectual property rights is granted by this document unless specifically stated.

Your access to the information in this document is conditional upon your acceptance that you will not use or permit others to use the information for the purposes of determining whether implementations infringe any third party patents.

THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, Arm makes no representation with respect to, and has undertaken no analysis to identify or understand the scope and content of, third party patents, copyrights, trade secrets, or other rights.

This document may include technical inaccuracies or typographical errors.


This document consists solely of commercial items. You shall be responsible for ensuring that any use, duplication or disclosure of this document complies fully with any relevant export laws and regulations to assure that this document or any portion thereof is not exported, directly or indirectly, in violation of such export laws. Use of the word “partner” in reference to Arm’s customers is not intended to create or refer to any partnership relationship with any other company. Arm may make changes to this document at any time and without notice.

If any of the provisions contained in these terms conflict with any of the provisions of any click through or signed written agreement covering this document with Arm, then the click through or signed written agreement prevails over and supersedes the conflicting provisions of these terms. This document may be translated into other languages for convenience, and you agree that if there is any conflict between the English version of this document and any translation, the terms of the English version of the Agreement shall prevail.

The Arm corporate logo and words marked with ® or ™ are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. Other brands and names mentioned in this document may be the trademarks of their respective owners. Please follow Arm’s trademark usage guidelines at

Copyright © 2019 Arm Limited (or its affiliates). All rights reserved.

Arm Limited. Company 02557590 registered in England.

110 Fulbourn Road, Cambridge, England CB1 9NJ.


Confidentiality Status

This document is Non-Confidential. The right to use, copy and disclose this document may be subject to license restrictions in accordance with the terms of the agreement entered into by Arm and the party that Arm delivered this document to.

Unrestricted Access is an Arm internal classification.

Product Status

The information in this document is Final, that is for a developed product.

Web Address

Non-Confidential, PDF version 101574_0302_00_en