ARM® Mali™ GPU OpenCL Developer Guide

Version 3.3

Table of Contents

About this book
Product revision status
Intended audience
Using this book
Additional reading
Feedback on this product
Feedback on content
1 Introduction
1.1 About ARM® Mali™ GPUs
1.2 About OpenCL
1.3 About the Mali GPU OpenCL driver and support
2 Parallel Processing Concepts
2.1 About parallel processing
2.2 Types of parallelism
2.2.1 Data parallelism
2.2.2 Task parallelism
2.2.3 Pipelines
2.3 Mixing different types of parallelism
2.4 Embarrassingly parallel applications
2.5 Limitations of parallel processing and Amdahl's law
2.6 Concurrency
3 OpenCL Concepts
3.1 Using OpenCL
3.2 OpenCL applications
3.3 OpenCL execution model
3.4 OpenCL data processing
3.5 OpenCL work-groups
3.6 OpenCL identifiers
3.7 The OpenCL memory model
3.7.1 OpenCL memory model overview
3.7.2 Memory types in OpenCL
3.8 The Mali™ GPU OpenCL memory model
3.9 OpenCL concepts summary
4 Developing an OpenCL Application
4.1 Software and hardware requirements for Mali GPU OpenCL development
4.2 Development stages for OpenCL
5 Execution Stages of an OpenCL Application
5.1 About the execution stages
5.1.1 Platform setup
5.1.2 Runtime setup
5.2 Finding the available compute devices
5.3 Initializing and creating OpenCL contexts
5.4 Creating a command queue
5.5 Creating OpenCL program objects
5.6 Building a program executable
5.7 Creating kernel and memory objects
5.7.1 Creating kernel objects
5.7.2 Creating memory objects
5.8 Executing the kernel
5.8.1 Determining the data dimensions
5.8.2 Determining the optimal global work size
5.8.3 Determining the local work-group size
5.8.4 Enqueuing kernel execution
5.8.5 Executing kernels
5.9 Reading the results
5.10 Cleaning up unused objects
6 Converting Existing Code to OpenCL
6.1 Profiling your application
6.2 Analyzing code for parallelization
6.2.1 About analyzing code for parallelization
6.2.2 Finding data parallel operations
6.2.3 Finding operations with few dependencies
6.2.4 Analyze loops
6.3 Parallel processing techniques in OpenCL
6.3.1 Use the global ID instead of the loop counter
6.3.2 Compute values in a loop with a formula instead of using counters
6.3.3 Compute values per frame
6.3.4 Perform computations with dependencies in multiple passes
6.3.5 Pre-compute values to remove dependencies
6.3.6 Use software pipelining
6.3.7 Use task parallelism
6.4 Using parallel processing with non-parallelizable code
6.5 Dividing data for OpenCL
6.5.1 About dividing data for OpenCL
6.5.2 Use concurrent data structures
6.5.3 Data division examples
7 Retuning Existing OpenCL Code
7.1 About retuning existing OpenCL code for Mali GPUs
7.1.1 Converting the existing OpenCL code for Mali™ GPUs
7.2 Differences between desktop-based architectures and Mali GPUs
7.2.1 About desktop-based GPU architectures
7.2.2 About Mali GPU architectures
7.2.3 Programming OpenCL for Mali GPUs
7.3 Procedure for retuning existing OpenCL code for Mali GPUs
7.3.1 Analyze code
7.3.2 Locate and remove device optimizations
7.3.3 Optimize your OpenCL code for Mali GPUs
8 Optimizing OpenCL for Mali GPUs
8.1 The optimization process for OpenCL applications
8.2 Load balancing between control threads and OpenCL threads
8.2.1 Do not use clFinish() for synchronization
8.2.2 Do not use any of the clEnqueueMap() operations with a blocking call
8.3 Memory allocation
8.3.1 About memory allocation
8.3.2 Use CL_MEM_ALLOC_HOST_PTR to avoid copying memory
8.3.3 Do not create buffers with CL_MEM_USE_HOST_PTR if possible
8.3.4 Do not allocate memory buffers created with malloc() for OpenCL applications
8.3.5 Sharing memory between I/O devices and OpenCL
8.3.6 Sharing memory in an I/O coherent system
9 OpenCL Optimizations List
9.1 General optimizations
9.2 Kernel optimizations
9.3 Code optimizations
9.4 Execution optimizations
9.5 Reducing the effect of serial computations
9.6 Mali™ Bifrost GPU specific optimizations
10 The Kernel Auto-Vectorizer and Unroller
10.1 About the kernel auto-vectorizer and unroller
10.2 Kernel auto-vectorizer options
10.2.1 Kernel auto-vectorizer command and parameters
10.2.2 Kernel auto-vectorizer command examples
10.3 Kernel unroller options
10.3.1 Kernel unroller command and parameters
10.3.2 Kernel unroller command examples
10.4 The dimension interchange transformation
A OpenCL Data Types
A.1 About OpenCL data types
A.2 OpenCL data type lists
A.2.1 Built-in scalar data types
A.2.2 Built-in vector data types
A.2.3 Other built-in data types
A.2.4 Reserved data types
B OpenCL Built-in Functions
B.1 Work-item functions
B.2 Math functions
B.3 half_ and native_ math functions
B.4 Integer functions
B.5 Common functions
B.6 Geometric functions
B.7 Relational functions
B.8 Vector data load and store functions
B.9 Synchronization
B.10 Asynchronous copy functions
B.11 Atomic functions
B.12 Miscellaneous vector functions
B.13 Image read and write functions
C OpenCL Extensions
C.1 OpenCL extensions supported by the Mali™ GPU OpenCL driver
D Using OpenCL Extensions
D.1 Inter-operation with EGL
D.2 The cl_arm_printf extension
D.2.1 About the cl_arm_printf extension
D.2.2 cl_arm_printf example
E OpenCL 1.2
E.1 OpenCL 1.2 compiler options
E.2 OpenCL 1.2 compiler parameters
E.3 OpenCL 1.2 functions
E.4 Functions deprecated in OpenCL 1.2
E.5 OpenCL 1.2 extensions
E.5.1 The cl_arm_shared_virtual_memory extension
F Revisions
F.1 Revisions

