D.4 The cl_arm_job_slot_selection extension

The cl_arm_job_slot_selection extension enables applications to select the job slot to use for work submission to the GPU.

The job slot is selected at command queue creation time via the CL_QUEUE_JOB_SLOT_ARM property. Applications can use the CL_DEVICE_JOB_SLOTS_ARM device info query to get a bitmask of allowed job slots.
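As a minimal sketch of how an application might interpret the bitmask, the helpers below test whether a slot is allowed and pick one slot from the mask. The function names are hypothetical, the "highest allowed slot" policy is an arbitrary choice for illustration, and the code assumes that the query returns a 32-bit mask in which bit n corresponds to job slot n. The OpenCL calls appear only in a comment and are not compiled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if bit `slot` is set in the mask reported by the
 * CL_DEVICE_JOB_SLOTS_ARM query (assumed: bit n <=> job slot n). */
bool job_slot_allowed(uint32_t allowed_slots, unsigned slot)
{
    return slot < 32u && ((allowed_slots >> slot) & 1u) != 0u;
}

/* Returns the highest allowed slot index, or -1 if the mask is empty.
 * Choosing the highest slot is an illustrative policy only. */
int highest_allowed_job_slot(uint32_t allowed_slots)
{
    for (int slot = 31; slot >= 0; --slot) {
        if (job_slot_allowed(allowed_slots, (unsigned)slot)) {
            return slot;
        }
    }
    return -1;
}

/*
 * Possible use with the OpenCL host API (assumption: the mask is a cl_uint
 * and the queue is created with clCreateCommandQueueWithProperties):
 *
 *   cl_uint slots = 0;
 *   clGetDeviceInfo(device, CL_DEVICE_JOB_SLOTS_ARM,
 *                   sizeof(slots), &slots, NULL);
 *   int slot = highest_allowed_job_slot(slots);
 *   cl_queue_properties props[] = {
 *       CL_QUEUE_JOB_SLOT_ARM, (cl_queue_properties)slot, 0
 *   };
 *   cl_command_queue queue =
 *       clCreateCommandQueueWithProperties(context, device, props, &err);
 */
```

An application would normally check that the result is not -1 before building the property list, and check the error code that queue creation returns.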

One possible use case for this extension is reducing the latency of submitting work to the GPU when the GPU is shared between multiple workloads. For example, context switching between two workloads requires the work scheduled on the GPU to yield to a soft-stop request before the new work is scheduled, which in turn requires all the threads running on the GPU to complete execution. With threads running compute work, this can take a long time and lead to unacceptable latency before other workloads begin execution on the GPU. To improve this behavior, it is possible to partition the GPU into multiple groups of shader cores, each of which can be targeted by different job slots. Beginning execution on a given partition does not require the work on other partitions to complete, which reduces the latency of starting the new work.

The Mali kernel module exposes a core_mask sysfs file that allows allocation of shader cores to specific job slots, effectively creating partitions in the GPU. For example, for a typical GPU that has 12 cores and three job slots, the default mask for every job slot would likely be 0xFFF, meaning that all cores are allocated to all job slots. This allocation can be changed by writing a space-separated list of core masks, one for each job slot, to the core_mask file. The following command creates two partitions, one allocating four cores to job slots 0 and 1, and another allocating eight cores to job slot 2:

echo '0xF00 0xF00 0x0FF' > /sys/path/to/mali/core_mask
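The mask arithmetic in this example can be checked with a short sketch. The helper names are hypothetical and are not part of the Mali kernel module or the extension. The code counts the cores in each mask, tests whether two masks describe separate partitions, and formats a mask list in the space-separated form described above. It prints each mask without leading zeros, so 0x0FF appears as 0xFF, which is the same value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Counts the shader cores (set bits) selected by a core mask. */
unsigned core_count(uint64_t mask)
{
    unsigned n = 0;
    while (mask != 0) {
        mask &= mask - 1; /* clear the lowest set bit */
        ++n;
    }
    return n;
}

/* True if the two masks share no cores, that is, they are separate partitions. */
bool masks_disjoint(uint64_t a, uint64_t b)
{
    return (a & b) == 0;
}

/* Writes the masks to buf as a space-separated hexadecimal list.
 * Returns the string length, or -1 if buf is too small. */
int format_core_masks(char *buf, size_t size,
                      const uint64_t *masks, size_t count)
{
    size_t used = 0;
    if (size == 0) {
        return -1;
    }
    buf[0] = '\0';
    for (size_t i = 0; i < count; ++i) {
        int n = snprintf(buf + used, size - used, "%s0x%llX",
                         i ? " " : "", (unsigned long long)masks[i]);
        if (n < 0 || (size_t)n >= size - used) {
            return -1;
        }
        used += (size_t)n;
    }
    return (int)used;
}
```

For the masks in the example, core_count gives 4 for 0xF00 and 8 for 0x0FF, and masks_disjoint confirms that the two partitions do not overlap.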

An OpenCL application can then use the extension to send compute work to job slot 2, targeting the partition with eight cores. Any other work on the GPU that must start execution quickly can target job slot 0, job slot 1, or both, without needing the work resident on the GPU partition linked to job slot 2 to soft-stop.

Power management strategies have to be considered when partitioning the GPU. Attempting to submit work to a job slot that does not have any online cores will result in a GPU fault.

Note:

For more information about this extension, see the Khronos extension specifications at https://www.khronos.org/.
Non-Confidential 101574_0301_00_en
Copyright © 2019 Arm Limited or its affiliates. All rights reserved.