3.19 Indicating loop iteration counts to the compiler with __promise(expr)

The __promise intrinsic lets you indicate to the compiler that a loop iteration count is, for example, always divisible by 8. This enables the compiler to generate smaller and faster code by reducing the overhead of runtime iteration count tests.

The NEON unit can operate on elements in groups of 2, 4, 8, or 16. Where the iteration count at the start of the loop is unknown, the compiler might add a runtime test to check whether the iteration count is a multiple of the number of lanes that can be used for the appropriate data type in a NEON register. This increases code size because additional nonvectorized code is generated to execute any leftover loop iterations.

The overhead added by the runtime test is typically insignificant compared with the performance increase that arises from the vectorized code, although corner cases do exist. For example, an iteration count of 17 gives a group of 16 elements to operate on in parallel, with 1 iteration left over as nonvectorized code, whereas an iteration count of 3 gives a group of only 2 elements to operate on in parallel, again with 1 iteration left over. In the latter case, the overhead of the runtime test is proportionally greater in comparison with the vectorized code.
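The arithmetic behind these two cases can be sketched in C. The helper below is purely illustrative: split_iterations and struct iteration_split are hypothetical names, not part of the compiler or of any ARM library. It only shows how an iteration count divides into full vector groups and leftover scalar iterations for an assumed number of lanes.

```c
/* Hypothetical helper for illustration only: not a compiler or library API. */
struct iteration_split {
    unsigned vector_groups;   /* full groups that can be processed in parallel */
    unsigned scalar_leftover; /* iterations left over for nonvectorized code */
};

static struct iteration_split split_iterations(unsigned count, unsigned lanes)
{
    struct iteration_split s;
    s.vector_groups   = count / lanes; /* whole groups of 'lanes' elements */
    s.scalar_leftover = count % lanes; /* remainder needing scalar code */
    return s;
}
```

With these definitions, split_iterations(17, 16) and split_iterations(3, 2) both yield one vector group and one leftover iteration, but the single group covers 16 elements in the first case and only 2 in the second.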

If you know that the iteration count is divisible by the number of elements that the NEON unit can operate on in parallel, you can indicate this to the compiler using the __promise intrinsic, for example:

/* Promise the compiler that the loop iteration count is divisible by 16 */
__promise((k % 16) == 0);
for (i = 0; i < k; i++)
{
    ...
}

The __promise intrinsic is required to enable vectorization if the loop iteration count at the start of the loop is unknown, provided that the promise you make actually holds.

Using __promise in this way reduces the size of the generated code and can give a performance improvement.
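If the same source must also build with a compiler that does not provide __promise, one option is to hide the intrinsic behind a macro. The sketch below is an assumption rather than a recommended practice: LOOP_PROMISE and increment_bytes are hypothetical names, and the macro relies on __CC_ARM, which armcc predefines. On other compilers the promise falls back to assert, which checks the condition in debug builds but gives no optimization hint.

```c
#include <assert.h>

/* Hypothetical wrapper: use the intrinsic on armcc, fall back to assert elsewhere. */
#if defined(__CC_ARM)
#define LOOP_PROMISE(expr) __promise(expr)
#else
#define LOOP_PROMISE(expr) assert(expr)
#endif

/* Illustrative loop over bytes; 16 byte-sized elements fit in one 128-bit NEON register. */
void increment_bytes(unsigned char *data, int k)
{
    int i;
    /* Promise that the loop iteration count is divisible by 16 */
    LOOP_PROMISE((k % 16) == 0);
    for (i = 0; i < k; i++)
    {
        data[i] = (unsigned char)(data[i] + 1);
    }
}
```

The caller remains responsible for passing a count that is a multiple of 16, because the promise is an unchecked assertion when the intrinsic is used.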

The disassembled output of the example code below illustrates the difference that __promise makes. The disassembly is reduced to a simple vectorized loop, because the nonvectorized code that would otherwise be required for possible additional loop iterations is removed. These additional iterations are those beyond a multiple of the number of lanes that can be used for the appropriate data type in a NEON register. The additional nonvectorized code is known as a scalar fix-up loop. With the use of the __promise(expr) intrinsic, the scalar fix-up loop is removed.

/* promise.c */
void f(int *x, int n)
{
    int i;
    __promise((n > 0) && ((n & 7) == 0));
    for (i=0; i < n; i++) x[i]++;
}

When compiling for a processor that supports NEON, the disassembled output might be similar to the following:

        ARM
        REQUIRE8
        PRESERVE8
        AREA ||.text||, CODE, READONLY, ALIGN=2
f PROC
        VMOV.I32 q0,#0x1
        ASR      r1,r1,#2
|L0.8|
        VLD1.32  {d2,d3},[r0]
        SUBS     r1,r1,#1
        VADD.I32 q1,q1,q0
        VST1.32  {d2,d3},[r0]!
        BNE      |L0.8|
        BX       lr
        ENDP
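
For comparison, the structure that the promise removes can be pictured in plain C. The sketch below is a conceptual model, not compiler output: f_without_promise is a hypothetical name, the inner four-element loop stands in for one NEON operation on four 32-bit lanes, and the trailing loop stands in for the scalar fix-up loop.

```c
/* Conceptual model only: a C picture of vectorized code without __promise. */
void f_without_promise(int *x, int n)
{
    int i = 0;
    int lane;

    /* Main loop: each pass stands in for one 4-lane NEON add. */
    for (; i + 4 <= n; i += 4)
    {
        for (lane = 0; lane < 4; lane++)
        {
            x[i + lane]++;
        }
    }

    /* Scalar fix-up loop: handles the 0 to 3 leftover iterations. */
    for (; i < n; i++)
    {
        x[i]++;
    }
}
```

When n is positive and a multiple of 8, as promised in promise.c, the fix-up loop in this model never executes, which is why the compiler can omit the corresponding code.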
Non-Confidential ARM DUI0472M
Copyright © 2010-2016 ARM Limited or its affiliates. All rights reserved.