3.3 Embedding SVE assembly code directly into C and C++ code

Inline assembly (or inline asm) provides a mechanism for inserting hand-written assembly instructions into C and C++ code. This lets you vectorize parts of a function by hand without having to write the entire function in assembly code.

Note:

This information assumes that you are familiar with details of the SVE Architecture, including vector-width agnostic registers, predication, and WHILE operations.

Using inline assembly rather than writing a separate .s file has the following advantages:

  • Inline assembly code shifts the burden of handling the procedure call standard (PCS) from the programmer to the compiler. This includes allocating the stack frame and preserving all necessary callee-saved registers.
  • Inline assembly code gives the compiler more information about what the assembly code does.
  • The compiler can inline the function that contains the assembly code into its callers.
  • Inline assembly code can take immediate operands that depend on C-level constructs, such as the size of a structure or the byte offset of a particular structure field.
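As an illustration of the last point, the following sketch passes the byte offset of a structure field to a load instruction as an immediate operand. The struct packet type and the load_checksum function are hypothetical names used only for this example. The __aarch64__ guard and the plain C fallback are also assumptions, added so that the sketch builds on other targets.

```c
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical structure, used only for illustration. */
struct packet {
    uint64_t header;
    uint32_t length;
    uint32_t checksum;
};

/* Returns p->checksum. */
uint32_t load_checksum(const struct packet *p)
{
#if defined(__aarch64__)
    uint32_t value;
    /* The compiler evaluates offsetof() and substitutes the result
       into the instruction text as an immediate operand. */
    asm("ldr %w[value], [%[base], %[offset]]"
        : [value] "=r" (value)
        : [base] "r" (p),
          [offset] "i" (offsetof(struct packet, checksum))
        : "memory");
    return value;
#else
    /* Portable fallback (assumption): same result without assembly. */
    return p->checksum;
#endif
}
```

If the layout of struct packet changes, the offset in the instruction follows automatically, which is not the case for an offset written by hand in a separate .s file.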

Structure of an inline assembly statement

The compiler supports the GNU form of inline assembly. Note that it does not support the Microsoft form of inline assembly.

More detailed documentation of the asm construct is available at the GCC website.

Inline assembly statements have the following form:

asm ("instructions" : outputs : inputs : side-effects);

Where:

instructions
is a text string that contains AArch64 assembly instructions, with at least one newline sequence \n between consecutive instructions.
outputs
is a comma-separated list of outputs from the assembly instructions.
inputs
is a comma-separated list of inputs to the assembly instructions.
side-effects
is a comma-separated list of effects that the assembly instructions have, besides reading from inputs and writing to outputs.

Additionally, the asm keyword might need to be followed by the volatile keyword, as described in Use of volatile.

Outputs

Each entry in outputs has one of the following forms:

[name] "=&register-class" (destination)
[name] "=register-class" (destination)

The first form has the register class preceded by =&. This specifies that the assembly instructions might read from one of the inputs (specified in the asm statement's inputs section) after writing to the output.

The second form has the register class preceded by =. This specifies that the assembly instructions never read from inputs in this way. Using the second form is an optimization. It allows the compiler to allocate the same register to the output as it allocates to one of the inputs.
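The difference between the two forms can be sketched with a pair of small functions. The function names are hypothetical, and the __aarch64__ guard with a plain C fallback is an assumption added so that the sketch builds on other targets. The inputs section used here is described later in this topic.

```c
#include <stdint.h>

/* Returns 2*a + b. */
uint64_t twice_plus(uint64_t a, uint64_t b)
{
#if defined(__aarch64__)
    uint64_t out;
    /* The first instruction writes out, and the second instruction
       still reads input b. "=&r" therefore prevents the compiler from
       allocating out to the same register as a or b. */
    asm("lsl %[out], %[a], #1      \n"
        "add %[out], %[out], %[b]"
        : [out] "=&r" (out)
        : [a] "r" (a), [b] "r" (b));
    return out;
#else
    /* Portable fallback (assumption). */
    return 2 * a + b;
#endif
}

/* Returns a + b. */
uint64_t plus(uint64_t a, uint64_t b)
{
#if defined(__aarch64__)
    uint64_t out;
    /* A single instruction reads both inputs and then writes out, so
       no input is read after the output is written. "=r" is enough,
       and out may share a register with a or b. */
    asm("add %[out], %[a], %[b]"
        : [out] "=r" (out)
        : [a] "r" (a), [b] "r" (b));
    return out;
#else
    /* Portable fallback (assumption). */
    return a + b;
#endif
}
```

Using "=r" in twice_plus would allow the compiler to allocate out and b to the same register, in which case the lsl instruction would overwrite b before the add instruction reads it.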

Both forms specify that the assembly instructions produce an output that is stored in the C object specified by destination. This can be any scalar value that is valid for the left-hand side of a C assignment. The register-class field specifies the type of register that the assembly instructions require. It can be one of:

r
if the register for this output when used within the assembly instructions is a general-purpose register (x0-x30)
w
if the register for this output when used within the assembly instructions is a SIMD and floating-point register (v0-v31).

It is not possible at present for outputs to contain an SVE vector or predicate value. All uses of SVE registers must be internal to the inline assembly block.

It is the responsibility of the compiler to allocate a suitable output register and to copy that register into the destination after the asm statement is executed. The assembly instructions within the instructions section of the asm statement can use one of the following forms to refer to the output value:

%[name]
to refer to an r-class output as xN or a w-class output as vN
%w[name]
to refer to an r-class output as wN
%s[name]
to refer to a w-class output as sN
%d[name]
to refer to a w-class output as dN

In all cases N represents the number of the register that the compiler has allocated to the output. The use of these forms means that it is not necessary for the programmer to anticipate precisely which register is selected by the compiler. The following example creates a function that returns the value 10. It shows how the programmer is able to use the %w[res] form to describe the movement of a constant into the output register without knowing which register is used.

int f()
{
  int result;
  asm("movz %w[res], #10" : [res] "=r" (result));
  return result;
}

In optimized output the compiler picks register 0 (the return register) for res, resulting in the following assembly code:

movz w0, #10
ret
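A similar sketch for the w register class uses the %d[name] form to refer to the allocated register as dN. The function name is hypothetical, and the __aarch64__ guard with a plain C fallback is an assumption added so that the sketch builds on other targets.

```c
/* Returns the constant 1.5. */
double one_and_a_half(void)
{
#if defined(__aarch64__)
    double result;
    /* "=w" requests a SIMD and floating-point register, and %d[res]
       names that register as dN inside the instruction. */
    asm("fmov %d[res], #1.5" : [res] "=w" (result));
    return result;
#else
    /* Portable fallback (assumption). */
    return 1.5;
#endif
}
```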

Inputs

Within an asm statement, each entry in the inputs section has the form:

[name] "operand-type" (value)

This construct specifies that the asm statement uses the scalar C expression value as an input, referred to within the assembly instructions as name. The operand-type field specifies how the input value is handled within the assembly instructions. It can be one of the following:

r
if the input is to be placed in a general-purpose register (x0-x30)
w
if the input is to be placed in a SIMD and floating-point register (v0-v31).
[output-name]
if the input is to be placed in the same register as output output-name. In this case the [name] part of the input specification is redundant and can be omitted. The assembly instructions can use the forms described in the Outputs section above (%[name], %w[name], %s[name], %d[name]) to refer to both the input and the output.
i
if the input is an integer constant and is used as an immediate operand. The assembly instructions use %[name] in place of immediate operand #N, where N is the numerical value of value.

In the first two cases, it is the responsibility of the compiler to allocate a suitable register and to ensure that it contains value on entry to the assembly instructions. The assembly instructions must refer to these registers using the same syntax as for the outputs (%[name], %w[name], %s[name], %d[name]).

It is not possible at present for inputs to contain an SVE vector or predicate value. All uses of SVE registers must be internal to instructions.

This example shows an asm statement with the same effect as the previous example, except that an i-form input is used to specify the constant to be assigned to the result.

int f()
{
   int result;
   asm("movz %w[res], %[value]" : [res] "=r" (result) : [value] "i" (10));
   return result;
}
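The [output-name] operand type described in this section can be sketched as follows. The function name is hypothetical, and the __aarch64__ guard with a plain C fallback is an assumption added so that the sketch builds on other targets.

```c
#include <stdint.h>

/* Returns v + 7. */
uint64_t add_seven(uint64_t v)
{
#if defined(__aarch64__)
    /* The input "[v]" (v) is placed in the same register as output
       [v], so %[v] refers to both the input and the output. */
    asm("add %[v], %[v], #7"
        : [v] "=r" (v)
        : "[v]" (v));
    return v;
#else
    /* Portable fallback (assumption). */
    return v + 7;
#endif
}
```

Because the input names the output in this way, the instruction can update the value in place without a separate move.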

Side effects

Many asm statements have effects other than reading from inputs and writing to outputs. This is particularly true of asm statements that implement vectorized loops, since most such loops read from or write to memory. The side-effects section of an asm statement tells the compiler what these additional effects are. Each entry must be one of the following:

"memory"
if the asm statement reads from or writes to memory. This is necessary even if inputs contain pointers to the affected memory.
"cc"
if the asm statement modifies the condition-code flags.
"xN"
if the asm statement modifies general-purpose register N.
"vN"
if the asm statement modifies SIMD and floating-point register N.
"zN"
if the asm statement modifies SVE vector register N. Since SVE vector registers extend the SIMD and floating-point registers, this is equivalent to writing "vN".
"pN"
if the asm statement modifies SVE predicate register N.
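As a minimal sketch of a side-effects list, the following function uses an instruction that sets the condition-code flags and therefore lists "cc". The function name is hypothetical, and the __aarch64__ guard with a plain C fallback is an assumption added so that the sketch builds on other targets.

```c
#include <stdint.h>

/* Stores the wrapped sum a + b in *sum.
   Returns 1 if the unsigned addition carried out of bit 63, else 0. */
int add_with_carry_out(uint64_t a, uint64_t b, uint64_t *sum)
{
#if defined(__aarch64__)
    uint64_t s;
    int carry;
    /* adds writes the condition-code flags, so "cc" is listed.
       The instructions do not access memory, so "memory" is not. */
    asm("adds %[s], %[a], %[b]   \n"
        "cset %w[carry], cs"
        : [s] "=r" (s), [carry] "=r" (carry)
        : [a] "r" (a), [b] "r" (b)
        : "cc");
    *sum = s;
    return carry;
#else
    /* Portable fallback (assumption): unsigned wrap-around check. */
    *sum = a + b;
    return *sum < a;
#endif
}
```

The store to *sum is ordinary C code outside the asm statement, so the compiler already knows about it.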

Use of volatile

Sometimes an asm statement might have dependencies and side effects that cannot be captured by the asm statement syntax. For example, suppose there are three separate asm statements (not three lines within a single asm statement), that do the following:

  • The first sets the floating-point rounding mode.
  • The second executes on the assumption that the rounding mode set by the first statement is in effect.
  • The third statement restores the original floating-point rounding mode.

It is important that these statements are executed in order, but the asm statement syntax provides no direct method for representing the dependency between them. Instead, each statement must add the keyword volatile after asm. This prevents the compiler from removing the asm statement as dead code, even if the asm statement does not modify memory and its results appear to be unused. The compiler always executes asm volatile statements in their original order.

For example:

asm volatile ("msr fpcr, %[flags]" :: [flags] "r" (new_fpcr_value));

Note:

An asm volatile statement must still have a valid side effects list. For example, an asm volatile statement that modifies memory must still include "memory" in the side-effects section.
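The three-statement sequence described above might be sketched as follows, with an extra statement that first reads the FPCR so that it can be restored. The function name is hypothetical. The sketch assumes that the rounding mode is held in FPCR bits [23:22], with the value 0b01 selecting round towards plus infinity, and that FRINTI rounds according to the current mode. The __aarch64__ guard and the plain C fallback are assumptions added so that the sketch builds on other targets.

```c
#include <stdint.h>

/* Rounds x to an integral value, towards plus infinity. */
double round_up(double x)
{
#if defined(__aarch64__)
    uint64_t saved, updated;
    double result;

    /* Read the current FPCR value so that it can be restored. */
    asm volatile ("mrs %[saved], fpcr" : [saved] "=r" (saved));

    /* Assumed layout: RMode is FPCR[23:22], 0b01 rounds towards +inf. */
    updated = (saved & ~(UINT64_C(3) << 22)) | (UINT64_C(1) << 22);

    /* First statement: set the rounding mode. */
    asm volatile ("msr fpcr, %[flags]" :: [flags] "r" (updated));

    /* Second statement: relies on the rounding mode set above. */
    asm volatile ("frinti %d[res], %d[x]"
                  : [res] "=w" (result)
                  : [x] "w" (x));

    /* Third statement: restore the original rounding mode. */
    asm volatile ("msr fpcr, %[flags]" :: [flags] "r" (saved));

    return result;
#else
    /* Portable fallback (assumption): valid only when x fits in a
       long long. The cast truncates towards zero. */
    double t = (double)(long long)x;
    return (t < x) ? t + 1.0 : t;
#endif
}
```

Without volatile, nothing in the operand lists connects the frinti statement to the msr statements, so the compiler would be free to reorder them.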

Labels

The compiler might output a given asm statement more than once, either as a result of optimizing the function that contains the asm statement or as a result of inlining that function into some of its callers. Therefore, asm statements must not define named labels like .loop, since if the asm statement is written more than once, the output contains more than one definition of label .loop. Instead, the assembler provides a concept of relative labels. Each relative label is simply a number and is defined in the same way as a normal label. For example, relative label 1 is defined by:

1:

The assembly code can contain many definitions of the same relative label. Code that refers to a relative label must add the letter f (forward) to refer to the next definition or the letter b (backward) to refer to the previous definition. A typical assembly loop with a pre-loop test would therefore have the following structure. This allows the compiler output to contain many copies of this code without creating any ambiguity.

    ...pre-loop test...
    b.none      2f
1:
    ...loop...
    b.any       1b
2:
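The same structure can be sketched with a scalar loop, so that the example does not depend on SVE. The function name is hypothetical, and the __aarch64__ guard with a plain C fallback is an assumption added so that the sketch builds on other targets.

```c
#include <stdint.h>

/* Returns n + (n-1) + ... + 1, or 0 when n is 0. */
uint64_t sum_to(uint64_t n)
{
#if defined(__aarch64__)
    uint64_t sum, count;
    /* count starts as n because input "[count]" (n) shares its
       register. sum is written before count is read, so it uses "=&r". */
    asm("mov  %[sum], #0                \n"
        "cbz  %[count], 2f              \n"  /* pre-loop test */
        "1:                             \n"
        "add  %[sum], %[sum], %[count]  \n"
        "subs %[count], %[count], #1    \n"
        "b.ne 1b                        \n"  /* back to previous 1: */
        "2:"
        : [sum] "=&r" (sum), [count] "=r" (count)
        : "[count]" (n)
        : "cc");
    return sum;
#else
    /* Portable fallback (assumption). */
    uint64_t sum = 0;
    while (n != 0) {
        sum += n;
        --n;
    }
    return sum;
#endif
}
```

If the compiler inlines sum_to into several callers, each copy defines its own labels 1 and 2, and each branch resolves to the nearest definition in the stated direction.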

Example

The following example shows a simple function that performs an element-wise fused multiply-add operation (x=a∙b+c). It reads the input arrays a, b, and c, and writes the result array x. Each array has n elements:

#include <math.h>

void f(double *restrict x, double *restrict a, double *restrict b, double *restrict c,
       unsigned long n)
{
  for (unsigned long i = 0; i < n; ++i)
  {
    x[i] = fma(a[i], b[i], c[i]);
  }
}

An asm statement that uses SVE instructions to achieve equivalent behavior might look like the following:

void f(double *x, double *a, double *b, double *c, unsigned long n)
{
  unsigned long i;
  asm ("whilelo p0.d, %[i], %[n]               \n\
  1:                                           \n\
        ld1d z0.d, p0/z, [%[a], %[i], lsl #3]  \n\
        ld1d z1.d, p0/z, [%[b], %[i], lsl #3]  \n\
        ld1d z2.d, p0/z, [%[c], %[i], lsl #3]  \n\
        fmla z2.d, p0/m, z0.d, z1.d            \n\
        st1d z2.d, p0, [%[x], %[i], lsl #3]    \n\
        uqincd %[i]                            \n\
        whilelo p0.d, %[i], %[n]               \n\
        b.any 1b"
   : [i] "=&r" (i)
   : "[i]" (0),
     [x] "r" (x),
     [a] "r" (a),
     [b] "r" (b),
     [c] "r" (c),
     [n] "r" (n)
   : "memory", "cc", "p0", "z0", "z1", "z2");
}

Note:

Keeping the restrict qualifiers would be valid but would have no effect.

The input specifier "[i]" (0) indicates that the assembly instructions take an input 0 in the same register as output [i]. In other words, the initial value of [i] must be zero. The use of =& in the specification of [i] indicates that [i] cannot be allocated to the same register as [x], [a], [b], [c], or [n] (because the assembly instructions use those inputs after writing to [i]).

In this example, the C variable i is not used after the asm statement. In effect the asm statement is simply reserving a register that it can use as scratch space. Including "memory" in the side effects list indicates that the asm statement reads from and writes to memory. The compiler must therefore keep the asm statement even though i is not used.

Non-Confidential 100891_0609_00_en
Copyright © 2016, 2017 Arm Limited (or its affiliates). All rights reserved.