### 4.6.5. Matrix multiplication example

You can use NEON to improve the performance of matrix multiplication.

#### What is matrix multiplication

In mathematics, matrix multiplication is a binary operation that takes a pair of matrices, and produces another matrix. Mathematically, if C is a matrix resulting from the multiplication of two matrices, A and B, then the elements `cij` of C are given by the following equation:

Where k = 1, 2,…, n is the number of columns in A and the number of rows in B.

In other words, you multiply each of the elements of a row in the left-hand matrix by the corresponding elements of a column in the right-hand matrix, and then sum the resulting n products to obtain one element in the product matrix.

#### Matrix multiplication implementation using the C code

Suppose you want to calculate the product matrix C of multiplying the following two matrices A and B:

To use the C code to calculate the product matrix, you can store matrix elements into arrays, and use loops to calculate each element of the product matrix. The C code implementation can be as follows:

```f32 *a, *b, *c;
```
```f32 sum;
```
```
```
```for (i = 0; i < 4; i++)
```
```{
```
```	for (k = 0; k < 4; k++)
```
```	{
```
``` 		sum = 0.0f;
```
```
```
```		for (j = 0; j < 4; j++)
```
``` 		{
```
```			sum += a[i*4+j] * b[j*4 + k]
```
``` 		}
```
```
```
``` 		// Product matrix elements are calculated and stored into array c.
```
``` 		c[i*4 + k] = sum;
```
```
```
``` 	}
```
```}
```

However, the C code implementation is not efficient.

#### Matrix multiplication implementation using NEON

To achieve improved performance, you can use NEON instructions to multiply the four-by-four matrices. Assume that the matrices are stored in memory in column-major order. To identify suboperations that can be implemented using NEON instructions, expand the matrix multiply operation in detail, as shown in Figure 4.4.

Figure 4.4. Matrix multiplication showing one column of results

In Figure 4.4, each column of the first matrix in red is multiplied by a corresponding single value in the second matrix in blue. Then, the individual products are added to give a column of values for the results matrix. This operation is repeated for the remaining columns in the results matrix.

Figure 4.5. NEON vector-by-scalar multiplication

If each column is a vector in a NEON register, the vector-by-vector multiply instruction efficiently calculates each column in the results matrix. You can implement the suboperation highlighted in Figure 4.5 by using this instruction. Use the accumulating version of the multiply instruction to accumulate the individual products from each column of the first matrix.

This example operates on the columns of the first matrix, and produces a column of results. Therefore, reading and writing elements to and from memory is a linear operation, and does not require interleaving load or store instructions.

Suppose this is an implementation that multiplies single precision floating-point matrices. Each matrix contains sixteen 32-bit floating-point values.

You can begin by loading the matrices from memory into NEON registers. The matrices are stored in column-major order, so columns of the matrix are stored linearly in memory. A column can be loaded into NEON registers using `VLD1`, and written back to memory using `VST1`. You can load two matrices into NEON registers by using the following instructions:

```vld1.32  {d16-d19}, [r1]!            @ load the first eight elements of matrix 0
```
```vld1.32  {d20-d23}, [r1]!            @ load the second eight elements of matrix 0
```
```vld1.32  {d0-d3}, [r2]!              @ load the first eight elements of matrix 1
```
```vld1.32  {d4-d7}, [r2]!              @ load the second eight elements of matrix 1
```

Because NEON has 32 × 64-bit registers, you can load all of the elements from both input matrices into registers, and still have registers left for use as accumulators. You can use d16 to d23 hold 16 elements from the first matrix, and d0 to d7 hold 16 elements from the second.

In this example, you must handle columns of four 32-bit floating point numbers, which fit into a single 128-bit Q register. Therefore, you can use Q registers frequently. You can calculate a column of results by using only four NEON multiply instructions:

```vmul.f32    q12, q8, d0[0] 			@ multiply col element 0 by matrix col 0
```
```vmla.f32    q12, q9, d0[1] 			@ multiply-acc col element 1 by matrix col 1
```
```vmla.f32    q12, q10, d1[0]			@ multiply-acc col element 2 by matrix col 2
```
```vmla.f32    q12, q11, d1[1]			@ multiply-acc col element 3 by matrix col 3
```
```
```

The first instruction, vmul.f32, implements the operation highlighted in Figure 4.5 - a11, a21, a31, and a41 in register q8 are each multiplied by b11 , which is element 0 in register d0, and then stored in q12.

Subsequent instructions operate on other columns of the first matrix, and multiply by corresponding elements of the first column of the second matrix. Results are accumulated into q12 to give the first column of values for the results matrix.

### Note

The scalar used in the multiply instructions refers to D registers. Although q0[3] is the same value as d1[1], the GNU assembler might not accept the use of Q register to refer to a scalar. Therefore, specify a scalar using a D register.

If you only need to calculate a matrix-by-vector multiplication, which is a common operation in 3D graphics, the operation can be considered complete, and the result vector can be stored to memory. However, to complete the matrix-by-matrix multiplication, you must execute three more iterations:

• The second iteration uses values b12 - b42 in register q1.

• The third iteration uses values b13 - b43 in register q2.

• The fourth iteration uses values b14 - b44 in register q3.

To simplify your code, you can create a macro for the instructions above.

```.macro mul_col_f32 res_q, col0_d, col1_d
```
```vmul.f32    \res_q, q8, \col0_d[0] @ multiply col element 0 by matrix col 0
```
```vmla.f32    \res_q, q9, \col0_d[1] @ multiply-acc col element 1 by matrix col 1
```
```vmla.f32    \res_q, q10, \col1_d[0] @ multiply-acc col element 2 by matrix col 2
```
```vmla.f32    \res_q, q11, \col1_d[1] @ multiply-acc col element 3 by matrix col 3
```
```.endm
```

The implementation of a four-by-four floating point matrix multiplication now looks like this:

```vld1.32  {d16-d19}, [r1]! @ load first eight elements of matrix 0
```
```vld1.32  {d20-d23}, [r1]! @ load second eight elements of matrix 0
```
```vld1.32  {d0-d3}, [r2]! @ load first eight elements of matrix 1.
```
```vld1.32  {d4-d7}, [r2]! @ load second eight elements of matrix 1.
```
```
```
```mul_col_f32 q12, d0, d1 @ matrix 0 * matrix 1 col 0
```
```mul_col_f32 q13, d2, d3 @ matrix 0 * matrix 1 col 1
```
```mul_col_f32 q14, d4, d5 @ matrix 0 * matrix 1 col 2
```
```mul_col_f32 q15, d6, d7 @ matrix 0 * matrix 1 col 3
```
```
```
```vst1.32  {d24-d27}, [r0]! @ store first eight elements of result.
```
```vst1.32  {d28-d31}, [r0]! @ store second eight elements of result.
```

#### Performance comparison between C and NEON

The two different implementations of the matrix multiplication have significant performance differences. Based on the performance tests on Cortex-A8, the application using NEON is 4x times the performance of the application using the C code.