1.7. Parallel execution of instructions
The VFP11 coprocessor can execute several floating-point
operations in parallel while the ARM1136JF-S processor executes
ARM instructions. Although a short vector operation executes for a
number of cycles in the VFP11 coprocessor, it appears to the ARM1136JF-S processor
as a single-cycle instruction and is retired in the ARM1136JF-S processor
before it completes execution in the VFP11 coprocessor. The
three pipelines in the VFP11 coprocessor operate independently of
one another once initial processing is completed. This means you
can issue a short vector operation, issue a load or store multiple
operation in the next cycle, and have both execute at the same
time, provided no data hazards exist between the two instructions.
With this mechanism, you can write double-buffered algorithms that
hide much of the time taken to transfer data to and from the VFP11 coprocessor
behind the arithmetic operations. Overlapping transfers and arithmetic in this
way can significantly improve performance. The separate divide and square
root (DS) pipeline enables both data transfer operations and CDP instructions
that do not use the DS pipeline to execute in parallel with a divide or
square root operation. The DS block has a dedicated write port
to the register file, and executing operations in parallel with
divide or square root instructions does not require any special care. For
more information, see Parallel execution.
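
The double-buffering pattern described above can be sketched in portable C. This is an illustration only, not code from this manual: the function name scale_double_buffered, the BLOCK size of 4, and the two-row array buf are all hypothetical. The array rows stand in for two disjoint banks of VFP11 registers, the multiply loop stands in for a short vector operation, and the copy loops stand in for load and store multiple operations. Whether the operations actually overlap depends on the compiler and the hardware; the sketch shows only the dependency structure that makes the overlap possible.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 4 /* hypothetical block size, standing in for a short vector length */

/* Multiply n floats by k, alternating between two buffers.
 * Assumption: n is a multiple of BLOCK. */
static void scale_double_buffered(const float *src, float *dst,
                                  size_t n, float k)
{
    float buf[2][BLOCK];      /* stand-ins for two disjoint register banks */
    size_t nblocks = n / BLOCK;
    size_t b, i;
    int cur = 0;

    assert(n % BLOCK == 0);
    if (nblocks == 0)
        return;

    /* Prologue: load the first block before any arithmetic starts. */
    for (i = 0; i < BLOCK; i++)
        buf[cur][i] = src[i];

    for (b = 0; b < nblocks; b++) {
        int nxt = cur ^ 1;

        /* Arithmetic on the current buffer (the "short vector operation"). */
        for (i = 0; i < BLOCK; i++)
            buf[cur][i] *= k;

        /* Load the next block into the other buffer. It touches none of the
         * data used by the arithmetic above, so no data hazard exists
         * between the two steps. */
        if (b + 1 < nblocks)
            for (i = 0; i < BLOCK; i++)
                buf[nxt][i] = src[(b + 1) * BLOCK + i];

        /* Store the current results. This step does depend on the
         * arithmetic, so it cannot overlap with it. */
        for (i = 0; i < BLOCK; i++)
            dst[b * BLOCK + i] = buf[cur][i];

        cur = nxt; /* swap buffer roles for the next iteration */
    }
}
```

Calling scale_double_buffered(in, out, 8, 2.0f) on an eight-element array writes each input value doubled to out. The point of the structure is that the load of block b + 1 is independent of the arithmetic on block b, which is the condition the text gives for the two operations to execute at the same time.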