The VFP11 coprocessor is capable of execution in each of the three pipelines independently of the others and without blocking issue or writeback from any pipeline. Separate LS, FMAC, and DS pipelines allow for parallel operation of CDP and data transfer instructions. Scheduling instructions to take advantage of the parallelism that occurs when multiple instructions execute in the VFP11 pipelines can result in a significant improvement in program execution time.
A data transfer operation can begin execution if:
no data hazards exist with any currently executing operations
the LS pipeline is not currently stalled by the ARM11 processor or busy with a data transfer multiple.
A CDP can be issued to the FMAC pipeline if:
no data hazards exist with any currently executing operations
the FMAC pipeline is available (no short vector CDP is executing and no double-precision multiply is in the first cycle of the multiply operation)
no short vector operation with unissued iterations is currently executing in either the FMAC or DS pipeline.
A divide or square root instruction can be issued to the DS pipeline if:
no data hazards exist with any currently executing operations
the DS pipeline is available (no current divide or square root is executing in the DS pipeline E1 stage)
no short vector operation with unissued iterations is executing in the FMAC pipeline.
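The three sets of issue conditions above can be read as simple predicates over the current pipeline state. The following is a minimal sketch of that reading, not a model of the actual hardware: the `PipelineState` class, its field names, and the function names are hypothetical and chosen only to mirror the wording of the lists.

```python
from dataclasses import dataclass


@dataclass
class PipelineState:
    """Hypothetical snapshot of the conditions named in the lists above."""
    data_hazard: bool = False                  # hazard with any executing operation
    ls_stalled_by_arm11: bool = False          # LS pipeline stalled by the ARM11 processor
    ls_busy_with_multiple: bool = False        # data transfer multiple in progress
    fmac_short_vector_executing: bool = False  # short vector CDP executing in FMAC
    fmac_dp_multiply_first_cycle: bool = False # double-precision multiply in its first cycle
    fmac_unissued_iterations: bool = False     # short vector in FMAC with iterations left
    ds_unissued_iterations: bool = False       # short vector in DS with iterations left
    ds_e1_busy: bool = False                   # divide or square root in the DS E1 stage


def can_start_data_transfer(s: PipelineState) -> bool:
    """Data transfer: no hazards, LS pipeline neither stalled nor busy."""
    return not (s.data_hazard or s.ls_stalled_by_arm11 or s.ls_busy_with_multiple)


def can_issue_fmac_cdp(s: PipelineState) -> bool:
    """CDP to FMAC: no hazards, FMAC available, no pending vector iterations."""
    fmac_available = not (s.fmac_short_vector_executing
                          or s.fmac_dp_multiply_first_cycle)
    no_pending_vector = not (s.fmac_unissued_iterations
                             or s.ds_unissued_iterations)
    return (not s.data_hazard) and fmac_available and no_pending_vector


def can_issue_ds(s: PipelineState) -> bool:
    """Divide or square root to DS: no hazards, DS E1 free, no pending FMAC vector."""
    ds_available = not s.ds_e1_busy
    return (not s.data_hazard) and ds_available and not s.fmac_unissued_iterations
```

Note the asymmetry the lists describe: a short vector with unissued iterations in the DS pipeline blocks issue to the FMAC pipeline, whereas issue to the DS pipeline is checked only against unissued iterations in the FMAC pipeline.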
Table 22.15 shows a case of the VFP11 coprocessor executing instructions in parallel in each of the three pipelines:
a load multiple in the LS pipeline
a divide in the DS pipeline
a short vector add in the FMAC pipeline.
In this example, the LEN field contains b011, selecting a vector length of four iterations, and the STRIDE field contains b00, for a vector stride of one.
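As an illustration of how the LEN and STRIDE values map to a vector length and stride, the sketch below decodes both fields from an FPSCR value. It assumes the usual FPSCR layout (LEN in bits [18:16], STRIDE in bits [21:20]), which is not stated in this section; the function names are hypothetical.

```python
LEN_SHIFT, LEN_MASK = 16, 0b111       # assumed position of the LEN field
STRIDE_SHIFT, STRIDE_MASK = 20, 0b11  # assumed position of the STRIDE field


def vector_length(fpscr: int) -> int:
    """Return the number of iterations selected by LEN (encoded as length - 1)."""
    return ((fpscr >> LEN_SHIFT) & LEN_MASK) + 1


def vector_stride(fpscr: int) -> int:
    """Return the vector stride: b00 selects 1, b11 selects 2."""
    field = (fpscr >> STRIDE_SHIFT) & STRIDE_MASK
    if field == 0b00:
        return 1
    if field == 0b11:
        return 2
    raise ValueError(f"reserved STRIDE encoding: {field:#04b}")


# The settings used in this example: LEN = b011, STRIDE = b00.
fpscr = (0b011 << LEN_SHIFT) | (0b00 << STRIDE_SHIFT)
print(vector_length(fpscr), vector_stride(fpscr))
```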
Example 22.13. Parallel execution in all three pipelines
FLDM [R4], {S4-S13}
FDIVS S0, S1, S2
FADDS S16, S20, S24
Table 22.15 shows the pipeline progression of the three instructions.
Table 22.15. Parallel execution in all three pipelines
| Instruction / cycle number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FLDM | D | I | E | M1 | M2 | W | W | W | W | W | - | - | - | - | - |
| FDIVS | - | D | I | E1’ | E1 | E1 | E1 | E1 | E1 | E1 | E1 | E1 | E1 | E1 | E1 |
| FADDS | - | - | D | I | E1 | E1 | E1 | E1 | E2 | E3 | E4 | E5 | E6 | E7 | W |
In Example 22.13, no data hazards exist between any of the three instructions. The load multiple is able to begin execution immediately, and data is transferred to the register file beginning in cycle 6. Because the destination is in bank 0, the FDIVS is a scalar operation and requires one cycle in the FMAC pipeline E1 stage. If the FDIVS were a short vector operation, the FADDS could not begin execution until the last FDIVS iteration passed the FMAC E1 pipeline stage. The FADDS is a short vector operation and requires the FMAC pipeline E1 stage for cycles 5-8.
E1’ is the first cycle in E1 and is in both FMAC and DS blocks. Subsequent E1 cycles represent the iteration cycles and occupy both E1 and E2 stages in the DS block.
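The FMAC E1 occupancy described for FDIVS and FADDS can be reproduced with a small calculation. This is a rough sketch under two assumptions: single-precision registers S0-S31 are grouped into four banks of eight, with a bank 0 destination making the operation scalar, and each single-precision iteration occupies the FMAC E1 stage for one cycle, as Table 22.15 suggests. The helper names are hypothetical.

```python
def bank_of(sreg: int) -> int:
    """Bank number of a single-precision register (assumes four banks of eight)."""
    return sreg // 8


def iterations(dest_sreg: int, vector_len: int) -> int:
    """A destination in bank 0 makes the operation scalar, whatever LEN selects."""
    return 1 if bank_of(dest_sreg) == 0 else vector_len


def fmac_e1_cycles(first_e1_cycle: int, dest_sreg: int, vector_len: int) -> list[int]:
    """Cycles in which the operation holds the FMAC E1 stage (one per iteration)."""
    count = iterations(dest_sreg, vector_len)
    return list(range(first_e1_cycle, first_e1_cycle + count))


# FDIVS S0, S1, S2: destination in bank 0, so scalar; E1' falls in cycle 4.
print(fmac_e1_cycles(4, 0, 4))
# FADDS S16, S20, S24: destination in bank 2, so four iterations from cycle 5.
print(fmac_e1_cycles(5, 16, 4))
```

With a vector length of four, the second call yields cycles 5 through 8, matching the FADDS row of the table.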