The VFP11 coprocessor can execute in each of the three pipelines independently of the others and without blocking issue or writeback from any pipeline. Separate LS, FMAC, and DS pipelines enable parallel operation of CDP and data transfer instructions. Scheduling instructions to take advantage of the parallelism that occurs when multiple instructions execute in the VFP11 pipelines can result in a significant improvement in program execution time.
A data transfer operation can begin execution if:
No data hazards exist with any currently executing operations.
The LS pipeline is not currently stalled by the ARM1136 processor or busy with a data transfer multiple.
A CDP can be issued to the FMAC pipeline if:
No data hazards exist with any currently executing operations.
The FMAC pipeline is available. The pipeline is available if no short vector CDP is executing and no double-precision multiply is in the first cycle of the multiply operation.
No short vector operation with unissued iterations is currently executing in either the FMAC or DS pipeline.
A divide or square root instruction can be issued to the DS pipeline if:
No data hazards exist with any currently executing operations.
The DS pipeline is available. The pipeline is available if no current divide or square root is executing in the DS pipeline E1 stage.
No short vector operation with unissued iterations is executing in the FMAC pipeline.
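The three sets of issue conditions above can be summarized as predicates over a snapshot of the pipeline state. The following is a minimal sketch, not a model of the actual VFP11 issue logic: the `PipelineState` class and all of its field names are hypothetical and exist only to mirror the conditions listed in the text.

```python
from dataclasses import dataclass


@dataclass
class PipelineState:
    """Hypothetical snapshot of coprocessor state (illustrative names only)."""
    data_hazard: bool = False                    # hazard with an executing operation
    ls_stalled: bool = False                     # LS pipeline stalled by the ARM1136 processor
    ls_busy_multiple: bool = False               # LS pipeline busy with a data transfer multiple
    fmac_short_vector_cdp: bool = False          # short vector CDP executing in FMAC
    fmac_dp_multiply_first_cycle: bool = False   # double-precision multiply in its first cycle
    fmac_unissued_iterations: bool = False       # short vector op with unissued iterations in FMAC
    ds_unissued_iterations: bool = False         # short vector op with unissued iterations in DS
    ds_e1_busy: bool = False                     # divide or square root in the DS E1 stage


def can_issue_data_transfer(s: PipelineState) -> bool:
    """Conditions for a data transfer operation to begin execution."""
    return not s.data_hazard and not (s.ls_stalled or s.ls_busy_multiple)


def can_issue_fmac_cdp(s: PipelineState) -> bool:
    """Conditions for issuing a CDP to the FMAC pipeline."""
    fmac_available = not (s.fmac_short_vector_cdp or s.fmac_dp_multiply_first_cycle)
    no_pending_vector = not (s.fmac_unissued_iterations or s.ds_unissued_iterations)
    return not s.data_hazard and fmac_available and no_pending_vector


def can_issue_ds(s: PipelineState) -> bool:
    """Conditions for issuing a divide or square root to the DS pipeline."""
    ds_available = not s.ds_e1_busy
    return not s.data_hazard and ds_available and not s.fmac_unissued_iterations
```

In this sketch, a divide occupying the DS E1 stage blocks only another divide or square root, while a data hazard blocks issue to all three pipelines.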
Example 4.13 shows a case of the VFP11 coprocessor executing instructions in parallel in each of the three pipelines:
a load multiple in the LS pipeline
a divide in the DS pipeline
a short vector add in the FMAC pipeline.
In this example, the LEN field contains b011, selecting a vector length of four iterations, and the STRIDE field contains b00, for a vector stride of one.
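The LEN and STRIDE encodings quoted above can be decoded as in the following sketch. The function names are hypothetical. The treatment of the other STRIDE encodings, the bank size of eight single-precision registers, and the wraparound within a bank are assumptions taken from the general VFP architecture rather than from this section.

```python
def decode_len(len_field: int) -> int:
    """Vector length in iterations: the LEN field holds the length minus one."""
    if not 0 <= len_field <= 0b111:
        raise ValueError("LEN is a 3-bit field")
    return len_field + 1


def decode_stride(stride_field: int) -> int:
    """Vector stride. Assumption: only b00 (stride 1) and b11 (stride 2) are valid."""
    strides = {0b00: 1, 0b11: 2}
    if stride_field not in strides:
        raise ValueError("assumed reserved STRIDE encoding")
    return strides[stride_field]


def vector_registers(first: int, length: int, stride: int, bank_size: int = 8) -> list:
    """Single-precision register numbers touched by a short vector operand.

    Assumption: registers are grouped in banks of `bank_size`, and a vector
    wraps around within the bank that contains its first register.
    """
    bank_base = (first // bank_size) * bank_size
    offset = first % bank_size
    return [bank_base + (offset + i * stride) % bank_size for i in range(length)]
```

Under these assumptions, with LEN = b011 and STRIDE = b00, the destination of `FADDS S16, S20, S24` covers `vector_registers(16, 4, 1)`, that is, S16 to S19.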
Example 4.13. Parallel execution in all three pipelines
FLDM [R4], {S4-S13}
FDIVS S0, S1, S2
FADDS S16, S20, S24
Table 4.16 shows the pipeline progression of the three instructions.
Table 4.16. Parallel execution in all three pipelines
| Instruction / Cycle number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FLDM | D | I | E | M1 | M2 | W | W | W | W | W | - | - | - | - | - |
| FDIVS | - | D | I | E1’ | E1 | E1 | E1 | E1 | E1 | E1 | E1 | E1 | E1 | E1 | E1 |
| FADDS | - | - | D | I | E1 | E1 | E1 | E1 | E2 | E3 | E4 | E5 | E6 | E7 | W |
In Example 4.13, no data hazards exist between any of the three instructions. The load multiple can begin execution immediately, and data is transferred to the register file beginning in cycle 6. Because its destination is in bank 0, the FDIVS is a scalar operation and requires one cycle in the FMAC pipeline E1 stage. If the FDIVS were a short vector operation, the FADDS could not begin execution until the last FDIVS iteration had passed the FMAC E1 pipeline stage. The FADDS is a short vector operation and requires the FMAC pipeline E1 stage for cycles 5-8.
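The scalar-versus-vector distinction used above can be expressed as a small check. This is a sketch only: the function name is hypothetical, and it assumes single-precision banks of eight registers, with bank 0 holding S0 to S7, as in the general VFP architecture.

```python
def is_scalar_operation(dest_register: int, vector_length: int) -> bool:
    """Return True if a single-precision CDP executes as a scalar operation.

    Assumptions: the operation is scalar when the vector length is one, or
    when the destination register lies in bank 0 (S0-S7).
    """
    in_bank_0 = 0 <= dest_register < 8
    return vector_length == 1 or in_bank_0
```

With a vector length of four, `is_scalar_operation(0, 4)` corresponds to the FDIVS (destination S0, scalar) and `is_scalar_operation(16, 4)` to the FADDS (destination S16, short vector).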
E1’ is the first cycle in E1 and is in both FMAC and DS blocks. Subsequent E1 cycles represent the iteration cycles and occupy both E1 and E2 stages in the DS block.
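The cycle numbers quoted in the discussion can be cross-checked against Table 4.16 with a short script. The sketch below transcribes the table rows by hand and writes E1’ with an ASCII apostrophe. The helper name `cycles_in_stage` is hypothetical.

```python
# Rows of Table 4.16, one entry per cycle (cycle 1 is index 0).
TABLE_4_16 = {
    "FLDM":  ["D", "I", "E", "M1", "M2", "W", "W", "W", "W", "W",
              "-", "-", "-", "-", "-"],
    "FDIVS": ["-", "D", "I", "E1'", "E1", "E1", "E1", "E1", "E1", "E1",
              "E1", "E1", "E1", "E1", "E1"],
    "FADDS": ["-", "-", "D", "I", "E1", "E1", "E1", "E1", "E2", "E3",
              "E4", "E5", "E6", "E7", "W"],
}


def cycles_in_stage(instruction: str, stage: str) -> list:
    """Return the 1-based cycle numbers in which an instruction is in a stage."""
    row = TABLE_4_16[instruction]
    return [cycle for cycle, s in enumerate(row, start=1) if s == stage]


# FMAC E1 occupancy: the FDIVS uses it only in its E1' cycle, and the FADDS
# uses it for each of its four iterations.
fmac_e1_fdivs = cycles_in_stage("FDIVS", "E1'")
fmac_e1_fadds = cycles_in_stage("FADDS", "E1")
```

Here `fmac_e1_fdivs` and `fmac_e1_fadds` do not overlap, which matches the statement that the scalar FDIVS needs the FMAC E1 stage for only one cycle before the FADDS iterations begin.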