The VFP9-S pipelines are capable of parallel and independent execution without blocking issue or writeback from any pipeline. Table 4.18 shows:
- a scalar divide in the DS pipeline
- a short vector add in the FMAC pipeline
- a load multiple in the LS pipeline.

In Example 4.16, the vector length is four iterations (LEN = 3), and the stride is one (STRIDE = 0).
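The relationship between the encoded fields and the effective vector length and stride can be sketched in Python. This is a minimal illustration, not part of the VFP9-S documentation: the function name `decode_vector_fields` is hypothetical, and the bit positions (LEN in FPSCR bits [18:16], STRIDE in bits [21:20]) are assumed from the VFP architecture definition of the FPSCR rather than taken from this section.

```python
def decode_vector_fields(fpscr: int) -> tuple[int, int]:
    """Return (vector_length, stride) for a given FPSCR value.

    Assumed layout (VFP architecture, not stated in this section):
      LEN    = FPSCR[18:16], vector length is LEN + 1
      STRIDE = FPSCR[21:20], 0b00 means stride 1, 0b11 means stride 2
    """
    len_field = (fpscr >> 16) & 0b111
    stride_field = (fpscr >> 20) & 0b11
    if stride_field == 0b00:
        stride = 1
    elif stride_field == 0b11:
        stride = 2
    else:
        # The two remaining encodings are not valid stride values.
        raise ValueError(f"invalid STRIDE encoding: {stride_field:#04b}")
    return len_field + 1, stride


# Settings used in Example 4.16: LEN = 3, STRIDE = 0.
example_fpscr = (3 << 16) | (0 << 20)
length, stride = decode_vector_fields(example_fpscr)
```

With LEN = 3 and STRIDE = 0, the sketch yields a vector length of four iterations and a stride of one, matching the values stated for Example 4.16.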
Example 4.16. Parallel execution in all three pipelines
FDIVS S0, S1, S2
FADDS S16, S20, S24
FLDM [R4], {S4-S8}
Table 4.18 shows the pipeline stages for Example 4.16. The first and shared Execute 1 cycle for the divide is designated as E1’.
Table 4.18. Pipeline stages for Example 4.16
| Instruction / Cycle number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FDIVS | F | D | E1’ | E1 | E1 | E1 | E1 | E1 | E1 | E1 | E1 | E1 | E1 | E1 | E1 | E1 | E2 |
| FADDS | - | F | F | D | E1 | E1 | E1 | E1 | E2 | E3 | E4 | - | - | - | - | - | - |
| FLDM | - | - | - | F | F | D | E | M | W | W | W | W | W | - | - | - | - |
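As a reading aid, the rows of Table 4.18 can be transcribed into a small Python structure and queried per stage. This is only a sketch of the table as printed: the names `TABLE` and `cycles_in_stage` are hypothetical, and the shared first Execute 1 cycle of the divide is written with an ASCII apostrophe as `E1'`.

```python
# One entry per instruction cycle (cycles 1 to 17), transcribed from Table 4.18.
TABLE = {
    "FDIVS": ["F", "D", "E1'"] + ["E1"] * 13 + ["E2"],
    "FADDS": ["-", "F", "F", "D"] + ["E1"] * 4 + ["E2", "E3", "E4"] + ["-"] * 6,
    "FLDM": ["-"] * 3 + ["F", "F", "D", "E", "M"] + ["W"] * 5 + ["-"] * 4,
}


def cycles_in_stage(instruction: str, stage: str) -> list[int]:
    """Return the 1-based cycle numbers in which an instruction occupies a stage."""
    return [i + 1 for i, s in enumerate(TABLE[instruction]) if s == stage]


# Cycles in which the FMAC Execute 1 stage is occupied:
# the divide borrows it once (E1'), the vector add uses it once per iteration.
fmac_e1_divide = cycles_in_stage("FDIVS", "E1'")
fmac_e1_add = cycles_in_stage("FADDS", "E1")
fmac_e1_conflict = set(fmac_e1_divide) & set(fmac_e1_add)
```

In this transcription the divide occupies the FMAC Execute 1 stage only in cycle 3 and the add occupies it in cycles 5-8, so `fmac_e1_conflict` is empty.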
When all previous data hazards and resource hazards are removed, the instructions are issued to the VFP9-S every other cycle. No data hazards exist between the three instructions in Example 4.16.

The FDIVS destination is in bank 0, so the operation is scalar regardless of the LEN value in the FPSCR register. Divide and square root instructions require one cycle in the FMAC Execute 1 stage. In this example, cycle 3 uses both the FMAC and DS Execute 1 stages. Subsequent divide cycles require only the DS Execute 1 and Execute 2 stages, and the FMAC Execute 1 stage is available. If the FDIVS were a short vector instruction, the FADDS would not begin execution until the last iteration of the FDIVS passed the first Execute 1 stage.

The FADDS is a short vector instruction and requires the FMAC Execute 1 stage for cycles 5-8. To the ARM9E processor, the vector FADDS operation appears as a single-cycle instruction regardless of the number of VFP9-S iterations.

If there are no stalls due to memory latency, the FLDM begins execution in cycle 4 and starts transferring data to the VFP9-S coprocessor in cycle 9. The presence of the FLDM in the ARM9E processor pipeline causes subsequent operations to stall until cycle 11, when the ARM Execute stage is no longer required by the FLDM. In cycle 11, another ARM or VFP instruction can enter the Execute stage.
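The scalar-versus-vector distinction described above depends on which register bank holds the destination. The Python sketch below illustrates that rule for the single-precision registers in Example 4.16. It assumes the VFP architecture convention of four banks of eight single-precision registers, with bank 0 (S0-S7) always scalar and vector register sequences wrapping within their bank. Those details are not stated in this section, and the function names are hypothetical.

```python
BANK_SIZE = 8  # assumed: single-precision registers per bank (S0-S7, S8-S15, ...)


def is_scalar_destination(sd: int) -> bool:
    """A destination in bank 0 makes the operation scalar, whatever LEN is."""
    return sd // BANK_SIZE == 0


def vector_registers(first: int, length: int, stride: int) -> list[int]:
    """Registers touched by one operand of a short vector operation.

    Assumes the sequence wraps around inside the bank of the first register.
    """
    bank_base = (first // BANK_SIZE) * BANK_SIZE
    offset = first % BANK_SIZE
    return [bank_base + (offset + i * stride) % BANK_SIZE for i in range(length)]


# FDIVS S0, S1, S2: destination S0 is in bank 0, so the divide is scalar.
fdivs_is_scalar = is_scalar_destination(0)

# FADDS S16, S20, S24 with length 4 and stride 1: one iteration per register.
fadds_dest = vector_registers(16, 4, 1)
fadds_src1 = vector_registers(20, 4, 1)
fadds_src2 = vector_registers(24, 4, 1)
```

Under these assumptions the FADDS performs four iterations, writing S16-S19 from S20-S23 and S24-S27, which corresponds to the four FMAC Execute 1 cycles (5-8) in Table 4.18.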