10.9.3 Modeling branch prediction

This section demonstrates various techniques for measuring the effectiveness of different branch prediction algorithms.

Branch predictor types and parameters

The BranchPrediction plugin allows you to select the branch prediction algorithm to use, the type of statistics to collect, and the misprediction latency.

The plugin parameters that are used in this tutorial are as follows:

Table 10-6 BranchPrediction plugin parameters

Plugin parameter Purpose in this example Values that are used in this example
predictor-type Comparing the impact of different branch prediction algorithms.
  • FixedDirectionPredictor

  • BiModalPredictor

  • GSharePredictor

  • CortexA53Predictor

mispredict-latency Simulating the additional latency due to a pipeline flush that is caused by a branch misprediction.

11. This value is the minimum pipeline flush length for a Cortex-A73 processor.

bpstat-pathfilename Providing statistics about the branch prediction behavior, to determine per-branch and overall predictor accuracy. stats.txt

The different predictor types that are used in this example behave as follows:

FixedDirectionPredictor

Always predicts branches as TAKEN.

BiModalPredictor

Uses a 2-bit state machine to classify branches as one of STRONGLY_NOT_TAKEN, WEAKLY_NOT_TAKEN, WEAKLY_TAKEN, or STRONGLY_TAKEN, and predicts accordingly. Tracks up to 512 individual branch instructions by address.

GSharePredictor

Uses the history of the eight most recently executed branch instructions to classify a set of branch instructions, based on the instruction address, as one of STRONGLY_NOT_TAKEN, WEAKLY_NOT_TAKEN, WEAKLY_TAKEN, or STRONGLY_TAKEN, and predicts accordingly. Unlike the BiModalPredictor, it is not limited to a specific number of branch instruction addresses, but it is less precise than BiModalPredictor.

CortexA53Predictor

Implements the Cortex-A53 branch prediction algorithm.

To help you understand the algorithms in more detail, the source code for these branch predictors, except CortexA53Predictor, is provided under $PVLIB_HOME/plugins/source/BranchPrediction/.

Generating branch misprediction statistics

There are two ways to trace branch mispredictions when running an application:

  • Use the statistics that are produced by the BranchPrediction plugin to get an overall picture, without context about the execution order.
  • Load the BranchPrediction plugin and use the MTI trace sources INST, BRANCH_MISPREDICT, and WAYPOINT to see branch misprediction details for individual instructions in execution order.
BranchPrediction plugin statistics

The statistics feature of the BranchPrediction plugin provides overall and per-branch statistics, which are saved to a file when the model exits. You can specify the filename and location using the bpstat-pathfilename parameter.

The overall branch prediction statistics are described in the following table:

Table 10-7 Overall statistics

Statistic Description Example
Processor Core Name of the core to which the branch prediction plugin was connected. ARM_Cortex-A73
Cluster instance The cluster number in the processor. 0
Core instance The core number in the cluster. 0
Mispredict Latency The branch misprediction latency as specified using the mispredict-latency parameter. 11
Image executed The name of the application file that was executed. ta_bpred.axf
PredictorType The branch prediction algorithm as specified using the predictor-type parameter. FixedDirectionPredictor
Total branch calls The total number of times all branch instructions were executed. 37434
Total Mispredictions The total number of mispredictions for all executed branch instructions. 5106
Average prediction accuracy The fraction of all branch instructions that were correctly predicted. 0.8636
Conditional Branches The total number of unique conditional branch instructions. This figure does not include the instructions CBZ and CBNZ. 123
Total unique branch instructions The total number of unique conditional and unconditional branch instructions. 300

The following table shows the BranchPrediction plugin statistics for each unique branch instruction. They can be used to analyze how a given branch prediction algorithm behaves with a particular type of branch instruction. The branch prediction example program uses this information to determine how effectively the different branch prediction algorithms predict different types of branches.

Table 10-8 Per-branch statistics

Statistic Description Example
PC Addr The address of the branch instruction. 0x8000062c
Calls The total number of times the branch was called. 2100
Mispredict The total number of times the branch was mispredicted. 260
Accuracy The fraction of calls to the branch instruction that were correctly predicted. 0.87619
MTI trace sources

The INST trace source, described earlier in this tutorial, can be used to show the latency that is added to the instruction execution time by a branch misprediction.

Whenever the BranchPrediction plugin makes a branch misprediction, the BRANCH_MISPREDICT trace source prints the address of the branch instruction that was mispredicted. This address can be compared with the address from the corresponding INST trace event to determine the exact branch instruction involved. The number of BRANCH_MISPREDICT entries for a given branch address at the end of the simulation matches the Mispredict count for that address that is shown in the BranchPrediction plugin statistics file.

The WAYPOINT trace source prints an event whenever an effective branch operation takes place. This event includes the address of the branch instruction, the target address of the branch, whether the branch is conditional, and whether it was taken. This trace source requires instruction prefetching to be enabled. Combined with a BRANCH_MISPREDICT trace event, it can be used to determine whether a branch was mispredicted as TAKEN or NOT_TAKEN.

To collect trace from these sources, run the model with the GenericTrace and BranchPrediction plugins. For example:

$PVLIB_HOME/examples/SystemCExport/EVS_Platforms/EVS_Base/Build_Cortex-A73x1/EVS_Base_Cortex-A73x1.x \
-C Base.bp.secure_memory=0 \
-C Base.cache_state_modelled=1 \
-C Base.cluster0.icache-prefetch_enabled=1 \
--plugin=$PVLIB_HOME/plugins/Linux64_GCC-4.8/BranchPrediction.so \
-C BranchPrediction.BranchPrediction.predictor-type=FixedDirectionPredictor \
-C BranchPrediction.BranchPrediction.mispredict-latency=11 \
-C BranchPrediction.BranchPrediction.bpstat-pathfilename=stats.txt \
--plugin=$PVLIB_HOME/plugins/Linux64_GCC-4.8/GenericTrace.so \
-C TRACE.GenericTrace.trace-sources=INST,BRANCH_MISPREDICT,WAYPOINT \
-C TRACE.GenericTrace.trace-file=trace.txt \
-a $PVLIB_HOME/images/ta_brpred.axf \
--stat
Example trace for a branch misprediction

The following example trace is for a branch misprediction with a misprediction latency of 11 ticks:

INST: PC=0x0000000080000628 OPCODE=0x7100655f SIZE=0x04 MODE=EL1h ISET=AArch64 
PADDR=0x0000000080000628 NSDESC=0x01 PADDR2=0x0000000080000628 NSDESC2=0x01 NS=0x01
ITSTATE=0x00 INST_COUNT=0x000000000001080b LOCAL_TIME=0x000000000003f7a0
CURRENT_TIME=0x000000002eab53a0 CORE_NUM=0x00 DISASS="CMP      w10,#0x19"

INST: PC=0x000000008000062c OPCODE=0x54000168 SIZE=0x04 MODE=EL1h ISET=AArch64
PADDR=0x000000008000062c NSDESC=0x01 PADDR2=0x000000008000062c NSDESC2=0x01 NS=0x01
ITSTATE=0x00 INST_COUNT=0x000000000001080c LOCAL_TIME=0x0000000000041eb0
CURRENT_TIME=0x000000002eab7ab0 CORE_NUM=0x00 DISASS="B.HI     {pc}+0x2c ; 0x80000658"

WAYPOINT: PC=0x000000008000062c ISET=AArch64 TARGET=0x0000000080000658
TARGET_ISET=AArch64 TAKEN=N IS_COND=Y CORE_NUM=0x00

BRANCH_MISPREDICT: PC=0x000000008000062c

INST: PC=0x0000000080000630 OPCODE=0x7100151f SIZE=0x04 MODE=EL1h ISET=AArch64
PADDR=0x0000000080000630 NSDESC=0x01 PADDR2=0x0000000080000630 NSDESC2=0x01 NS=0x01
ITSTATE=0x00 INST_COUNT=0x000000000001080d LOCAL_TIME=0x000000000005f370
CURRENT_TIME=0x000000002ead4f70 CORE_NUM=0x00 DISASS="CMP      w8,#5"

The following information can be gathered from this trace:

  • The branch instruction at address 0x8000062c was mispredicted, as shown by the BRANCH_MISPREDICT trace event.
  • The branch was conditional, and was incorrectly predicted as TAKEN, as shown by the TAKEN=N field in the WAYPOINT trace event. The PC field value from this source must correspond to the PC field value from the BRANCH_MISPREDICT source.
  • As a result of the misprediction, the instruction following the branch instruction took 120,000 picoseconds, or 12 ticks to complete. The misprediction latency was defined as 11 ticks, so the instruction would have taken only 1 tick to complete if the branch had been predicted correctly. The execution time is the difference between:
    • The CURRENT_TIME value for the INST trace before the BRANCH_MISPREDICT trace.
    • The CURRENT_TIME value for the INST trace after the BRANCH_MISPREDICT trace.
    The branch instruction itself took 10,000 picoseconds, or one tick to complete. This is important, as it shows that the misprediction latency is added to the instruction after the mispredicted branch instruction, not to the branch instruction itself. The execution time is the difference between the CURRENT_TIME values for the INST traces corresponding to the branch instruction and the instruction before.

The rest of this tutorial uses these techniques to compare the different branch prediction algorithms.

Branch prediction example program

The example is designed to use various types of branch operations that can take place during the execution of a program.

These operations are:

  • A branch to skip a loop after a fixed number of iterations has completed.
  • A branch to skip a code sequence, depending on the value of a variable.
  • A branch to skip a code sequence, which can only be executed a limited number of times consecutively, if a previous branch was taken.
  • A branch for a condition that is always true if the conditions for two previous branches were true.
  • A branch for a condition that is always true if the conditions for two previous branches were false.

The code operation is trivial. It looks for acronyms of a set maximum length within a constant string, and loops over this operation a set number of times. The string is:

Timing annotation can be used with an SVP, Split Virtual Platform, or an EVS, Exported Virtual Subsystem.

The code prints the acronyms SVP and EVS during each search operation. The complete source is provided in the file src/ta_brpred/main.c. The following code snippet shows the branch operations of interest:

// A: loop not entered 1/LOOP_COUNT times
for(j = 0; j < LOOP_COUNT; j++) {
	printf("Starting iteration #%d\n", j);
	blockCount = 0;
	c = 0;
	resetOnly(&acronymLength, acronym);
	// B: loop not entered 1/length times
	for(i = 0; i < length; i++) {
		c = string[i];
		// C: condition true
		// (number_of_block_letters)/(total_characters_in_string) times
		if (c >= 'A' && c <= 'Z') {
			blockCount++;
			// D: condition true up to MAX_LENGTH times consecutively
			if (acronymLength < MAX_LENGTH) {
				acronym[acronymLength] = c;
			}
			// E: condition true up to MAX_LENGTH+1 times consecutively
			if (acronymLength <= MAX_LENGTH) {
				acronymLength++;
			}
		}
		else {
			// F: condition true if E was true then C was false
			if (acronymLength > 1 && acronymLength <= MAX_LENGTH) {
				printAndReset(&acronymLength, acronym);
			}
			// G: condition true if E was false then C was false
			else if (acronymLength != 0) {
				resetOnly(&acronymLength, acronym);
			}
		}
	}
}

The branch instructions that are assembled for the conditions A to G in this code snippet can be examined using the branch prediction statistics and trace sources described previously.

The conditions are described in the following table. The branch behavior column describes the relationship between the condition and the associated branch instruction.

Table 10-9 Branch behavior for each condition

Condition Description Compiled instruction Branch behavior
A Outer loop for processing string LOOP_COUNT times. Loop not entered 1/LOOP_COUNT times. B.NE 0x800005f4 at address 0x80000698. Backwards branch. Taken to start of loop if more iterations remain.
B Inner loop for iterating through characters in the string.

B.NE 0x80000618 at address 0x8000068c.

Backwards branch. Taken to start of loop if more iterations remain.
C Condition true if the character being processed is upper case.

B.HI 0x80000658 at address 0x8000062c.

Forwards branch. Taken if the condition is false. Skips code that handles upper case characters.
D Condition true up to MAX_LENGTH times consecutively.

B.GE 0x80000644 at address 0x80000634.

Forwards branch. Taken if the condition is false. Skips code that appends a letter to an acronym.
E Condition true up to MAX_LENGTH+1 times consecutively.

B.GT 0x80000684 at address 0x80000648.

Forwards branch. Taken if the condition is false. Skips code that increments the acronym length.
F Condition true if E was true, after which C was false.

B.HI 0x80000674 at address 0x80000660.

Forwards branch. Never taken if the condition was true, that is, branch E was not taken and then branch C was taken. Skips the code to print a completed acronym.
G Condition true if E was false, after which C was false.

CBZ w8,0x80000684 at address 0x80000674.

Forwards branch. Never taken if the condition was true, that is, branch E was taken then branch C was taken. Skips the code to clear the saved acronym.

LOOP_COUNT and MAX_LENGTH are defined using a preprocessor macro and can be configured. This tutorial assumes that LOOP_COUNT is 20 and MAX_LENGTH is 5, as defined in the pre-compiled binary.

Running the simulation

To generate trace and statistics for comparing the performance of the different branch predictors, run the simulation with the BranchPrediction plugin parameters shown here.

For example, to use the FixedDirectionPredictor, launch the model using the following command:

$PVLIB_HOME/examples/SystemCExport/EVS_Platforms/EVS_Base/Build_Cortex-A73x1/EVS_Base_Cortex-A73x1.x \
-C Base.bp.secure_memory=0 \
-C Base.cache_state_modelled=1 \
-C Base.cluster0.icache-prefetch_enabled=1 \
--plugin=$PVLIB_HOME/plugins/Linux64_GCC-4.8/BranchPrediction.so \
-C BranchPrediction.BranchPrediction.predictor-type=FixedDirectionPredictor \
-C BranchPrediction.BranchPrediction.mispredict-latency=11 \
-C BranchPrediction.BranchPrediction.bpstat-pathfilename=stats.txt \
--plugin=$PVLIB_HOME/plugins/Linux64_GCC-4.8/GenericTrace.so \
-C TRACE.GenericTrace.trace-sources=INST,BRANCH_MISPREDICT,WAYPOINT \
-C TRACE.GenericTrace.trace-file=trace.txt \
-a $PVLIB_HOME/images/ta_brpred.axf \
--stat

The program prints the following output to the terminal:

Looking for acronyms of maximum length 5 in the string:
Timing annotation can be used with an SVP, Split Virtual Platform, or an EVS, Exported Virtual Subsystem.

Starting iteration #0
SVP
EVS
…
Starting iteration #19
SVP
EVS

Info: /OSCI/SystemC: Simulation stopped by user.

--- Base statistics: ----------------------------------------------------------
Simulated time                          : 0.002275s
User time                               : 0.343203s
System time                             : 0.202801s
Wall time                               : 0.642064s
Performance index                       : 0.00
Base.cluster0.cpu0                      : 0.31 MIPS (      171308 Inst)
-------------------------------------------------------------------------------

The end of simulation statistics, the branch prediction statistics file stats.txt, and the MTI trace file trace.txt, that are generated for each branch predictor type can now be used for analysis.

Comparison of branch predictor types

Statistics about the accuracy of the different branch predictors for the various types of branch instructions can now be compared.

These statistics are shown in the following table:

Table 10-10 Comparison of branch predictor accuracy

  Branch instruction
Branch predictor Statistic A B C D E F G
All Calls 20 2100 2100 260 260 1840 1800
TAKEN 19 2080 1840 0 0 1800 1800
NOT_TAKEN 1 20 260 260 260 40 0
FixedDirectionPredictor Mispredictions 1 20 260 260 260 40 0
Mispredicted as TAKEN 1 20 280 260 260 40 0
Mispredicted as NOT_TAKEN 0 0 0 0 0 0 0
Accuracy (%)

95

99

88

0 0

98

100

BiModalPredictor Mispredictions 1 20 341 1 1 40 0
Mispredicted as TAKEN 1 20 220 1 1 40 0
Mispredicted as NOT_TAKEN 0 0 121 0 0 0 0
Accuracy (%)

95

99

84

100

100

98

100

GSharePredictor Mispredictions 1 20 279 241 241 40 0
Mispredicted as TAKEN 1 20 260 241 241 40 0
Mispredicted as NOT_TAKEN 0 0 19 0 0 0 0
Accuracy (%)

95

99

87 7 7

98

100

CortexA53Predictor Mispredictions 1 23 324 2 1 49 0
Mispredicted as TAKEN 1 20 221 2 1 40 0
Mispredicted as NOT_TAKEN 0 3 103 0 0 9 0
Accuracy (%)

95

99

85 99

100

97

100

The accuracy figures have been rounded to the nearest percentage. For each branch instruction type, A to G, the entry for the best accuracy is shown in gray. As expected, different branch prediction algorithms are better suited to different types of branch instructions.

With the FixedDirectionPredictor, all branches are predicted as TAKEN, so the accuracy is equal to the percentage of calls to that branch that were TAKEN.

With the BiModalPredictor and GSharePredictor algorithms, only the random branch C was mispredicted both as TAKEN and NOT_TAKEN. With the other systematic branches, the misprediction was always in one direction. The result is different for the more complex algorithm of the CortexA53Predictor, which has mispredictions in both directions for systematic branches as well.

The BiModalPredictor is able to store the history of individual branches, and is therefore most accurate with predicting branches with a deterministic ratio between the number of times they are TAKEN and NOT_TAKEN. This accuracy can be seen with branches A, B, D, and E. With a more random branch, such as C, which depends entirely on the contents of a user-defined string, relying on the history of the branch proves ineffective.

Interestingly, the GSharePredictor appears to be highly inaccurate at predicting branches D and E. These branches are NOT_TAKEN a fixed number of times consecutively. However, since there are calls to many other branches between consecutive calls to these branches, the GSharePredictor’s global history is not able to use the specific outcome of these branches to update their prediction values effectively.

Overall, the BiModalPredictor and the CortexA53Predictor have predicted these branch instructions most accurately, as shown in the following table:

Table 10-11 Overall branch predictor accuracy

Predictor type Overall accuracy (%)
FixedDirectionPredictor 86
BiModalPredictor 98
GSharePredictor 86
CortexA53Predictor 98

Impact of branch misprediction on simulation time

You can directly observe the impact of mispredictions on the overall simulation time, as shown in the --stat output after the model exits.

The simulated execution times with the different branch predictors are shown in the following table.

Note:

The execution times also include the impact of branch mispredictions that occur in other parts of the code, as well as in the startup and shutdown sequences.

Table 10-12 Overall simulation time for each predictor type

Predictor type Simulation time with mispredict-latency=11 Simulation time with mispredict-latency=0
FixedDirectionPredictor 0.002275s 0.001713s
BiModalPredictor 0.001805s 0.001713s
GSharePredictor 0.002289s 0.001713s
CortexA53Predictor 0.001806s 0.001713s
Non-ConfidentialPDF file icon PDF versionARM 100965_1101_00_en
Copyright © 2014–2017 ARM Limited or its affiliates. All rights reserved.