ARM Technical Support Knowledge Articles
Applies to: Software Development Toolkit (SDT)
Executing instructions & counting cycles
Consider the instruction: LDR r0, [r1]
For an uncached processor (such as the ARM7TDMI) operating from perfect memory, the number of cycles to execute a particular instruction is predictable.
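For a cached core, however, this is not so: many factors can affect the time an instruction takes to execute. For example, it matters whether the instruction and its data are already in the cache, and whether the address translation is already held in the TLB.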
The situation is made more complex because ARM's cached cores also support streaming - whereby during cache line fills, information is made available to the core at the same time as it is written into the cache.
Under the ADS ARMulator, it is possible to examine cache, TLB (Translation Look-aside Buffer) and write buffer information by enabling verbose statistics for the relevant ARMulator memory model. For non-ARM9 cores, this is done by editing the armul.cnf file contained in the ADS\Bin directory and setting the variable Counters=True. It is recommended that you take a copy of this file before making any modifications.
For ARM9 cores, enabling verbose statistics requires an extra line to be added to the relevant memory model definitions. For example:
{ ARM920CacheMMU
Counters=True ;; add this line
(this is only possible with ADS, not SDT).
Note that cached processor models have their caches enabled by default, with the lower 128MB being cacheable. Memory speeds can be set as a fraction of the core speeds using the armul.cnf variable MCCFG. For example, setting this variable to 3 will set the memory clock to be one third that of the core clock.
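The MCCFG arithmetic above can be sketched as a short calculation. This is only an illustration of the stated relationship (memory clock = core clock / MCCFG). The function name and the 60 MHz core clock are assumptions made for the example, not ARMulator settings.

```python
def memory_clock_hz(core_clock_hz: float, mccfg: int) -> float:
    """Return the memory clock implied by an MCCFG divisor.

    Assumes the relationship described in the text: the memory clock
    is the core clock divided by MCCFG.
    """
    if mccfg < 1:
        raise ValueError("MCCFG must be a positive integer")
    return core_clock_hz / mccfg


# Hypothetical 60 MHz core clock with MCCFG=3 gives a 20 MHz memory clock.
print(memory_clock_hz(60e6, 3))
```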
Certain ARMulator models such as the ARM920T operate by default in Fastbus mode and require a coprocessor write to switch to synchronous mode. This can be achieved by setting bit 30 (and clearing bit 31) in coprocessor 15 register 1. For example:
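MRC p15, 0, r0, c1, c0, 0    ; read CP15 register 1 into r0
BIC r0, r0, #0xc0000000      ; clear the clock mode bits (31 and 30)
ORR r0, r0, #0x40000000      ; set bit 30 to select synchronous mode
MCR p15, 0, r0, c1, c0, 0    ; write the result back to CP15 register 1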
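To check the effect of the BIC/ORR pair on the register value, the same read-modify-write can be modelled on a plain 32-bit integer. This is a sketch only: it does not touch any hardware, and the function name and the sample register contents (0x78) are hypothetical.

```python
CLOCK_MODE_MASK = 0xC0000000  # bits 31 and 30 of CP15 register 1
SYNCHRONOUS = 0x40000000      # bit 30 set, bit 31 clear


def select_synchronous(control: int) -> int:
    """Model the BIC/ORR sequence on a 32-bit control register value."""
    control &= ~CLOCK_MODE_MASK & 0xFFFFFFFF  # BIC r0, r0, #0xc0000000
    control |= SYNCHRONOUS                    # ORR r0, r0, #0x40000000
    return control


# Whatever the previous clock mode bits were, only bit 30 ends up set,
# and the lower bits of the register are preserved.
print(hex(select_synchronous(0x00000078)))
print(hex(select_synchronous(0xC0000078)))
```

Clearing both bits before setting bit 30 makes the result independent of the clock mode the register previously selected.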
Why does the ARMulator show zero N-cycles?
When ARMulating cores with MMUs and AMBA interfaces (e.g. 740T, 920T, 940T), you will not see any N-cycles in $statistics or $memstats, even if your code contains branch instructions. The only cycle counts shown by the ARMulator for these cores are the two AMBA cycle types, A-cycles and S-cycles.
ARMulator will always (correctly) show N-cycles=0 in its $statistics for these cores, because a non-sequential access is done with an A-cycle followed by an S-cycle ('merged I-S' cycle). Please refer to the Data Sheets for the AMBA interface description of cycle types for each core.
A-cycles are shown in $statistics under the heading 'I_Cycle' to correspond with the ARM7TDMI cycle labelling.
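The accounting described above can be sketched as a small counter. It assumes that each non-sequential access costs one A-cycle followed by one S-cycle (as stated above) and that each sequential access costs a single S-cycle. The function name and dictionary keys are hypothetical labels for this example, not the exact ARMulator output format.

```python
def amba_cycle_counts(accesses):
    """Count cycles for a series of accesses.

    accesses: iterable of booleans, True for a sequential access,
    False for a non-sequential access.
    """
    a_cycles = 0
    s_cycles = 0
    for sequential in accesses:
        if not sequential:
            a_cycles += 1  # A-cycle that opens a non-sequential access
        s_cycles += 1      # every access completes with an S-cycle
    # N-cycles stay at zero; A-cycles are reported under the I_Cycle heading.
    return {"N_Cycles": 0, "S_Cycles": s_cycles, "I_Cycles": a_cycles}


# A branch target (non-sequential), two sequential fetches, then another branch.
print(amba_cycle_counts([False, True, True, False]))
```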
Why does the 940T seem slow when its cache is disabled?
When the ARM940T cache is disabled, each instruction (a single read which misses in the cache) will typically cost 4 I cycles (depending upon clock mode), followed by an S cycle. All the following steps are required in the worst case.
You can enable/disable the ARM940T cache/PU with:
UsePageTables=True in armul.cnf to enable the cache/PU
UsePageTables=False in armul.cnf to disable the cache/PU
An important point to note is that for small sequential code examples, where the cache is empty/unused, any cached processor (not just ARM940T) will perform worse than one with no cache. Cached processors will only show performance benefits with code that contains loops and with memory that requires wait states.
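The point about loops and wait states can be illustrated with a deliberately simple cycle model. Everything in it is an assumption made for the sketch: one instruction per word, one cycle per cache hit, a 4-word cache line, a fixed 2-cycle overhead per line fill, and code that fits entirely in the cache. It ignores streaming and is not a model of any particular ARM core.

```python
def uncached_cycles(lines, passes, wait_states, words_per_line=4):
    """Every fetch goes to memory and costs 1 + wait_states cycles."""
    return passes * lines * words_per_line * (1 + wait_states)


def cached_cycles(lines, passes, wait_states, words_per_line=4, fill_overhead=2):
    """First pass misses on every line; later passes hit in the cache."""
    fill = fill_overhead + words_per_line * (1 + wait_states)
    first_pass = lines * fill
    later_passes = (passes - 1) * lines * words_per_line  # one cycle per hit
    return first_pass + later_passes


# Straight-line code, executed once, zero wait states: the cache only adds overhead.
print(uncached_cycles(8, 1, 0), cached_cycles(8, 1, 0))

# A loop executed 100 times from memory with 2 wait states: the cache pays off.
print(uncached_cycles(8, 100, 2), cached_cycles(8, 100, 2))
```

Under these assumptions the cached model is slower whenever the code runs only once, and it pulls ahead only when both repetition and wait states are present.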
Article last edited on: 2008-09-09 15:47:30