7.2.1 Protection method

The methods used by the processor to detect, correct, and report RAM errors.

Detecting errors

The Cortex®‑R8 processor uses ECC to detect RAM data errors. ECC can also correct a single-bit error in a chunk, where a chunk is typically one or two words of data. However, ECC cannot correct errors that affect two or more bits in the same chunk.
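The difference between a correctable single-bit error and an uncorrectable multiple-bit error can be illustrated with a small single-error-correct, double-error-detect code. The sketch below uses a textbook extended Hamming(8,4) code over a 4-bit value. This is an assumption chosen for brevity: this section does not specify the code or chunk layout that the processor uses, and the function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1         # makes the parity of all 8 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall = sum(code) & 1             # 1 means an odd number of bits flipped
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        fixed[syndrome] ^= 1            # single-bit error at position `syndrome`
        status = "corrected"
    else:
        return "uncorrectable", None    # two bits flipped: detected, not corrected
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble
```

Flipping any one bit of `encode(n)` decodes back to `n` with status `corrected`, while flipping any two bits is reported as `uncorrectable` rather than silently returning wrong data.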

Correcting errors

The Cortex®‑R8 processor implements RAM error correction using a clean and invalidate, and retry mechanism for caches, and a correct, writeback, and retry mechanism for TCMs.

When a correctable error is detected, the corresponding index/way is cleaned and invalidated. When the clean and invalidate operation is completed, the requester retries its access.

Note:

The detection of multiple-bit errors is not synchronous. Therefore, when such an error is reported, the corrupted data might not have been contained. Contact ARM for more details about system containment of multiple-bit ECC errors.

Instruction cache
On the instruction side, lines are always clean, so invalidating the line is sufficient. The retried access then fetches the correct value from the upper-level memory.
Data cache and TCMs
On the data side, the cache line can be dirty. The correction of the read contents is done as part of the clean and invalidate operation for caches. This takes place in the eviction buffer and in the cache coherency block. For TCMs, correction of the read contents is done with a correct and writeback operation.
SCU
An error detected in the SCU duplicate of the tags of a core causes a clean and invalidate in the corresponding core tag RAM. When the clean and invalidate completes, the line in the SCU tag RAM is marked as unusable.
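The clean and invalidate, and retry sequence for caches can be pictured with a toy model. This is a minimal sketch and not a description of the hardware: the `ToyCache` class, its methods, and the `error` flag that stands in for an ECC-detected single-bit error are all hypothetical, and the stored `data` field represents the value after ECC correction.

```python
class ToyCache:
    """Toy write-back cache in front of a dict that models upper-level memory."""

    def __init__(self, memory):
        self.memory = memory            # address -> value
        self.lines = {}                 # address -> {"data", "dirty", "error"}

    def _fill(self, addr):
        self.lines[addr] = {"data": self.memory[addr], "dirty": False, "error": False}

    def write(self, addr, value):
        if addr not in self.lines:
            self._fill(addr)
        self.lines[addr]["data"] = value
        self.lines[addr]["dirty"] = True

    def inject_single_bit_error(self, addr):
        self.lines[addr]["error"] = True    # stands in for a flipped RAM bit

    def read(self, addr):
        """Return (value, number of retries needed)."""
        retries = 0
        while True:
            if addr not in self.lines:
                self._fill(addr)            # miss: refill from upper-level memory
            line = self.lines[addr]
            if not line["error"]:
                return line["data"], retries
            if line["dirty"]:
                # Clean: the corrected contents are written back during eviction.
                self.memory[addr] = line["data"]
            del self.lines[addr]            # invalidate the line
            retries += 1                    # the requester retries its access
```

For a clean line, such as any instruction cache line, the model only invalidates and refetches. For a dirty line, the corrected value reaches memory before the retry, so the retried read returns the up-to-date data.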

Handling permanent errors

Bank registers are used to mask faulty RAM locations if a hard error occurs. If a processor error occurs, the line is cleaned and invalidated and the ECC error bank prevents any future allocation. For an SCU error, the line is marked as unusable by the SCU error bank but the processor still sees the line as usable.

Permanent errors are handled as follows:

General behavior

If hard, or permanent, errors occur on the RAMs, the clean and invalidate, and retry scheme might cause a deadlock in which the access is continuously replayed. To prevent this, error bank registers are provided to mask the faulty locations as unusable and invalid. When an error is detected, the location is pushed into the error bank, which masks the corresponding valid bit of the location both when reading and when allocating a new line. The line is therefore no longer used unless the entry is reset by a CP15 access. There is a short period during which the line is still seen by the system, but is removed from the allocation pool.
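The effect of masking can be sketched with a toy allocator. The model is an illustration under stated assumptions: the `ErrorBank` class, the `access` function, the first-free-way allocation policy, and the replay limit are all hypothetical, and `reset` stands in for the CP15 access that clears an entry.

```python
class ErrorBank:
    """Hypothetical bank that masks the valid bit of faulty ways."""

    def __init__(self):
        self.entries = set()

    def push(self, way):
        self.entries.add(way)

    def masks(self, way):
        return way in self.entries

    def reset(self, way):
        self.entries.discard(way)       # models clearing the entry by software


def access(faulty_ways, bank, num_ways=4, max_replays=16):
    """Return the number of replays needed, or None if the access never completes."""
    for replay in range(max_replays):
        # Allocate the first way that the bank does not mask.
        way = next(
            (w for w in range(num_ways) if bank is None or not bank.masks(w)),
            None,
        )
        if way is None:
            return None                 # every way is masked
        if way not in faulty_ways:
            return replay               # access completes
        # Error detected: clean and invalidate, then record the location.
        if bank is not None:
            bank.push(way)
    return None                         # still replaying: models the deadlock
```

With a hard fault in way 0 and no bank, `access({0}, None)` never completes because every replay reallocates the same faulty way. With a bank, the faulty way is masked after the first attempt and the replay succeeds in another way.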

The depth of the error bank determines how many errors can be supported by the system. When this limit is reached, the system might deadlock. The processor provides a special ECC event that indicates the number of corrupted locations, so that the error bank status can be monitored before the bank becomes full, a condition that can cause a deadlock. This information is reported on several pins that signal the usage of the error bank, that is, they show whether the error bank is empty or at least one error has been encountered.
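The relationship between bank depth and the reported usage can be sketched as follows. The class name, the `status` fields, and the depth value are hypothetical: this section does not name the pins or the event, so the sketch only models the idea of watching occupancy before the bank fills.

```python
class ErrorBankStatus:
    """Hypothetical occupancy tracker for a fixed-depth error bank."""

    def __init__(self, depth):
        self.depth = depth
        self.locations = set()

    def record(self, location):
        """Mask a faulty location; return False if the bank is already full."""
        if location in self.locations:
            return True                 # already masked, no new entry needed
        if len(self.locations) >= self.depth:
            return False                # no room: this fault cannot be masked
        self.locations.add(location)
        return True

    def status(self):
        """Summarize usage in the spirit of the bank usage signals."""
        used = len(self.locations)
        return {"empty": used == 0, "used": used, "full": used >= self.depth}
```

A monitor that polls `status()` can act while `used` is still below `depth`, for example by scheduling maintenance, instead of waiting for `record` to fail on a further fault.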

The Cortex®‑R8 processor is robust to hard errors, but might require software intervention. When a single-bit error occurs in TCM, the corrected data is written to the error bank and then written back to TCM. The access is then replayed using the error bank data.

If a second single-bit error occurs in the TCM, the error bank is not written to, but corrected data is still written back to the TCM. This allows errors to be isolated. Isolating a hard error prevents its RAM location becoming a double-bit error that is not correctable.

Because subsequent errors do not overwrite the error bank, the replayed access uses the corrected data from the TCM. However, for soft or hard errors:

  • If the data written back to TCM has a correctable error, then the data is corrected and used. In this case, no software intervention is required.
  • If the data written back to TCM has a hard error, then the data is not corrected and results in permanently corrupted data, causing livelock. In this case, the RAMERR pin goes HIGH continuously.
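The TCM behavior described above can be sketched as a toy model. It rests on several assumptions that this section does not confirm: the error bank is modeled with a single entry, a writeback is assumed to repair a soft error but not a stuck bit, livelock is modeled as exceeding a replay limit, and `clear_bank` is a hypothetical stand-in for software intervention. All names are illustrative.

```python
class ToyTcm:
    """Toy TCM with a one-entry error bank (the depth is an assumption)."""

    def __init__(self, contents, hard_faults=()):
        self.good = dict(contents)      # address -> intended value
        self.soft = set()               # addresses with a transient bit flip
        self.hard = set(hard_faults)    # addresses with a stuck bit
        self.bank = None                # (address, corrected value) or None

    def inject_soft_error(self, addr):
        self.soft.add(addr)

    def clear_bank(self):
        self.bank = None                # hypothetical software intervention

    def read(self, addr, max_replays=8):
        """Return (value, replays), or (None, max_replays) to model livelock."""
        for replays in range(max_replays):
            if self.bank is not None and self.bank[0] == addr:
                return self.bank[1], replays    # served from the error bank
            if addr not in self.soft and addr not in self.hard:
                return self.good[addr], replays # clean read from the TCM
            # Single-bit error: the corrected value is written back, which
            # repairs a soft error but leaves a stuck bit in place.
            self.soft.discard(addr)
            if self.bank is None:
                self.bank = (addr, self.good[addr])  # first error is isolated
        return None, max_replays
```

In this model a first hard error is served from the error bank, a later soft error at another address is repaired by the writeback, and a second hard error replays until the limit is reached, which corresponds to the livelock case in which RAMERR stays HIGH.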

Note:

If you do not apply any software intervention, a soft error can behave as a hard error, and if a second hard error then occurs, it might cause livelock. If the system is not required to be robust to hard errors, software intervention is not required.

Interaction between SCU and cores

For a core error, the line is cleaned and invalidated and the ECC error bank prevents any future allocation in this way. However, the line is still seen as present by the SCU, and the SCU requests the line from the core, which misses or hits depending on whether the line has been reallocated in another cache location.

For an SCU error, the line is marked as unusable by the SCU error bank but the core still sees the line as usable. Therefore, a core can request an access to this way to allocate the cache line, but the write fails in the SCU without being reported. Because of this, the error seen by the SCU is sent back to the core, and stored in the core data error bank.

Reporting errors

The Cortex®‑R8 processor reports the detection of any error through primary output events and by updating performance and statistics counters.

Non-Confidential ARM 100400_0001_03_en
Copyright © 2015–2017 ARM Limited or its affiliates. All rights reserved.