
FAQ: STM32H7 - How to read RAMECC_MxFAR failing address

Christophe VRIGNAUD
ST Employee

How to find the RAM ECC failing address?

  • How do you access the right RAMECC monitor to get the RAM failing address when an ECC error occurs?
  • How do you translate the content of the failing address register into the actual RAM address (e.g. FADD = 0x2004)?

Note: For simplicity and clarity, the description is based on the STM32H74x/5x series. The description is similar for the other STM32H7 series and can be found in the corresponding reference manual.
 

1. RAMECC controllers

The number of controllers depends on the STM32H7 series.
 

  • The STM32H74x/5x and STM32H72x/3x have one RAMECC controller per power domain.
    • RAMECC controller for Domain 1 (D1) = RAMECC1

    • RAMECC controller for Domain 2 (D2) = RAMECC2

    • RAMECC controller for Domain 3 (D3) = RAMECC3

  • The STM32H7Ax/Bx has only one RAMECC controller.
    • RAMECC controller for the CPU Domain (CD) = RAMECC

2. RAMECC monitors

The list of monitors varies depending on the STM32H7 series.
The STM32H74x/5x has one monitor for each RAM block in the MCU.

  • RAMECC1 controller has 5 monitors for D1
  • RAMECC2 controller has 5 monitors for D2
  • RAMECC3 controller has 2 monitors for D3

The details are given in table 11 of the reference manual.

 

ChristopheVRIGNAUD_0-1694682627727.png

 

 

3. RAMECC registers address

From table 8 of the reference manual:

 
Boundary address             Peripheral
0x58027000 - 0x580273FF      RAMECC3
0x52009000 - 0x520093FF      RAMECC1
0x48023000 - 0x480233FF      RAMECC2


 

4. RAMECC exception vectors

From table 143 of the reference manual:

 
Signal       Priority   NVIC position   Acronym   Description                                        Address offset
ramecc1_it   152        145             RAMECC1   ECC diagnostic global interrupt for RAMECC D1      0x0000 0284
ramecc2_it                              RAMECC2   ECC diagnostic global interrupt for RAMECC D2
ramecc3_it                              RAMECC3   ECC diagnostic global interrupt for RAMECC D3
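
The address offset in the table follows from the NVIC position: on Arm Cortex-M cores the vector table begins with 16 system entries (initial stack pointer plus 15 exceptions), and each entry is 4 bytes wide. A minimal sketch of that arithmetic is shown here; the helper name irq_vector_offset is made up for illustration and is not part of any ST or CMSIS header.

```c
#include <stdint.h>

/* Cortex-M vector table: 16 system entries come before the first
 * external interrupt, and every entry is one 32-bit word. */
#define CORTEXM_SYSTEM_VECTORS 16u

/* Hypothetical helper: byte offset of an interrupt's vector table entry. */
static inline uint32_t irq_vector_offset(uint32_t nvic_position)
{
    return (CORTEXM_SYSTEM_VECTORS + nvic_position) * 4u;
}
```

For NVIC position 145 this gives (16 + 145) * 4 = 0x284, the offset listed for the RAMECC interrupt.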

 

5. RAMECC registers - e.g. RAMECC1

For RAMECC1, the boundary addresses are 0x52009000 - 0x520093FF (table 8 of the reference manual).
The addresses of the registers for each monitor are as follows:

  • RAMECC_IER      --  interrupt enable
=> Address offset: 0x00
=>  0x52009000
  • RAMECC_MxCR  --  configuration
=> Address offset: 0x20 * x
=> x = ECC monitoring unit number
=> 0x52009000 + 0x20 * x                  with x = [1..5]
 
=> 0x52009020   Monitor 1 -  AXI SRAM ECC monitoring unit                     512 Kbytes
=> 0x52009040   Monitor 2 -  ITCM-RAM ECC monitoring unit                      64 Kbytes
=> 0x52009060   Monitor 3 -  DTCM-RAM ECC monitoring unit for D0TCM            64 Kbytes
=> 0x52009080   Monitor 4 -  DTCM-RAM ECC monitoring unit for D1TCM            64 Kbytes
=> 0x520090a0   Monitor 5 -  ETM RAM ECC monitoring unit                        4 Kbytes

 

  • RAMECC_MxSR  --  status
=> Address offset: 0x24 + 0x20 * (x - 1)
=> x = ECC monitoring unit number
 
=> 0x52009000 + 0x24 + 0x20 * (x-1)   with x = [1..5]
=> 0x52009024 + 0x20 * (x-1)               with x = [1..5]
 
=> 0x52009024   Monitor 1
=> 0x52009044   Monitor 2
=> 0x52009064   Monitor 3
=> 0x52009084   Monitor 4
=> 0x520090a4   Monitor 5
 
  • RAMECC_MxFAR  --  failing address
=> Address offset: 0x28 + 0x20 * (x-1)
=> x = ECC monitoring unit number
 
=> 0x52009000 + 0x28 + 0x20 * (x-1)   with x = [1..5]
=> 0x52009028 + 0x20 * (x-1)               with x = [1..5]
 
=> 0x52009028   Monitor 1
=> 0x52009048   Monitor 2
=> 0x52009068   Monitor 3
=> 0x52009088   Monitor 4
=> 0x520090a8   Monitor 5
 
  • RAMECC_MxFDRL  --  failing data low
=> Address offset: 0x2C + 0x20 * (x-1)
=> x = ECC monitoring unit number
 
=> 0x52009000 + 0x2c + 0x20 * (x-1)   with x = [1..5]
=> 0x5200902c + 0x20 * (x-1)               with x = [1..5]
 
=> 0x5200902c   Monitor 1
=> 0x5200904c   Monitor 2
=> 0x5200906c   Monitor 3
=> 0x5200908c   Monitor 4
=> 0x520090ac   Monitor 5
 
  • RAMECC_MxFDRH  --  failing data high
=> Address offset: 0x30 + 0x20 * (x-1)
=> x = ECC monitoring unit number
 
=> 0x52009000 + 0x30 + 0x20 * (x-1)    with x = [1..5]
=> 0x52009030 + 0x20 * (x-1)                with x = [1..5]
 
=> 0x52009030   Monitor 1
=> 0x52009050   Monitor 2
=> 0x52009070   Monitor 3
=> 0x52009090   Monitor 4
=> 0x520090b0   Monitor 5
 
  • RAMECC_MxFECR  --  failing ECC error code
=> Address offset: 0x34 + 0x20 * (x-1)
=> x = ECC monitoring unit number
 
=> 0x52009000 + 0x34 + 0x20 * (x-1)   with x = [1..5]
=> 0x52009034 + 0x20 * (x-1)               with x = [1..5]
 
=> 0x52009034   Monitor 1
=> 0x52009054   Monitor 2
=> 0x52009074   Monitor 3
=> 0x52009094   Monitor 4
=> 0x520090b4   Monitor 5
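
The per-register formulas above all reduce to one pattern: each monitor x owns a 0x20-byte slot starting at base + 0x20 * x, and the registers sit at fixed offsets inside that slot. The following is a minimal sketch of that calculation, not ST's driver code; the names RAMECC1_BASE_ADDR, ramecc_reg, and ramecc_reg_addr are hypothetical and chosen so they do not clash with the CMSIS device header.

```c
#include <stdint.h>

/* Start of the RAMECC1 boundary range (table 8). Hypothetical macro name. */
#define RAMECC1_BASE_ADDR 0x52009000u

/* Register offsets inside one monitor slot, derived from the list above:
 * 0x24 + 0x20 * (x - 1) is the same as 0x20 * x + 0x04, and so on. */
enum ramecc_reg {
    RAMECC_REG_CR   = 0x00, /* configuration          */
    RAMECC_REG_SR   = 0x04, /* status                 */
    RAMECC_REG_FAR  = 0x08, /* failing address        */
    RAMECC_REG_FDRL = 0x0C, /* failing data low       */
    RAMECC_REG_FDRH = 0x10, /* failing data high      */
    RAMECC_REG_FECR = 0x14  /* failing ECC error code */
};

/* Address of a register of monitor x (x starts at 1). */
static inline uint32_t ramecc_reg_addr(uint32_t ctrl_base, unsigned x,
                                       enum ramecc_reg reg)
{
    return ctrl_base + 0x20u * x + (uint32_t)reg;
}
```

For example, ramecc_reg_addr(RAMECC1_BASE_ADDR, 1, RAMECC_REG_FAR) evaluates to 0x52009028, matching the Monitor 1 entry in the failing address list.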

6. How to get the failing address from the content of RAMECC_MxFAR

The information given by the RAMECC_MxFAR register is described in the reference manual as the failing address (FADD) for the monitor x.
 

Bits 31:0   FADD[31:0]: ECC error failing address
 When an ECC error occurs the FADD bitfield contains the address that generated the ECC error.

In fact, the address in FADD[31:0] is relative to the start of the monitored RAM, and it is a word index, not a byte address. To calculate the actual RAM address, the following formula must be applied:
 
RAM address  =  RAM memory start address  +  FADD  *  word size in bytes
 

Example with FADD= 0x2004:

  • For 64-bit word size memory like AXI SRAM:

0x2400 0000 + 0x2004 * 8 = 0x2401 0020

  • For 32-bit word size memory like SRAM1:

0x3000 0000 + 0x2004 * 4 = 0x3000 8010
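
The formula and the two worked examples can be written as a short C helper. This is only a sketch of the arithmetic described above; the function name ramecc_fadd_to_addr is an assumption for illustration and does not exist in the HAL.

```c
#include <stdint.h>

/* Translate a FADD word index into a byte address.
 * ram_base   : start address of the monitored RAM
 * fadd       : value read from RAMECC_MxFAR
 * word_bytes : ECC word size in bytes (8 for 64-bit RAMs such as AXI SRAM,
 *              4 for 32-bit RAMs such as SRAM1) */
static inline uint32_t ramecc_fadd_to_addr(uint32_t ram_base, uint32_t fadd,
                                           uint32_t word_bytes)
{
    return ram_base + fadd * word_bytes;
}
```

With FADD = 0x2004, ramecc_fadd_to_addr(0x24000000u, 0x2004u, 8u) gives 0x24010020 and ramecc_fadd_to_addr(0x30000000u, 0x2004u, 4u) gives 0x30008010, as in the examples.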

7. Related links

RAM ECC monitoring is described in the reference manual  -  section 3  -  RAM ECC monitoring (RAMECC).

  • RM0433 STM32H742, STM32H743/753 and STM32H750 Value line advanced Arm®-based 32-bit MCUs
  • RM0468 STM32H723/733, STM32H725/735 and STM32H730 Value line advanced Arm®-based 32-bit MCUs
  • RM0455 STM32H7A3/7B3 and STM32H7B0 Value line advanced Arm®-based 32-bit MCUs

The error correction code (ECC) management and implementation on STM32 microcontrollers is described in:

  • AN5342 How to use error correction code (ECC) management for internal memories protection on STM32 MCUs

This article was originally published 2022-06-02.
Before republication the article had 271 views.

Comments
andy_long
Associate III

Hi @Christophe VRIGNAUD ,

  1. I use STM32H743 and I get M3FAR (RAMECC1) as 0x00003B22. From the datasheet, I understand that the area is DTCM-RAM and I believe the word size is 32-bit.  0x20000000 + 0x3B22 * 4 = 0x2000 EC88. Is this correct ?
  2. How is the RAMECC monitor x failing ECC error code register (RAMECC_MxFECR) useful?
  3. In an ideal scenario, what should the firmware do if it detects a double ECC error?
KORourke
Associate

@Christophe VRIGNAUD Application note AN5342 also mentions that "A special case is DTCM and other memories, which are interleaved" and goes on to explain how to calculate the failing address for those.

Are there other memories that are interleaved? I can't find any information about this in the STM32H743 reference manual (I have RM0433 rev 7).

theHolgi
Associate II

@Christophe VRIGNAUD 

That does not really work out. I got the following notification (on STM32H753AII):

RAMECC1_M1SR (0x52009044) = 3 (Single and double ECC error)
OK, so this is D1 monitor 2 which belongs to ITCM. Should be a 64-bit bus.

RAMECC1_M1FAR (0x52009044) = 0x3ffd
RAMECC1_M1FDRL (0x52009048) = 0x152f2fef
RAMECC1_M1FDRH (0x52009044) = 0xa9ebe8f8

ITCM base address is 0x00000000.
Failing address is supposed to be 0x00000000 + (0x3ffd * 8 ) = 0x01ffe8.
There is nothing mapped to this address; naively writing the data there results in HardFault.

Shall I mask the address register first to a few lower bits? (Since the maximum size is 64k, and scaling factor is 8, only the lower 13 bits could possibly be set for a valid address.)

Indeed, if I just mask the FAR content like described, I get 0x1ffd to work with which results in memory address 0x0FFE8. The content there in the Memory monitor matches the provided Data Register content.

Is it safe to just mask it like this? Are the FAR higher bits don't care or do they carry additional information?

 

theHolgi
Associate II

@KORourke 

Putting together RM0433 table 11 (ECC controller mapping) and table 7 (memory layout),
I came to the conclusion that DTCM, SRAM1 and SRAM2 are those meant as "interleaved", which means that despite looking like continuous RAM they are subdivided into 64 kB slices.

Rest of comment redacted: False speculation. See below.

theHolgi
Associate II

@KORourke Now I know what the "interleaved" means.

For DTCM, it's not that the two RAM blocks are put one after another, like it is the case for D2 SRAM.

They are interwoven, i.e. we have a 64-bit bus but accessing two 32-bit units of RAM (of 64 kB = 16 kWords each) simultaneously. So for address calculation you multiply the FAR content by 8, but being a 32-bit monitor, each unit will only provide you data in the FDRL register.

DTCM RAM address   RAMECC1_Monitor3 (0x52009060)   DTCM RAM address   RAMECC1_Monitor4 (0x52009080)
0x20000000         FAR == 0                        0x20000004         FAR == 0
0x20000008         FAR == 1                        0x2000000C         FAR == 1
0x20000010         FAR == 2                        0x20000014         FAR == 2
...                ...                             ...                ...

My complete RAM ECC handler now looks like this.
I added the backup SRAM for a complete example (i.e. I did not test it); adjust to your needs.

#include <stdbool.h>
#include <stdint.h>
#include "stm32h7xx_hal.h" /* RAMECC HAL driver and CMSIS device definitions */

#define NUMBER_OF_ECC_MONITORS 11

/* One entry per monitored RAM: monitor instance, RAM start address,
   and mask of the FAR bits that can address a word in that RAM. */
static struct
{
    RAMECC_MonitorTypeDef* monitor;
    uint32_t ram_base;
    uint16_t mask;
} const RAMECC_Instances[NUMBER_OF_ECC_MONITORS] = {
    {RAMECC1_Monitor1, D1_AXISRAM_BASE, 0xffff},           /** RAMECC1 M1 : AXI SRAM (512 kB) */
    {RAMECC1_Monitor2, D1_ITCMRAM_BASE, 0x1fff},           /** RAMECC1 M2 : ITCM-RAM (64 kB) */
    /* Each DTCM bank holds 16 kWords of 32 bits, so FAR needs 14 bits (0x3fff), not 13. */
    {RAMECC1_Monitor3, D1_DTCMRAM_BASE, 0x3fff},           /** RAMECC1 M3 : D0TCM-RAM (64 kB) */
    {RAMECC1_Monitor4, D1_DTCMRAM_BASE + 4, 0x3fff},       /** RAMECC1 M4 : D1TCM-RAM (64 kB) */
    {RAMECC2_Monitor1, D2_AHBSRAM_BASE + 0x00000, 0x3fff}, /** RAMECC2 M1 : SRAM1_0 (64 kB) */
    {RAMECC2_Monitor2, D2_AHBSRAM_BASE + 0x10000, 0x3fff}, /** RAMECC2 M2 : SRAM1_1 (64 kB) */
    {RAMECC2_Monitor3, D2_AHBSRAM_BASE + 0x20000, 0x3fff}, /** RAMECC2 M3 : SRAM2_0 (64 kB) */
    {RAMECC2_Monitor4, D2_AHBSRAM_BASE + 0x30000, 0x3fff}, /** RAMECC2 M4 : SRAM2_1 (64 kB) */
    {RAMECC2_Monitor5, D2_AHBSRAM_BASE + 0x40000, 0x1fff}, /** RAMECC2 M5 : SRAM3 (32 kB) */
    {RAMECC3_Monitor1, D3_SRAM_BASE, 0x3fff},              /** RAMECC3 M1 : SRAM4 (64 kB) */
    {RAMECC3_Monitor2, D3_BKPSRAM_BASE, 0x03ff}            /** RAMECC3 M2 : Backup SRAM (4 kB) */
};
RAMECC_HandleTypeDef hramecc[NUMBER_OF_ECC_MONITORS];
// Direct-access handles for the two DTCM monitors
#define hramecc_d0 (&hramecc[2])
#define hramecc_d1 (&hramecc[3])

/**
 * @brief RAMECC Initialization Function
 */
void MX_RAMECC_Init(void)
{
    for (int n = 0; n < NUMBER_OF_ECC_MONITORS; ++n)
    {
        hramecc[n].Instance = RAMECC_Instances[n].monitor;
        if (HAL_RAMECC_Init(&hramecc[n]) != HAL_OK)
        {
            Error_Handler();
        }
        // Be notified about all errors
        HAL_RAMECC_StartMonitor(&hramecc[n]);
        HAL_RAMECC_EnableNotification(&hramecc[n], RAMECC_IT_MONITOR_ALL);
        HAL_RAMECC_EnableNotification(&hramecc[n], RAMECC_IT_GLOBAL_ALL);
    }
    HAL_NVIC_SetPriority(ECC_IRQn, 1, 0);
    HAL_NVIC_EnableIRQ(ECC_IRQn);
}

/* ECC interrupt handler */
void __USED ECC_IRQHandler(void)
{
    RAMECC_HandleTypeDef* handle;
    for (int n = 0; n < NUMBER_OF_ECC_MONITORS; ++n)
    {
        handle = &hramecc[n];
        if (handle->Instance->SR != 0) // Find out which monitor fired
        {
            // All RAMECC1 busses are 64 bit; all others are 32 bit
            bool is_64bit = ((uint32_t)handle->Instance & ~0xffUL) == RAMECC1_BASE;
            // FAR is a word index: mask it, scale it to bytes, add the RAM start address
            uintptr_t address = HAL_RAMECC_GetFailingAddress(handle) & RAMECC_Instances[n].mask;
            address = RAMECC_Instances[n].ram_base + (address << (is_64bit ? 3 : 2));
            if (HAL_RAMECC_IsECCDoubleErrorDetected(handle))
            {
                // Handle Non-correctable ECC Error; e.g. reset
            }
            if ((handle == hramecc_d0) || (handle == hramecc_d1))
            {
                // Making an incomplete read on the failed RAM makes the monitor freak out completely.
                if ((hramecc_d0->Instance->SR != 0) && (hramecc_d1->Instance->SR != 0)
                    && (hramecc_d0->Instance->FAR == hramecc_d1->Instance->FAR))
                { // Problem with both RAMs; do a single 64-bit write.
                    uint32_t content_l = HAL_RAMECC_GetFailingDataLow(hramecc_d0);
                    uint32_t content_h = HAL_RAMECC_GetFailingDataLow(hramecc_d1);
                    *(volatile uint64_t*)(address & ~(uintptr_t)0x07) = ((uint64_t)content_h << 32) | content_l;
                } else {
                    // Problem only with one; do a (incomplete) 32-bit write.
                    uint32_t content = HAL_RAMECC_GetFailingDataLow(handle);
                    *(volatile uint32_t*)address = content;
                }
                __DSB();
                // Always clear both DTCM monitors; the write may have raised new flags
                __HAL_RAMECC_CLEAR_FLAG(hramecc_d0, RAMECC_FLAGS_ALL);
                __HAL_RAMECC_CLEAR_FLAG(hramecc_d1, RAMECC_FLAGS_ALL);
            } else {
                if (is_64bit)
                {
                    uint64_t content = ((uint64_t)HAL_RAMECC_GetFailingDataHigh(handle) << 32) | HAL_RAMECC_GetFailingDataLow(handle);
                    *(volatile uint64_t*)address = content;
                } else {
                    uint32_t content = HAL_RAMECC_GetFailingDataLow(handle);
                    *(volatile uint32_t*)address = content;
                }
                __DSB();
                /* Clear all flags */
                __HAL_RAMECC_CLEAR_FLAG(handle, RAMECC_FLAGS_ALL);
            }
        }
    }
}

Notes: Since the bus is 64 bits wide, a 32-bit write to DTCM is incomplete, so the hardware reads the other 32-bit half from the other RAM. This is where it gets messy.
When I treat both sides independently with 32-bit memory writes, a successful handling of two simultaneously occurring failures (reading uninitialized RAM) is always followed by a false error report at address 0 with data 0.
When I try to assemble a 64-bit write by combining the FDRL register with a 32-bit RAM read of the other part (if it is not corrupted), the successful handling of a single fault is followed by a false double-fault report with one data word correct and the other 0.
The code above does effectively the same thing: a 32-bit write to only the corrupted part has to be combined with a read of the other RAM, and it is important to always reset both monitors after the write since it might raise new flags. Still, it works for me. It might hide failures (a double-bit failure in M4 does not reach the potential reset handler when it is handled together with a 1-bit failure in M3), but I find that case esoteric, and at least the code does not produce false flags that lead to memory nuking.

Any version would probably do when handling just randomly occurring bit flips. But if I can successfully sift through uninitialized RAM, it makes a robust impression.

I didn't use the HAL interrupt handler on purpose; it clears the flags before calling the user handler and I would have to find the instance of my structure twice.

In case the described problems are a hardware bug: My device is an STM32H753AII rev. V (DBGMCU_IDC = 0x20046450)

KORourke
Associate

@theHolgi sorry I couldn't reply earlier, I was ill last week. I ended up working out the "interwoven" nature of DTCM RAMECC and ended up with code similar to yours (although in the form of a Zephyr device driver).

I was never able to test single error handling properly since I only ever saw double errors from reading uninitialised RAM. So I never saw the interesting problems you've found with 32-bit accesses to DTCM.

Christophe VRIGNAUD
ST Employee

Hi @andy_long 

1. DTCM-RAM is a specific RAM.
The reference manual (RM0433 for the STM32H743) explains that the DTCM bus is 2x32-bit and why it is so:
"The 2x32-bit DTCM bus allows load/load and load/store instruction pairs to be dual-issued on the DTCM memory."
The DTCM-RAM content is interleaved in D0TCM and D1TCM.
The AN5342 - section 3.1.3 Interpreting FAR - explains how to calculate the failing address in this case: the word size is 8 bytes. The 64 bits which are read at a given address are composed of 32 bits from D0TCM followed by 32 bits from D1TCM. Each 32-bit data has its own ECC.

In your case, depending on which of monitor 3 (D0TCM) or monitor 4 (D1TCM) has signalled the error, the 32-bit failing address is:
- for D0TCM: 0x2000 0000 + 0x3B22 * 8 = 0x2001 D910
- for D1TCM: 0x2000 0004 + 0x3B22 * 8 = 0x2001 D914
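
The interleaved DTCM calculation can be sketched in C as follows. FADD is scaled by 8 bytes because each index covers one 32-bit word in D0TCM followed by one in D1TCM, and the bank selects the lower or upper half of that 64-bit unit. The names DTCM_BASE_ADDR and dtcm_fadd_to_addr are hypothetical and used only for illustration.

```c
#include <stdint.h>

/* DTCM-RAM start address on the STM32H743. Hypothetical macro name. */
#define DTCM_BASE_ADDR 0x20000000u

/* Translate a FADD value reported by a DTCM monitor into a byte address.
 * bank : 0 for D0TCM (RAMECC1 monitor 3), 1 for D1TCM (RAMECC1 monitor 4)
 * fadd : value read from the monitor's FAR register */
static inline uint32_t dtcm_fadd_to_addr(unsigned bank, uint32_t fadd)
{
    return DTCM_BASE_ADDR + fadd * 8u + 4u * bank;
}
```

With FADD = 0x3B22, bank 0 yields 0x2001D910 and bank 1 yields 0x2001D914, the two addresses given above.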


2. The ECC error code given in RAMECC_MxFECR is based on SECDED algorithm-Hamming code. This register can help to determine the error position, in the 32-bit data in the DTCM-RAM case.
For more details regarding position detection, please refer to the Hamming code standard.

3. AN5342 - section 3.1.2 ECC ISR - gives a possible scenario which consists in rewriting the failing address data.

In case of double bit error, no correction is possible, the data is lost. It depends on your application if the data is recoverable from acquiring it again, calculating it again or getting it from a second storage area.
For example, if it is sensitive data in a safety application, you would store the same data or the complemented data in a separate area.

In case of single bit error, the error is only corrected on the data which has been read in the memory. But it remains in the original location. It is recommended to re-write the faulty location in order to remove the single bit error and to avoid the risk of a double bit error.
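
The re-write recommended for single-bit errors amounts to reading the word and storing the same value back at full ECC word width. The sketch below shows this for a 64-bit word under stated assumptions: the function name scrub_word64 is hypothetical, the pointer must be aligned and refer to a RAM with 64-bit ECC words (such as AXI SRAM), and a 32-bit variant would be needed for the 32-bit RAMs. It is an illustration of the idea, not the procedure from AN5342.

```c
#include <stdint.h>

/* Read one 64-bit word and write the same value back.
 * On the target, the read is expected to return the corrected data for a
 * single-bit error, and the full-width write stores it again so that the
 * stored ECC matches the data. */
static inline uint64_t scrub_word64(volatile uint64_t *word)
{
    uint64_t value = *word; /* read (corrected on the fly by the ECC logic) */
    *word = value;          /* write back at full ECC word width */
    return value;
}
```

In an ISR this would be called with the address computed from FAR; a data synchronization barrier before clearing the monitor flags, as in the handler posted in the comments, seems a reasonable precaution.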

Christophe VRIGNAUD
ST Employee

Hi @KORourke 

In the STM32H743 and the other STM32H7 series, only the DTCM is interleaved in this way.
AN5342 has a special focus on the STM32H5 and STM32H7 series microcontrollers but is not limited to them. It also deals with flash ECC. "other memories, which are interleaved" does not refer to the STM32H7 but to e.g. the dual-bank flash of the STM32G4.

In RAMECC2, the STM32H743 has 2 monitors each for SRAM1 (monitor 1 and monitor 2) and SRAM2 (monitor 3 and monitor 4), each monitor covering 64 Kbytes of the SRAM. In this case, the split only affects the start address.

- monitor 1 for SRAM1_0: RAM failing address = 0x3000 0000 + FADD * 4
- monitor 2 for SRAM1_1: RAM failing address = 0x3001 0000 + FADD * 4

- monitor 3 for SRAM2_0: RAM failing address = 0x3002 0000 + FADD * 4
- monitor 4 for SRAM2_1: RAM failing address = 0x3003 0000 + FADD * 4
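
The per-monitor start addresses can be kept in a small lookup table. The sketch below covers only RAMECC2 monitors 1 to 4 as listed in this reply, taking the entry at 0x3003 0000 to be monitor 4 (SRAM2_1) as stated in the sentence introducing the list. The names ramecc2_monitor_base and ramecc2_fadd_to_addr are hypothetical.

```c
#include <stdint.h>

/* Start address of the 64-Kbyte slice covered by RAMECC2 monitors 1..4
 * on the STM32H743. Index 0 is unused so the index equals the monitor number. */
static const uint32_t ramecc2_monitor_base[5] = {
    0u,
    0x30000000u, /* monitor 1: SRAM1_0 */
    0x30010000u, /* monitor 2: SRAM1_1 */
    0x30020000u, /* monitor 3: SRAM2_0 */
    0x30030000u  /* monitor 4: SRAM2_1 */
};

/* Translate FADD of a RAMECC2 monitor into a byte address (32-bit words).
 * Returns 0 for monitor numbers outside 1..4. */
static inline uint32_t ramecc2_fadd_to_addr(unsigned monitor, uint32_t fadd)
{
    if (monitor < 1u || monitor > 4u) {
        return 0u;
    }
    return ramecc2_monitor_base[monitor] + fadd * 4u;
}
```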

Christophe VRIGNAUD
ST Employee

Hi @theHolgi 

RAMECC1_M1SR is the status register of RAMECC1 monitor 1 = 0x52009024. This is the AXI SRAM monitor.

 

As RAMECC1_M1FAR (0x52009028) = 0x3ffd, the ECC failure is in AXI SRAM at address:

RAM failure address = 0x2400 0000 + 0x3ffd * 8 = 0x2401 FFE8.

 

The Failure address read in the FAR register is directly usable. You can forget all your mask stuff; it's not needed at all.

theHolgi
Associate II

@Christophe VRIGNAUD 

OK, let's do it one more time.

The ECC interrupt occurs; scanning the state registers reveals a 1 at address 0x52009044 (single bit failure).

[Screenshot from 2024-07-06 21-44-23]

This belongs to D1 domain (RAMECC1) monitor 2 (address offset 0x40), which is, according to RM0433 Table 11, assigned to ITCM-RAM which is 64 bit.
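
Going by the offsets listed in section 5 of the article, a register address can be decoded back into a monitor number and a register offset, which helps when a debugger view labels the registers differently. This is a minimal sketch under that assumption; the type ramecc_reg_id and the function ramecc_decode are hypothetical names.

```c
#include <stdint.h>

/* Result of decoding a RAMECC register address. */
struct ramecc_reg_id {
    unsigned monitor;    /* x in RAMECC_Mx...; 0 means the controller-level slot (IER) */
    uint32_t reg_offset; /* 0x00 CR, 0x04 SR, 0x08 FAR, 0x0C FDRL, 0x10 FDRH, 0x14 FECR */
};

/* Each monitor x occupies the 0x20-byte slot starting at ctrl_base + 0x20 * x. */
static inline struct ramecc_reg_id ramecc_decode(uint32_t ctrl_base, uint32_t reg_addr)
{
    uint32_t off = reg_addr - ctrl_base;
    struct ramecc_reg_id id = { off / 0x20u, off % 0x20u };
    return id;
}
```

With ctrl_base = 0x52009000, address 0x52009044 decodes to monitor 2 with offset 0x04 (status), and 0x52009048 to monitor 2 with offset 0x08 (failing address).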

FAR at 0x52009048 reveals 0x3F9C.
RAM address = ITCM base + (8 * FAR) = 0x00000000 + (8 * 0x3F9C) = 0x0001FCE0

This is an invalid address. The last valid address of ITCM would be 0x0000FFFF.
On the other hand, assuming that only 13 bits of FAR can possibly be valid, the content would be 0x1F9C, which results in RAM address 0x0000FCE0. That fits pretty well when comparing the content of the two failing data registers with the observed RAM content from the Memory monitor.

[Screenshot from 2024-07-06 21-44-39]

So the question still is, why is there sometimes a stray extra bit set. (I've never observed it at a higher position).
But obviously it can just be ignored. I have not yet observed this behavior for any other monitor than ITCM, actually.
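
The masking workaround described in this comment can be sketched as follows. Note that this reflects the poster's observation only; ST's reply in this thread states that the FAR value is directly usable, so treat the mask as an unconfirmed assumption. The helper names fadd_mask and masked_fadd_to_addr are hypothetical, and the mask derivation assumes the number of words in the RAM is a power of two.

```c
#include <stdint.h>

/* Mask of the FADD bits that can address a word inside the RAM:
 * (number of ECC words) - 1. Assumes ram_bytes / word_bytes is a power of two. */
static inline uint32_t fadd_mask(uint32_t ram_bytes, uint32_t word_bytes)
{
    return ram_bytes / word_bytes - 1u;
}

/* Drop FADD bits above the RAM size, then scale to a byte address. */
static inline uint32_t masked_fadd_to_addr(uint32_t ram_base, uint32_t ram_bytes,
                                           uint32_t word_bytes, uint32_t fadd)
{
    return ram_base + (fadd & fadd_mask(ram_bytes, word_bytes)) * word_bytes;
}
```

For the 64-Kbyte ITCM with 8-byte words the mask is 0x1FFF, so FAR = 0x3F9C becomes 0x1F9C and the resulting address is 0x0000FCE0, the value derived by hand above.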

Christophe VRIGNAUD
ST Employee

Hi @theHolgi 

Which IDE does this screenshot come from? And which version of the tool? On which OS?
Which debugger do you use? (I've found a picture of STM32CubeIDE on MacOS with JLink displaying the SFRs in the memory view: from this I would guess that you are using JLink, right?)

I don't have this view with STM32CubeIDE version 1.16.0 and STLink-V3 debugger. I have a separate SFR registers view as shown in the picture below.

[Screenshot: SFR registers view in STM32CubeIDE]

There are no RAMECC_M0xx registers; the list starts with RAMECC_M1xx for monitor 1.

It seems there is a shift in your view, with the values of monitor 1 displayed with the addresses of monitor 2.

Moreover, the RAMECC_M1FDRL = 0xA3837F0E. In memory, the value at @0x0000FCE0 is 0xA3837F1E, which is a bit different.
Have you verified that you don't find the content of the FDRL and FDRH registers in AXI SRAM?

theHolgi
Associate II

Hi @Christophe VRIGNAUD ,

I am using Eclipse on Linux; the connection is with JLink so it is using JLinkGDBServer on the host side.

The SFR view is determined by the SVD file you use. The official SVD has the ECC units like you are showing, but it is missing other peripherals so I am using another SVD that I found on the internet. Should make my own best-of...
Anyway, that does not invalidate the registers' content and is why I was always citing the register addresses to remove ambiguity.

About the difference between FDR and memory view, that's actually kind of interesting since with the single ECC error that is signaled, the content should still be unambiguous even on a second reading, which the debugger connection will do.

My overall impression is that the hardware does not like multiple reads on failing memory, or follow-up errors on a different address (the memory view reads a fairly large section of uninitialized RAM), and therefore I am generous about a 1-bit difference from the expected value.

I would make sure to NOT have any memory view open before you reach the fault handler and have written down the peripheral's registers, and then trust the memory view only for plausibility cross-reference.

Version history
Last update:
‎2024-07-01 07:03 AM