2023-09-14 02:13 AM - edited 2024-07-01 07:03 AM
Note: For simplicity and clarity, the description is based on the STM32H74x/5x series. The description is similar for the other STM32H7 series and can be found in the corresponding reference manual.
The number of controllers depends on the STM32H7 series.
RAMECC controller for Domain 1 (D1) = RAMECC1
RAMECC controller for Domain 2 (D2) = RAMECC2
RAMECC controller for Domain 3 (D3) = RAMECC3
The list of monitors varies depending on the STM32H7 series.
The STM32H74x/5x has one monitor for each RAM block in the MCU.
The details are given in table 11 of the reference manual.
From table 8 of the reference manual:
Boundary address | Peripheral |
---|---|
0x58027000 - 0x580273FF | RAMECC3 |
0x52009000 - 0x520093FF | RAMECC1 |
0x48023000 - 0x480233FF | RAMECC2 |
From table 143 of the reference manual:
Signal | Priority | NVIC position | Acronym | Description | Address offset |
---|---|---|---|---|---|
ramecc1_it | 152 | 145 | RAMECC1 | ECC diagnostic global interrupt for RAMECC D1 | 0x0000 0284 |
ramecc2_it | | | RAMECC2 | ECC diagnostic global interrupt for RAMECC D2 | |
ramecc3_it | | | RAMECC3 | ECC diagnostic global interrupt for RAMECC D3 | |
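The address offset in this table follows from the NVIC position. As a minimal sketch, assuming the standard Cortex-M vector table layout (16 system entries of 4 bytes each ahead of the external interrupts), and with `nvic_vector_offset` being a hypothetical helper name:

```c
#include <stdint.h>

/* Byte offset of an external interrupt's entry in the Cortex-M vector table.
 * The first 16 entries (initial stack pointer plus system exceptions) come
 * before the external interrupts; every entry is 4 bytes wide. */
static uint32_t nvic_vector_offset(uint32_t irq_position)
{
    return (16u + irq_position) * 4u;
}
```

For position 145 this gives 0x284, which matches the table.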
For RAMECC1, the boundary addresses are 0x52009000 - 0x520093FF (table 8 of the reference manual).
The addresses of the registers for each monitor are as follows:
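A minimal sketch of how these register addresses can be computed. It assumes the layout implied by the addresses quoted in the replies (RAMECC1 monitor 1 status register at 0x52009024, monitor 3 starting at 0x52009060): each monitor x occupies a 0x20-byte block at controller base + 0x20 * x, holding CR, SR, FAR, FDRL, FDRH and FECR as consecutive 32-bit words. The enum and function names are hypothetical; check the offsets against the reference manual for your device.

```c
#include <stdint.h>

/* Assumed per-monitor register offsets (verify against the reference manual). */
enum {
    RAMECC_MON_STRIDE = 0x20, /* distance between two monitor blocks */
    RAMECC_OFF_CR     = 0x00, /* configuration register              */
    RAMECC_OFF_SR     = 0x04, /* status register                     */
    RAMECC_OFF_FAR    = 0x08, /* failing address register            */
    RAMECC_OFF_FDRL   = 0x0C, /* failing data, low word              */
    RAMECC_OFF_FDRH   = 0x10, /* failing data, high word             */
    RAMECC_OFF_FECR   = 0x14  /* failing ECC error code              */
};

/* Address of one register of monitor `monitor` (numbered from 1). */
static uint32_t ramecc_reg_addr(uint32_t ctrl_base, uint32_t monitor, uint32_t reg_off)
{
    return ctrl_base + (uint32_t)RAMECC_MON_STRIDE * monitor + reg_off;
}
```

With the RAMECC1 base 0x52009000, monitor 2 then has its status register at 0x52009044 and its failing address register at 0x52009048.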
The information given by the RAMECC_MxFAR register is described in the reference manual as the failing address (FADD) for the monitor x.
Example with FADD = 0x2004:
- 64-bit ECC word (AXI SRAM, base 0x2400 0000): 0x2400 0000 + 0x2004 * 8 = 0x2401 0020
- 32-bit ECC word (SRAM1, base 0x3000 0000): 0x3000 0000 + 0x2004 * 4 = 0x3000 8010
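The calculation above can be sketched as a small helper. The function name and parameters are hypothetical; the word size (8 bytes for the 64-bit memories, 4 bytes for the 32-bit ones) has to be chosen by the caller according to the RAM behind the monitor.

```c
#include <stdint.h>

/* Convert a FADD value (an ECC word index) into a CPU byte address.
 * word_bytes is 8 for a 64-bit ECC word and 4 for a 32-bit ECC word. */
static uint32_t ramecc_failing_address(uint32_t ram_base, uint32_t fadd, uint32_t word_bytes)
{
    return ram_base + fadd * word_bytes;
}
```

Calling it with (0x24000000, 0x2004, 8) and (0x30000000, 0x2004, 4) reproduces the two results of the example.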
RAM ECC monitoring is described in the reference manual - section 3 - RAM ECC monitoring (RAMECC).
The error correction code (ECC) management and implementation on STM32 microcontrollers is described in application note AN5342.
This article was originally published 2022-06-02.
Before republication the article had 271 views.
Hi @Christophe VRIGNAUD ,
@Christophe VRIGNAUD Application note AN5342 also mentions that "A special case is DTCM and other memories, which are interleaved" and goes on to explain how to calculate the failing address for those.
Are there other memories that are interleaved? I can't find any information about this in the STM32H743 reference manual (I have RM0433 rev 7).
@Christophe VRIGNAUD
That does not really work out. I got the following notification (on STM32H753AII):
RAMECC1_M1SR (0x52009044) = 3 (Single and double ECC error)
OK, so this is D1 monitor 2 which belongs to ITCM. Should be a 64-bit bus.
RAMECC1_M1FAR (0x52009048) = 0x3ffd
RAMECC1_M1FDRL (0x5200904C) = 0x152f2fef
RAMECC1_M1FDRH (0x52009050) = 0xa9ebe8f8
ITCM base address is 0x00000000.
Failing address is supposed to be 0x00000000 + (0x3ffd * 8 ) = 0x01ffe8.
There is nothing mapped to this address; naively writing the data there results in HardFault.
Shall I mask the address register first to a few lower bits? (Since the maximum size is 64k, and scaling factor is 8, only the lower 13 bits could possibly be set for a valid address.)
Indeed, if I just mask the FAR content like described, I get 0x1ffd to work with which results in memory address 0x0FFE8. The content there in the Memory monitor matches the provided Data Register content.
Is it safe to just mask it like this? Are the FAR higher bits don't care or do they carry additional information?
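To make the masking idea from this post concrete, here is a minimal sketch. It assumes the RAM size is a power of two and that FADD counts ECC words; whether the upper FAR bits really are don't-care is exactly the open question, so treat this as an illustration of the arithmetic only. The function name is hypothetical.

```c
#include <stdint.h>

/* Mask covering the FADD bits that can address a RAM of ram_bytes bytes
 * organised in ECC words of word_bytes bytes (both powers of two). */
static uint32_t ramecc_fadd_mask(uint32_t ram_bytes, uint32_t word_bytes)
{
    return ram_bytes / word_bytes - 1u;
}
```

For the 64 kB ITCM with 8-byte words the mask is 0x1FFF, so 0x3ffd becomes 0x1ffd and the byte address becomes 0x0FFE8, as observed above.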
Putting together RM0433 table 11 (ECC controller mapping) and table 7 (memory layout),
I came to the conclusion that DTCM, SRAM1 and SRAM2 are those meant as "interleaved", which means that despite looking like contiguous RAM they are subdivided into 64 kB slices.
Rest of comment redacted: False speculation. See below.
@KORourke Now I know what the "interleaved" means.
For DTCM, it's not that the two RAM blocks are put one after another, like it is the case for D2 SRAM.
They are interwoven, i.e. we have a 64-bit bus but accessing two 32-bit units of RAM (of 64 kB = 16 kWords each) simultaneously. So for address calculation you multiply the FAR content by 8, but being a 32-bit monitor, each unit will only provide you data in the FDRL register.
DTCM ram | RAMECC1_Monitor3 (0x52009060) | DTCM ram | RAMECC1_Monitor4 (0x52009080) |
---|---|---|---|
0x20000000 | FAR == 0 | 0x20000004 | FAR == 0 |
0x20000008 | FAR == 1 | 0x2000000C | FAR == 1 |
0x20000010 | FAR == 2 | 0x20000014 | FAR == 2 |
... | ... | ... | ... |
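The mapping in this table can be written as a one-line formula. A minimal sketch, with hypothetical names, where `unit` is 0 for the RAM behind monitor 3 and 1 for the RAM behind monitor 4:

```c
#include <stdint.h>

#define DTCM_BASE_ADDR 0x20000000u /* DTCM base address on the STM32H74x/5x */

/* CPU address of the 32-bit word reported by a DTCM monitor.
 * unit: 0 = monitor 3 (lower word), 1 = monitor 4 (upper word).
 * Each FAR step advances by 8 bytes because the two RAMs are interleaved. */
static uint32_t dtcm_failing_address(uint32_t unit, uint32_t fadd)
{
    return DTCM_BASE_ADDR + 4u * unit + 8u * fadd;
}
```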
My complete RAM ECC handler now looks like this.
I added the backup SRAM for a complete example (i.e. I did not test it); adjust to your needs.
#include <stdbool.h>
#include <stdint.h>
#include "stm32h7xx_hal.h" /* RAMECC HAL driver and device memory map */

#define NUMBER_OF_ECC_MONITORS 11
/* One entry per monitor: its registers, the base of the RAM it covers,
 * and a mask for the FAR bits that can be valid for that RAM size. */
static struct {
    RAMECC_MonitorTypeDef* monitor;
    uint32_t ram_base;
    uint16_t mask;
} const RAMECC_Instances[NUMBER_OF_ECC_MONITORS] = {
    {RAMECC1_Monitor1, D1_AXISRAM_BASE, 0xffff}, /** RAMECC1 M1 : AXI SRAM (512 kB) */
    {RAMECC1_Monitor2, D1_ITCMRAM_BASE, 0x1fff}, /** RAMECC1 M2 : ITCM-RAM (64 kB) */
    /* Each DTCM unit holds 16 K 32-bit words addressed with a stride of 8 bytes,
     * so 14 FAR bits are valid (0x1fff would only reach the lower half of DTCM). */
    {RAMECC1_Monitor3, D1_DTCMRAM_BASE, 0x3fff}, /** RAMECC1 M3 : D0TCM-RAM (64 kB) */
    {RAMECC1_Monitor4, D1_DTCMRAM_BASE + 4, 0x3fff}, /** RAMECC1 M4 : D1TCM-RAM (64 kB) */
    {RAMECC2_Monitor1, D2_AHBSRAM_BASE + 0x00000, 0x3fff}, /** RAMECC2 M1 : SRAM1_0 (64 kB) */
    {RAMECC2_Monitor2, D2_AHBSRAM_BASE + 0x10000, 0x3fff}, /** RAMECC2 M2 : SRAM1_1 (64 kB) */
    {RAMECC2_Monitor3, D2_AHBSRAM_BASE + 0x20000, 0x3fff}, /** RAMECC2 M3 : SRAM2_0 (64 kB) */
    {RAMECC2_Monitor4, D2_AHBSRAM_BASE + 0x30000, 0x3fff}, /** RAMECC2 M4 : SRAM2_1 (64 kB) */
    {RAMECC2_Monitor5, D2_AHBSRAM_BASE + 0x40000, 0x1fff}, /** RAMECC2 M5 : SRAM3 (32 kB) */
    {RAMECC3_Monitor1, D3_SRAM_BASE, 0x3fff}, /** RAMECC3 M1 : SRAM4 (64 kB) */
    {RAMECC3_Monitor2, D3_BKPSRAM_BASE, 0x03ff} /** RAMECC3 M2 : Backup SRAM (4 kB) */
};
RAMECC_HandleTypeDef hramecc[NUMBER_OF_ECC_MONITORS];
// Direct-access handles for the two DTCM monitors
#define hramecc_d0 (&hramecc[2])
#define hramecc_d1 (&hramecc[3])
/**
 * @brief RAMECC Initialization Function
 */
void MX_RAMECC_Init(void)
{
    for (int n = 0; n < NUMBER_OF_ECC_MONITORS; ++n)
    {
        hramecc[n].Instance = RAMECC_Instances[n].monitor;
        if (HAL_RAMECC_Init(&hramecc[n]) != HAL_OK)
        {
            Error_Handler();
        }
        // Be notified about all errors
        HAL_RAMECC_StartMonitor(&hramecc[n]);
        HAL_RAMECC_EnableNotification(&hramecc[n], RAMECC_IT_MONITOR_ALL);
        HAL_RAMECC_EnableNotification(&hramecc[n], RAMECC_IT_GLOBAL_ALL);
    }
    HAL_NVIC_SetPriority(ECC_IRQn, 1, 0);
    HAL_NVIC_EnableIRQ(ECC_IRQn);
}
/* ECC interrupt handler */
void __USED ECC_IRQHandler(void)
{
    RAMECC_HandleTypeDef* handle;
    for (int n = 0; n < NUMBER_OF_ECC_MONITORS; ++n)
    {
        handle = &hramecc[n];
        if (handle->Instance->SR != 0) // Find out which handler fired
        {
            // All RAMECC1 busses are 64 bit; all others are 32 bit
            bool is_64bit = ((uint32_t)handle->Instance & ~0xff) == RAMECC1_BASE;
            uintptr_t address = HAL_RAMECC_GetFailingAddress(handle) & RAMECC_Instances[n].mask;
            address = RAMECC_Instances[n].ram_base + (address << (is_64bit ? 3 : 2));
            if (HAL_RAMECC_IsECCDoubleErrorDetected(handle))
            {
                // Handle Non-correctable ECC Error; e.g. reset
            }
            if ((handle == hramecc_d0) || (handle == hramecc_d1))
            {
                // Making an incomplete read on the failed RAM makes the monitor freak out completely.
                if ((hramecc_d0->Instance->SR != 0) && (hramecc_d1->Instance->SR != 0)
                    && (hramecc_d0->Instance->FAR == hramecc_d1->Instance->FAR))
                { // Problem with both RAMs; do a single 64-bit write.
                    uint32_t content_l = HAL_RAMECC_GetFailingDataLow(hramecc_d0);
                    uint32_t content_h = HAL_RAMECC_GetFailingDataLow(hramecc_d1);
                    *(uint64_t*)(address & ~0x07) = ((uint64_t)content_h << 32) | content_l;
                } else {
                    // Problem only with one; do a (incomplete) 32-bit write.
                    uint32_t content = HAL_RAMECC_GetFailingDataLow(handle);
                    *(uint32_t*)address = content;
                }
                __DSB();
                __HAL_RAMECC_CLEAR_FLAG(hramecc_d0, RAMECC_FLAGS_ALL);
                __HAL_RAMECC_CLEAR_FLAG(hramecc_d1, RAMECC_FLAGS_ALL);
            } else {
                if (is_64bit)
                {
                    uint64_t content = ((uint64_t)HAL_RAMECC_GetFailingDataHigh(handle) << 32) | HAL_RAMECC_GetFailingDataLow(handle);
                    *(uint64_t*)address = content;
                } else {
                    uint32_t content = HAL_RAMECC_GetFailingDataLow(handle);
                    *(uint32_t*)address = content;
                }
                __DSB();
                /* Clear all flags */
                __HAL_RAMECC_CLEAR_FLAG(handle, RAMECC_FLAGS_ALL);
            }
        }
    }
}
Notes: Since the bus is 64 bit, a 32-bit write to DTCM will be incomplete and will therefore read the other 32-bit word from the other RAM. This is where it gets messy.
When I treat both sides independently with 32-bit memory writes, successful handling of two simultaneously occurring failures (reading uninitialized RAM) is always followed by a false error signaled at address 0 with data 0.
When I try to put together a 64-bit write by combining the FDRL register with a 32-bit RAM read of the other part (if it is not corrupted), the successful handling of a single fault is followed by a false double fault with one data word correct and the other 0.
The above code does effectively the same thing: a 32-bit write to only the corrupted part must be combined with a read of the other RAM, and always resetting both monitors after the write is important since the write might raise new flags. Still, it works for me. It might hide failures (a double-bit failure in M4 does not run into the potential reset handler when it is handled together with a 1-bit failure in M3), but I find that case esoteric, and at least the code does not produce false flags that lead to memory nuking.
Any version would probably do when handling just randomly occurring bitflips. But if it can successfully sift through uninitialized RAM, it makes a robust impression.
I didn't use the HAL interrupt handler on purpose; it clears the flags before calling the user handler and I would have to find the instance of my structure twice.
In case the described problems are a hardware bug: my device is an STM32H753AII rev. V (DBGMCU_IDC = 0x20046450)
@theHolgi sorry I couldn't reply earlier, I was ill last week. I ended up working out the "interwoven" nature of DTCM RAMECC and ended up with code similar to yours (although in the form of a Zephyr device driver).
I was never able to test single error handling properly since I only ever saw double errors from reading uninitialised RAM. So I never saw the interesting problems you've found with 32-bit accesses to DTCM.
Hi @andy_long
1. DTCM-RAM is a specific RAM.
The reference manual (RM0433 for the STM32H743) explains that the DTCM bus is 2x32-bit and why it is so:
"The 2x32-bit DTCM bus allows load/load and load/store instruction pairs to be dual-issued on the DTCM memory."
The DTCM-RAM content is interleaved in D0TCM and D1TCM.
The AN5342 - section 3.1.3 Interpreting FAR - explains how to calculate the failing address in this case: the word size is 8 bytes. The 64 bits which are read at a given address are composed of 32 bits from D0TCM followed by 32 bits from D1TCM. Each 32-bit data has its own ECC.
In your case, depending on which of monitor 3 (D0TCM) or monitor 4 (D1TCM) has signalled the error, the 32-bit failing address is:
- for D0TCM: 0x2000 0000 + 0x3B22 * 8 = 0x2001 D910
- for D1TCM: 0x2000 0004 + 0x3B22 * 8 = 0x2001 D914
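The reverse direction can be useful as a cross-check: given a DTCM address, which of the two RAMs holds it and which FADD value would its monitor report? A minimal sketch under the interleaving described above (bit 2 of the offset selects D0TCM or D1TCM, the bits above it form the word index); the struct and function names are hypothetical.

```c
#include <stdint.h>

struct dtcm_location {
    uint32_t unit; /* 0 = D0TCM (monitor 3), 1 = D1TCM (monitor 4) */
    uint32_t fadd; /* word index expected in the monitor's FAR     */
};

/* Map a CPU address inside DTCM (base 0x20000000) to its RAM unit and FADD. */
static struct dtcm_location dtcm_locate(uint32_t cpu_addr)
{
    uint32_t offset = cpu_addr - 0x20000000u;
    struct dtcm_location loc = { (offset >> 2) & 1u, offset >> 3 };
    return loc;
}
```

Applied to 0x2001D910 and 0x2001D914 this yields FADD 0x3B22 for D0TCM and D1TCM respectively, consistent with the two results above.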
2. The ECC error code given in RAMECC_MxFECR is based on SECDED algorithm-Hamming code. This register can help to determine the error position, in the 32-bit data in the DTCM-RAM case.
For more details regarding position detection, please refer to the Hamming code standard.
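As an illustration of how a Hamming syndrome points at the failing bit, here is a minimal textbook sketch using the small Hamming(7,4) code. This is an assumption-laden toy: it is not the code layout used by the RAMECC hardware, it does not decode RAMECC_MxFECR, and it leaves out the extra overall parity bit that a SECDED code adds for double-error detection. All names are hypothetical.

```c
#include <stdint.h>

/* Syndrome of a codeword whose bit i holds code position i (positions 1..n,
 * bit 0 unused): the XOR of the positions of all set bits. For a valid
 * codeword it is 0; after a single bit flip it equals the flipped position. */
static uint32_t hamming_syndrome(uint32_t codeword, uint32_t n)
{
    uint32_t syndrome = 0;
    for (uint32_t pos = 1; pos <= n; ++pos) {
        if (codeword & (1u << pos)) {
            syndrome ^= pos;
        }
    }
    return syndrome;
}

/* Encode 4 data bits into Hamming(7,4): data at positions 3, 5, 6, 7,
 * parity at the power-of-two positions 1, 2, 4. */
static uint32_t hamming74_encode(uint32_t data)
{
    static const uint32_t data_pos[4] = {3, 5, 6, 7};
    uint32_t codeword = 0;
    for (uint32_t i = 0; i < 4; ++i) {
        if (data & (1u << i)) {
            codeword |= 1u << data_pos[i];
        }
    }
    /* Setting the parity bit at position p cancels bit p of the syndrome. */
    uint32_t syndrome = hamming_syndrome(codeword, 7);
    for (uint32_t p = 1; p <= 4; p <<= 1) {
        if (syndrome & p) {
            codeword |= 1u << p;
        }
    }
    return codeword;
}
```

Flipping, say, position 6 of an encoded value makes `hamming_syndrome` return 6, which is the sense in which the error code can help locate the failing bit.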
3. AN5342 - section 3.1.2 ECC ISR - gives a possible scenario which consists in rewriting the failing address data.
In case of double bit error, no correction is possible, the data is lost. It depends on your application if the data is recoverable from acquiring it again, calculating it again or getting it from a second storage area.
For example, if it is sensitive data in a safety application, you would store the same data or the complemented data in a separate area.
In case of single bit error, the error is only corrected on the data which has been read in the memory. But it remains in the original location. It is recommended to re-write the faulty location in order to remove the single bit error and to avoid the risk of a double bit error.
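A minimal sketch of such a re-write ("scrub") of a location after a single-bit error. It assumes that the read returns the corrected value and that the write-back regenerates the ECC; the access width should match the ECC word of the memory concerned (64 bit or 32 bit), and the special DTCM behaviour discussed in this thread is not addressed here. The function names are hypothetical.

```c
#include <stdint.h>

/* Read a 32-bit word and write the same value back to its location.
 * Returns the value that was read. */
static uint32_t scrub_word32(volatile uint32_t *location)
{
    uint32_t value = *location; /* read (corrected on the fly by the ECC logic) */
    *location = value;          /* write back so that the stored ECC is rebuilt */
    return value;
}

/* Same idea for memories with a 64-bit ECC word. */
static uint64_t scrub_word64(volatile uint64_t *location)
{
    uint64_t value = *location;
    *location = value;
    return value;
}
```

On the target, a data synchronization barrier after the write (as in the handler posted earlier in this thread) makes sure the write has completed before the flags are cleared.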
Hi @KORourke
In STM32H743 or other STM32H7 series, only DTCM has this interleave specificity.
The AN5342 has a special focus on the STM32H5 and STM32H7 series microcontrollers but doesn't concern only these microcontrollers. It is also dealing with flash ECC. "other memories, which are interleaved" does not refer to the STM32H7 but refers to e.g. the dual-bank flash of the STM32G4.
The STM32H743 has 2 monitors in RAMECC2 for each of SRAM1 (monitor 1 and monitor 2) and SRAM2 (monitor 3 and monitor 4), each monitor covering 64 Kbytes of the SRAM. But in this case, it only has an impact on the start address.
- monitor 1 for SRAM1_0: RAM failing address = 0x3000 0000 + FADD * 4
- monitor 2 for SRAM1_1: RAM failing address = 0x3001 0000 + FADD * 4
- monitor 3 for SRAM2_0: RAM failing address = 0x3002 0000 + FADD * 4
- monitor 4 for SRAM2_1: RAM failing address = 0x3003 0000 + FADD * 4
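The four start addresses can be kept in a small lookup table. A minimal sketch with hypothetical names, covering only RAMECC2 monitors 1 to 4 as listed above and returning 0 for any other monitor number:

```c
#include <stdint.h>

/* Start address of the 64-Kbyte slice covered by RAMECC2 monitors 1..4. */
static const uint32_t ramecc2_slice_base[4] = {
    0x30000000u, /* monitor 1: SRAM1_0 */
    0x30010000u, /* monitor 2: SRAM1_1 */
    0x30020000u, /* monitor 3: SRAM2_0 */
    0x30030000u  /* monitor 4: SRAM2_1 */
};

/* RAM failing address for RAMECC2 monitor 1..4; 0 if the monitor is out of range. */
static uint32_t ramecc2_failing_address(uint32_t monitor, uint32_t fadd)
{
    if (monitor < 1u || monitor > 4u) {
        return 0u;
    }
    return ramecc2_slice_base[monitor - 1u] + fadd * 4u;
}
```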
Hi @theHolgi
RAMECC1_M1SR is the status register of RAMECC1 monitor 1 = 0x52009024. This is the AXI SRAM monitor.
As RAMECC1_M1FAR (0x52009028) = 0x3ffd, the ECC failure is in AXI SRAM at address:
RAM failure address = 0x2400 0000 + 0x3ffd * 8 = 0x2401 FFE8.
The Failure address read in the FAR register is directly usable. You can forget all your mask stuff; it's not needed at all.
OK, let's do it one more time.
The ECC interrupt occurs; scanning the state registers reveals a 1 at address 0x52009044 (single bit failure).
This belongs to D1 domain (RAMECC1) monitor 2 (address offset 0x40), which is, according to RM0433 Table 11, assigned to ITCM-RAM which is 64 bit.
FAR at 0x52009048 reveals 0x3F9C.
RAM address = ITCM base + (8 * FAR) = 0x00000000 + (8 * 0x3F9C) = 0x0001FCE0
This is an invalid address. The last valid address of ITCM would be 0x0000FFFF.
On the other hand, assuming that only 13 bits of FAR can possibly be valid, the content would be 0x1F9C which results in RAM address 0x0000FCE0, which fits pretty well when comparing the content of the two failing data registers and the observed RAM content from Memory monitor.
So the question still is: why is there sometimes a stray extra bit set? (I've never observed it at a higher position.)
But obviously it can just be ignored. I have not yet observed this behavior for any other monitor than ITCM, actually.
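The plausibility check used in this post can be sketched as a simple range test. The helper name is hypothetical, and the 64 kB ITCM size at base 0 is taken from the discussion above.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr lies inside the RAM that starts at base and is size bytes long. */
static bool addr_in_ram(uint32_t addr, uint32_t base, uint32_t size)
{
    return addr >= base && (addr - base) < size;
}
```

With FAR = 0x3F9C the computed address 0x0001FCE0 fails the test for ITCM, while the 13-bit value 0x1F9C gives 0x0000FCE0, which passes.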
Hi @theHolgi
Which IDE does this screenshot come from? And which version of the tool? On which OS?
Which debugger do you use? (I've found a picture of STM32CubeIDE on MacOS with JLink displaying the SFRs in the memory view: from this I would guess that you are using JLink, right?)
I don't have this view with STM32CubeIDE version 1.16.0 and STLink-V3 debugger. I have a separate SFR registers view as shown in the picture below.
There are no RAMECC_M0xx registers; it starts with RAMECC_M1xx for monitor 1.
It seems there is a shift in your view, with the values of monitor 1 displayed with the addresses of monitor 2.
Moreover, the RAMECC_M1FDRL = 0xA3837F0E. In memory, the value at @0x0000FCE0 is 0xA3837F1E, which is a bit different.
Have you verified that you don't find the content of the FDRL and FDRH registers in AXI SRAM?
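To quantify "a bit different", the two values can be XORed and the set bits counted. A minimal sketch with a hypothetical helper name:

```c
#include <stdint.h>

/* Number of bit positions in which a and b differ. */
static uint32_t bit_diff_count(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    uint32_t count = 0;
    while (diff != 0u) {
        diff &= diff - 1u; /* clear the lowest set bit */
        ++count;
    }
    return count;
}
```

For 0xA3837F0E and 0xA3837F1E the XOR is 0x10, so the two values differ in exactly one bit (bit 4).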
Hi @Christophe VRIGNAUD ,
I am using Eclipse on Linux; the connection is with JLink so it is using JLinkGDBServer on the host side.
The SFR view is determined by the SVD file you use. The official SVD has the ECC units like you are showing, but it is missing other peripherals so I am using another SVD that I found on the internet. Should make my own best-of...
Anyway, that does not invalidate the registers' content and is why I was always citing the register addresses to remove ambiguity.
About the difference between FDR and memory view, that's actually kind of interesting since with the single ECC error that is signaled, the content should still be unambiguous even on a second reading, which the debugger connection will do.
My overall impression is that the hardware does not like multiple reads on failing memory, or follow-up errors on a different address (the memory view reads a fairly large section of uninitialized RAM), and therefore I am generous about a 1-bit difference from the expected value.
I would make sure to NOT have any memory view open before you reach the fault handler and have written down the peripheral's registers, and then trust the memory view only for plausibility cross-reference.