STM32H7 SRAM ECC Triggered unexpectedly

ChrisO · 2025-05-16

Hi,

I am testing the ECC on DTCM on the STM32H7A3ZI. At start of day I initialize all of DTCM apart from a single test location. I have a test which reads from that uninitialized location and successfully provokes the SRAM ECC interrupt, correctly reporting the address for the uninitialized location. (I am following application note AN5342).
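A minimal sketch of the start-of-day initialization described above, assuming the DTCM ECC is computed per 32-bit word so that one full-word write per location is enough to make its ECC bits valid. The helper name `ecc_init_words_except` and its parameters are hypothetical; they are not taken from AN5342 or from the actual project. On target, `base` would be the DTCM start address and `skip_index` the word offset of the test location.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write every 32-bit word in a region once so that its
 * ECC bits become valid, except for one word that is deliberately left
 * untouched to serve as the ECC test location. */
static void ecc_init_words_except(volatile uint32_t *base, size_t n_words,
                                  size_t skip_index)
{
    assert(base != NULL);
    assert(skip_index < n_words);

    for (size_t i = 0; i < n_words; i++) {
        if (i == skip_index) {
            continue;       /* leave the test word uninitialized */
        }
        base[i] = 0u;       /* full-word write, no read-modify-write */
    }
}
```

The write is a plain word store on purpose: a byte or halfword store to an uninitialized word would force a read-modify-write of the ECC word and could itself raise the error.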

However, I am finding that the same ECC error is raised in other circumstances, when I am confident that my code is not reading the test location. This happens when the IDE is not connected, so it is not the IDE which is reading that location.

I have another test that provokes flash ECC errors, and it is currently the code of this test that somehow provokes the SRAM ECC error as well as the flash ECC error. However I don't think this code is reading the uninitialized location (and I have also seen other code provoke the problem). The address reported in the FAR register is always that of my uninitialized test location.

I could imagine that the processor might read that location if caching or speculative read-ahead operations were happening, but I believe that no such caching would apply to DTCM.

Could you give any ideas about why this might be happening?

Thanks,

Chris

mƎALLEm · 2025-05-16

Hello @ChrisO and welcome to the ST community,

Maybe you need to explain in detail what these circumstances are in this statement:

@ChrisO wrote:

However, I am finding that the same ECC error is raised in other circumstances, when I am confident that my code is not reading the test location.

I don't think flash ECC errors are related to RAM ECC; the two ECC mechanisms are separate. But who knows: what happens if you disable that flash ECC test?

ChrisO · 2025-05-19

Hello @mƎALLEm ,

I fear that if I try to explain the "other circumstances" we may go down a rabbit hole talking about code which is not relevant. Let's talk about what I am experiencing right now, which is that the flash ECC test provokes the DTCM ECC error, apparently by accessing my uninitialized DTCM location.

The problem is observed by running the flash ECC test, so if I disable this test or simply don't run it, I don't observe the problem.

The flash ECC test works by making 2 consecutive program operations to a region of flash (bank 2, sector 127), changing 1 bit of data between the two writes. It then reads back from the flash, but first it invalidates the corresponding D-cache lines by calling SCB_InvalidateDCache_by_Addr(). This is necessary to ensure that the read operation actually reaches the memory.

When I run this test, the flash ECC error is correctly raised, but a DTCM ECC error is also raised on my uninitialized DTCM test location. Randomly, according to the way the board powers up, this may be a 1-bit or 2-bit error.

- if I comment out the call to SCB_InvalidateDCache_by_Addr(), the problem goes away.

- if I disable the cache completely (by commenting out the call to SCB_EnableDCache() in the initialization code), the problem goes away.

- if I define an MPU region covering the flash address range and set it to be non-cacheable, the problem goes away.

- if I define an MPU region covering the DTCM address range and set it to be non-cacheable, this makes no difference to the problem (as expected, because DTCM is never cached, as stated in the documentation).

So there seems to be strong evidence that the flash cache is causing reads to DTCM? This seems completely inexplicable...
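For reference when reasoning about which addresses the invalidation can touch: a by-address cache maintenance operation works on whole cache lines, so the requested range is effectively widened to line boundaries. The sketch below computes that widened range on the host, assuming the 32-byte D-cache line size of the Cortex-M7. `dcache_line_span` and `struct line_span` are hypothetical names, and this is not the CMSIS implementation.

```c
#include <assert.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u   /* assumed Cortex-M7 D-cache line size, bytes */

struct line_span {
    uint32_t first_line;   /* address of the first cache line touched */
    uint32_t line_count;   /* number of whole cache lines touched */
};

/* Hypothetical helper: widen [addr, addr + size) to cache-line boundaries. */
static struct line_span dcache_line_span(uint32_t addr, uint32_t size)
{
    assert(size > 0u);
    assert(size - 1u <= UINT32_MAX - addr);   /* range must not wrap around */

    struct line_span span;
    uint32_t last_byte = addr + size - 1u;
    uint32_t last_line = last_byte & ~(DCACHE_LINE_SIZE - 1u);

    span.first_line = addr & ~(DCACHE_LINE_SIZE - 1u);
    span.line_count = (last_line - span.first_line) / DCACHE_LINE_SIZE + 1u;
    return span;
}
```

A range that is line-aligned and a multiple of 32 bytes long stays within its own lines; an unaligned range also covers the neighbouring bytes that share its first or last line.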

Thanks,

Chris

waclawek.jan · 2025-05-19

What exactly are the symptoms and how do you observe them? Do you observe *both* the FLASH and the RAM ECC error at the same time? Are the addresses in any way correlated (e.g. all last N bits 1 or similar)?

And what happens if you modify the FLASH ECC test so that it does not perform the writes to the FLASH, but performs the cache invalidation? Is actual FLASH read necessary to invoke the problem?

JW

ChrisO · 2025-05-19

Yes, when I provoke the flash error it first hits the flash ECC IRQ handler then hits the SRAM ECC IRQ handler. Each of these will record the fact that the error happened along with the failing address information.

Currently the flash test location is 0x081FE000 and the DTCM test location is 0x2001FBE0. I have tried moving the DTCM test location around, that makes no difference to the problem and the DTCM test location is always correctly reported.
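To make the correlation question concrete, the two addresses can be compared on their low-order bits. A small host-side sketch, with `same_low_bits` as a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the n least significant bits of a and b are identical. */
static bool same_low_bits(uint32_t a, uint32_t b, unsigned n)
{
    assert(n <= 32u);
    /* Shifting a 32-bit value by 32 is undefined, so handle n == 32 apart. */
    uint32_t mask = (n == 32u) ? UINT32_MAX : ((UINT32_C(1) << n) - 1u);
    return ((a ^ b) & mask) == 0u;
}
```

For 0x081FE000 and 0x2001FBE0 the check holds for n = 5 but not for n = 6, so the only low bits the two addresses share are the zeros that come from both being 32-byte aligned.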

The problem is not observed if I do nothing but cache invalidation in the flash test, but I don't have to do the whole flash test to provoke it. It seems that doing the lock and unlock operations along with the invalidation is sufficient. But small and apparently irrelevant code changes (such as making a variable volatile or not) can affect the problem. Also, the problem is reproduced if I optimise for size but not if I optimise for debug.

You may be suspecting stack overflow, but I am pretty confident I have plenty of space on all my stacks, and the fact that the reported address moves as the test location moves would seem to be evidence against that.

Also: I have tried putting a data watchpoint (configured for read) on the DTCM test location. This is hit when I perform the DTCM ECC test, but not when I provoke the error via the flash ECC test.

Thanks,

Chris

waclawek.jan · 2025-05-19

I have no more questions/ideas at this point.

For the ST crew to investigate, they may want you to prepare a minimal but complete example together with the compiled binary/elf (given that the issue is prone to go away when particulars change).

JW

ChrisO · 2025-05-21

OK, I can't send my codebase, but will see if I can create similar problems with a simple example.

Attached is a project for the Nucleo-H7A3ZI-Q board and STM32CubeIDE; it is just the reference code with some small additions under user_src, in the linker script (STM32H7A3ZITXQ_FLASH.ld), and in the startup code (startup_stm32h7a3zitxq.s). All it does is run my function ECC_Init. If you set a breakpoint in ECC_IRQHandler you will see that this handler is being hit. It is reporting an error location in ITCM. By default, none of the DTCM or ITCM is being initialized in this test; but also, I don't think any DTCM or ITCM should be in use: according to the linker script, everything is laid out in flash and RAM. So I don't understand how this error arose.

If you change #if 0 to #if 1 on line 65 of startup_stm32h7a3zitxq.s then this will put in place the startup code to initialize all of ITCM and DTCM, and the error will not occur. After changing this back to 0 you would need to run (or at least load) the updated code, then stop and re-power, to return the TCMs to uninitialized state, before the error could be reproduced again.

I don't know if this test correctly relates to my original problem, but it seems at least a little similar, and if I can understand what is going on here, maybe it will help me to understand the original problem.

Thanks,

Chris

ChrisO · 2025-06-10

Hi,

It would be great if someone from ST could look at my example case above as this remains an unsolved mystery. Is this forum the best place for making that happen, or should I raise a ticket in some other way?

Thanks,

Chris