
Reproducible ECC error in TCM memory, following a certain access pattern

robojan
Associate

Hey all,

I have spent the last couple of days debugging a nasty issue on the STM32N657I0 MCU. The symptom was seemingly random crashes in my code, which luckily turned out to be easy to reproduce. Sometimes the ITCM memory suddenly became inaccessible, or some of its data had changed. To rule out my own code modifying the ITCM contents, I configured the MPU to disallow write operations to the ITCM memory. I am also not yet using DMA or anything else that could write to the ITCM.

After the bug triggers, I can no longer read the ITCM memory with either the debugger or the processor. All other memories still work and code in them still executes, so I get a nice crash log printed over serial.

When you encounter failing memory or other strange issues, you would think the core might be unstable. So I tried decreasing the frequency from 800 MHz to 200 MHz, which did not change the behavior at all. I also noticed that I was not setting VOS High, so I changed the voltage scaling. This didn't change the behavior either.

After a lot of investigation I was able to write a minimal piece of assembly that reliably triggers the issue. It consists of two instructions that together read memory twice.

    .section .test.trigger_bug_minimal,"ax",%progbits
    .global trigger_bug_minimal
    .type trigger_bug_minimal, %function
    .align 2
    // This function must be placed in the Flex-ITCM area, but the exact location doesn't seem to matter (0x10010000 - 0x1003FFFF)
    // In my tests it was placed at address 0x10011d70, 0x10031a18
    // When the function was placed at 0x10000350, the bug could NOT be triggered.
trigger_bug_minimal:
    // We must load some data from the ITCM area to trigger the bug. The offset doesn't seem to matter: I tried 0-24 and all triggered the bug.
    // The address that we load from here is where the ECC error is reported, e.g. a load from 0x10031a1c triggers an ECC error in DTCM0 at location 0x31a1c.
    // The reported location moves with the offset, but the ECC error is always triggered.
    // Sometimes the ECC error is single bit, sometimes double bit. When one firmware version triggers a single-bit error, it will always trigger single-bit errors.
    // The same holds for double-bit errors.
    // If I don't catch the ECC error interrupt, a busfault occurs and we will not reach the next instruction.
    ldr	r0, [pc, #0] // The firmware crashes at this line.
    // We must branch to some function in the Flex-ITCM area.
    // The exact location doesn't seem to matter, but it needs to be in the Flex-ITCM area.
    // The contents of the function don't seem to matter either.
    // Both b and bl trigger the bug. ldr r2, [r1, #0] also triggers the bug when r1 contains one of the triggering addresses listed below.
    // When this function is placed at 0x10011d70, the following observations were made:
    // The following addresses all trigger the bug: 0x10020000, 0x10030000, 0x10031a20
    // The following addresses do NOT trigger the bug: 0x10010000, 0x10018000, 0x10000000, 0x1001fffc, 0x1001fffe
    // When this function is placed at 0x10031a18, the following observations were made:
    // The following addresses all trigger the bug: 0x10020000, 0x10010000
    // The following addresses do NOT trigger the bug: 0x10030000, 0x10000000
    b	remote_function
    .size trigger_bug_minimal, .-trigger_bug_minimal

I annotated this code with the observations I used to narrow down where the problem could be. Both running the code and single-stepping through it with a debugger trigger the issue, and even when single-stepping it fails on the ldr instruction.
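To make the address in the annotations easier to check by hand, here is a small host-side sketch (Python, not part of the firmware) of how a Thumb `ldr r0, [pc, #imm]` resolves its load address: the PC reads as the instruction address plus 4, and the literal load uses that value aligned down to 4 bytes as its base. The helper name `thumb_literal_load_address` is made up for this illustration.

```python
def thumb_literal_load_address(instr_addr: int, imm: int = 0) -> int:
    """Effective address of a Thumb `ldr rX, [pc, #imm]` located at instr_addr."""
    pc = instr_addr + 4      # in Thumb state the PC reads as instruction address + 4
    base = pc & ~0x3         # LDR (literal) uses the PC aligned down to a 4-byte boundary
    return base + imm


# The two placements of trigger_bug_minimal mentioned in the annotations.
for func_addr in (0x10011D70, 0x10031A18):
    load_addr = thumb_literal_load_address(func_addr)
    print(f"ldr at {func_addr:#010x} loads from {load_addr:#010x}")
```

For the placement at 0x10031a18 this gives 0x10031a1c, which is the load address quoted in the annotations.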

This narrowed the issue down to a problem with the ITCM memory when it has been expanded with the FLEXRAM. I have configured the FLEXRAM to give me 256KB of ITCM and 256KB of DTCM. (Btw, I do not like that you have to do a POR to apply that config change. On the i.MX RT from NXP you didn't have to do that, but nvm.) In normal operation the full 256KB of the ITCM is accessible.

My guess is that there are multiple 64KB banks of FLEXRAM, each of which can be configured as DTCM, ITCM or AXI RAM. When we do 2 consecutive reads from different banks, e.g. a read from 0x10031a1c followed by one from 0x10020000, we trigger an ECC failure, and either the memory is corrupted or the ITCM stops working completely. As annotated in the code, 2 reads from the same bank do not cause an issue, and reads from the fixed ITCM never cause an issue.
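As a sanity check of that guess, the following host-side sketch (Python, not firmware) replays the observations from the assembly annotations against a simple model: the bug is predicted only when both accesses land in the Flex-ITCM area and fall into different banks. The bank size and the assumption that banks are aligned to the start of the Flex-ITCM area are hypotheses under test, not documented facts, and the function names are made up for this illustration.

```python
FLEX_BASE = 0x10010000   # start of the Flex-ITCM area (from the annotations)
FLEX_END = 0x1003FFFF    # end of the Flex-ITCM area (from the annotations)


def flex_bank(addr, bank_size):
    """Bank index inside the Flex-ITCM area, or None for the fixed ITCM.
    ASSUMPTION: banks are bank_size bytes and start at FLEX_BASE."""
    if not FLEX_BASE <= addr <= FLEX_END:
        return None
    return (addr - FLEX_BASE) // bank_size


def predicts_bug(func_addr, second_addr, bank_size):
    """Model: the bug triggers when the literal load and the second access
    hit two different flex banks."""
    load_addr = (func_addr + 4) & ~0x3   # address read by `ldr r0, [pc, #0]`
    a = flex_bank(load_addr, bank_size)
    b = flex_bank(second_addr, bank_size)
    return a is not None and b is not None and a != b


# (address of trigger_bug_minimal, second access address, bug observed)
OBSERVATIONS = [
    (0x10011D70, 0x10020000, True),
    (0x10011D70, 0x10030000, True),
    (0x10011D70, 0x10031A20, True),
    (0x10011D70, 0x10010000, False),
    (0x10011D70, 0x10018000, False),
    (0x10011D70, 0x10000000, False),
    (0x10011D70, 0x1001FFFC, False),
    (0x10011D70, 0x1001FFFE, False),
    (0x10031A18, 0x10020000, True),
    (0x10031A18, 0x10010000, True),
    (0x10031A18, 0x10030000, False),
    (0x10031A18, 0x10000000, False),
]


def mismatches(bank_size):
    """Observations that the model gets wrong for a given bank size."""
    return [(f, t, seen) for f, t, seen in OBSERVATIONS
            if predicts_bug(f, t, bank_size) != seen]


for size_kb in (32, 64, 128):
    bad = mismatches(size_kb * 1024)
    print(f"{size_kb:>3} KB banks: {len(bad)} of {len(OBSERVATIONS)} observations contradicted")
```

Of the three sizes tried, only 64KB leaves no observation contradicted, which fits the guess above. It does not prove how the hardware is actually organized.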

So, my questions would be:

1. Is this an issue in the silicon as I suspect?

2. What can I do to work around this?

3. Why does the Cortex-M55 report ECC errors in the DTCM0 bank, even though all the memory is in the ITCM?

 

You can find the full source code of my application that triggers the issue here: https://gitlab.com/robojan/micro-robot-extended/microrobot-v3-fw/-/blob/flex-tcm-bug/bsp/microrobot-v3/AppliSecure/Core/Src/crash_entry_armv8m.S?ref_type=heads#L61

I have attached a firmware file that triggers the bug. If needed I could make a GDB script that loads the firmware onto the MCU (that is a complex sequence of steps, mainly due to a separate FSBL firmware).

 
