NMI Fault without any obvious fault bits set

turbofish · ‎2025-02-12

Hi everyone,

I'm currently experiencing a strange crash on an STM32G473 and I'm a bit stumped on how to debug it.

First the crash:

The system is an STM32G473 running FreeRTOS V10.5.1. I have a simple FDCAN ISR which takes the incoming CAN frames and pushes them to a FreeRTOS queue for use in user space in a task; nothing fancy beyond that, just transforming the data into a nicer structure, clearing the FIFO and so on. During high bus loads, however, the FDCAN ISR will sometimes fire during a FreeRTOS context switch. When it does, the system crashes, which is a problem in and of itself, but the main issue is that the NMI is triggered and not the HardFault.

SRAM Parity is enabled, but not Clock Security.

Attached is the NMI handler. I'm checking every bit in HFSR, CFSR and AFSR, plus the SRAM parity error bit, the flash ECC bits and even the CSS bits. I'm stepping through the function and checking if anything could have triggered the NMI. None of the relevant bits are set to 1, however. I can clearly see that the memcpy() in the FreeRTOS queue push is triggering the NMI, and the call stack in the debugger is very clear that an FDCAN ISR was triggered during the context switch.

Other things tried: I've checked the vector table to see if the NMI handler had ended up at another position, which it hasn't. The HardFault handler works as intended (it's a simple while(1) { __asm("nop"); } right now). I've checked the errata and couldn't find anything related to the NMI.

So my questions are: How do I properly debug the NMI, and why does it trigger instead of a HardFault? Are there any more registers I need to check to determine why we are in the fault handler?

/**
 * @brief Assembler part of the NMI handler
 *
 * Determine which stack pointer (MSP or PSP) was in use when the system crashed.
 * Put the stack pointer into r0 and call a C function to handle the exception.
 * R0 will be the first argument to the C function and we can unwind the stack
 */
__attribute__((naked)) void NMI_Handler(void)
{
    __asm(
        "TST LR, #4\n"        /* Check EXC_RETURN value in LR */
        "ITE EQ\n"            /* If equal (zero), use MSP; else, use PSP */
        "MRSEQ R0, MSP\n"     /* Move MSP to r0 if LR[2] == 0 */
        "MRSNE R0, PSP\n"     /* Move PSP to r0 if LR[2] != 0 */
        "B nmi_handler_c\n"); /* Branch to the C handler passing r0 (stack pointer) as argument. */
}

/**
 * @brief C-part of the NMI handler
 *
 * @param stacked_registers Pointer to the stack
 */
void nmi_handler_c(unsigned int* stacked_registers)
{
    volatile unsigned int hfsr = SCB->HFSR;   /* Hard Fault Status Register */
    volatile unsigned int cfsr = SCB->CFSR;   /* Configurable Fault Status Register */
    volatile unsigned int mmfar = SCB->MMFAR; /* Memory Management Fault Address Register */
    volatile unsigned int bfar = SCB->BFAR;   /* Bus Fault Address Register */
    volatile unsigned int afsr = SCB->AFSR;   /* Auxiliary Fault Status Register */
    volatile unsigned int sram_parity = SYSCFG->CFGR2 & SYSCFG_CFGR2_SPF;
    volatile unsigned int flash_error = FLASH->ECCR & (FLASH_ECCR_ECCD2 | FLASH_ECCR_ECCD);
    volatile unsigned int css_error = RCC->CIFR & (RCC_CIFR_CSSF | RCC_CIFR_LSECSSF);

    // --- SRAM and Flash Parity errors ---
    if (sram_parity)
    {
        // SRAM parity failed
        __asm("nop");
    }

    if (flash_error)
    {
        // Flash ECC error
        __asm("nop");
    }

    if (css_error)
    {
        // Clock Security error
        __asm("nop");
    }

    // --- Memory Management Fault Analysis (CFSR bits 0-7) ---
    if (cfsr & (1 << 0))
    {
        // IACCVIOL: An instruction access violation occurred.
        __asm("nop");
    }
    if (cfsr & (1 << 1))
    {
        // DACCVIOL: A data access violation occurred.
        __asm("nop");
    }
    if (cfsr & (1 << 3))
    {
        // MUNSTKERR: Unstacking error during exception return.
        __asm("nop");
    }
    if (cfsr & (1 << 4))
    {
        // MSTKERR: Stacking error during exception entry.
        __asm("nop");
    }
    if (cfsr & (1 << 5))
    {
        // MLSPERR: Lazy state preservation error occurred.
        __asm("nop");
    }

    if (cfsr & (1 << 7))
    {
        // MMARVALID is set: The MMFAR register holds a valid memory fault address.
        // Check mmfar to see the address that triggered the memory management fault.
        mmfar = mmfar;
        __asm("nop");
    }

    // --- Bus Fault Analysis (CFSR bits 8-15) ---
    if (cfsr & (1 << 8))
    {
        // IBUSERR: An instruction bus error occurred.
        __asm("nop");
    }
    if (cfsr & (1 << 9))
    {
        // PRECISERR: A precise data bus error occurred.
        __asm("nop");
    }
    if (cfsr & (1 << 10))
    {
        // IMPRECISERR: An imprecise data bus error occurred.
        __asm("nop");
    }
    if (cfsr & (1 << 11))
    {
        // UNSTKERR: Unstacking error during exception return (bus fault).
        __asm("nop");
    }
    if (cfsr & (1 << 12))
    {
        // STKERR: Stacking error during exception entry (bus fault).
        __asm("nop");
    }
    if (cfsr & (1 << 13))
    {
        // LSPERR: Lazy state preservation error on bus fault.
        __asm("nop");
    }
    // ---------------------- Bus Fault Analysis ------------------------
    if (cfsr & (1 << 15))
    {
        // BFARVALID is set: The BFAR register holds a valid bus fault address.
        // Check bfar to see the address related to the bus fault.
        bfar = bfar;
        __asm("nop");
    }

    // --- Usage Fault Analysis (CFSR bits 16-31) ---
    if (cfsr & (1 << 16))
    {
        // UNDEFINSTR: An undefined instruction was executed.
        __asm("nop");
    }
    if (cfsr & (1 << 17))
    {
        // INVSTATE: Invalid state occurred (possibly an invalid EPSR value).
        __asm("nop");
    }
    if (cfsr & (1 << 18))
    {
        // INVPC: Invalid PC load; may indicate a bad EXC_RETURN value.
        __asm("nop");
    }
    if (cfsr & (1 << 19))
    {
        // NOCP: Attempted to use a coprocessor that is not present.
        __asm("nop");
    }
    if (cfsr & (1 << 24))
    {
        // UNALIGNED: Unaligned access error occurred.
        __asm("nop");
    }
    if (cfsr & (1 << 25))
    {
        // DIVBYZERO: Division by zero error occurred.
        __asm("nop");
    }

    // --- Hard Fault Status Analysis (HFSR) ---
    if (hfsr & (1 << 1))
    {
        // VECTTBL: Bus fault on vector table read during exception processing.
        __asm("nop");
    }
    if (hfsr & (1 << 30))
    {
        // FORCED: A configurable fault (memory management, bus, or usage fault) escalated to a hard fault.
        __asm("nop");
    }
    __asm("bkpt 1");
}
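
The handler above receives stacked_registers but never reads it. A minimal sketch of how that frame could be decoded follows. The first eight stacked words have the same layout whether or not FPU context was also pushed. The names exception_frame_t, decode_exception_frame and interrupted_exception_number are hypothetical and not part of CMSIS or the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First eight words pushed by a Cortex-M core on exception entry. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12;
    uint32_t lr;   /* LR of the interrupted code */
    uint32_t pc;   /* return address: where the interrupted code would resume */
    uint32_t xpsr; /* program status of the interrupted code */
} exception_frame_t;

/* Hypothetical helper: copy the stacked words into a named struct. */
static exception_frame_t decode_exception_frame(const uint32_t *stacked_registers)
{
    assert(stacked_registers != NULL);

    exception_frame_t frame = {
        .r0   = stacked_registers[0],
        .r1   = stacked_registers[1],
        .r2   = stacked_registers[2],
        .r3   = stacked_registers[3],
        .r12  = stacked_registers[4],
        .lr   = stacked_registers[5],
        .pc   = stacked_registers[6],
        .xpsr = stacked_registers[7],
    };
    return frame;
}

/* xPSR[8:0] holds the exception number that was active in the interrupted
 * context: 0 means thread mode, 16 and above are external interrupts. */
static uint32_t interrupted_exception_number(const exception_frame_t *frame)
{
    return frame->xpsr & 0x1FFu;
}
```

Called at the top of nmi_handler_c, this would give the return address of the interrupted code and, via the exception number (IRQ number plus 16 for external interrupts), a way to confirm from inside the handler which ISR, if any, was running when the NMI was taken.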

ahsrabrifat · ‎2025-02-12

The crash during memcpy() may occur because of a possible stack overflow or misaligned access.

Check the task stack size and ISR stack size:

uxTaskGetStackHighWaterMark(myTaskHandle);

Increase configMINIMAL_STACK_SIZE and check the FreeRTOS heap settings.
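
For context, a high-water mark of this kind works by painting the stack with a known byte and later counting how much of the paint is still intact. The sketch below shows that idea on a plain buffer, assuming a descending stack and the 0xA5 fill byte that FreeRTOS uses for task stacks. paint_stack and stack_bytes_never_used are hypothetical names, and unlike uxTaskGetStackHighWaterMark() the result here is in bytes rather than stack words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL_BYTE 0xA5u /* assumed fill value, matching FreeRTOS task stacks */

/* Paint the whole stack area before it is put into use. */
static void paint_stack(uint8_t *stack, size_t size)
{
    assert(stack != NULL);
    memset(stack, STACK_FILL_BYTE, size);
}

/* A descending stack grows from the highest address towards stack[0], so the
 * untouched part is the run of fill bytes starting at the lowest address. */
static size_t stack_bytes_never_used(const uint8_t *stack, size_t size)
{
    assert(stack != NULL);

    size_t untouched = 0;
    while (untouched < size && stack[untouched] == STACK_FILL_BYTE) {
        untouched++;
    }
    return untouched;
}
```

The same approach can be applied to the main (MSP) stack used by ISRs, which the FreeRTOS per-task call does not cover.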

turbofish · ‎2025-02-12

Stack Overflows and Misalignment should result in a hardfault, not an NMI if I understand it correctly. I've had plenty of stack overflows and misalignments in this project and we always end up in a hardfault.

Or do those errors in an ISR automatically get escalated to an NMI instead of a hardfault, and how do you debug them?

AScha.3 · ‎2025-02-12

OK, but I would just try making the stack and FreeRTOS buffer areas bigger, just as a test.

If nothing changes, you know there is something else to look for.

If you feel a post has answered your question, please click "Accept as Solution".

turbofish · ‎2025-02-12

I've tried to increase the stacks, not the FreeRTOS buffer areas though. It's quite a big, complex system with lots of peripherals active and lots of ISRs firing. It's interesting that it's only the FDCAN that messes up the core and only during context switching; shouldn't smashing the stack result in a Hardfault?

The test is simply spamming the unit with short CAN frames (DLC=1) with a CAN ID not handled by the application, just to stress-test the ISR. By sending every ms it crashes quite fast. I've tried disabling most if not all of the other tasks running just to see if it made any difference but it didn't.

As I understand it, the only ways to reach the NMI are Flash ECC errors, SRAM parity errors, Clock Security errors and (this I'm not too certain about) faults-in-faults, as when you mess up your hardfault handler and generate another hardfault. There has to be some bit set in some register somewhere telling me WHY I'm fiddling around in the NMI, but I can't find it.

AScha.3 · ‎2025-02-12

Right, afaik ECC etc. errors should give a Hardfault - but just think: if some area in RAM is unintentionally written (maybe by a CAN buffer) and this is used as a jump or offset for a jump, it will jump wherever this points to ...

(I never had exactly this problem, so just guessing...) And still more complex, because an RTOS running.

What i would try , to narrow down the problem:

1. make buffers and memory for CAN task bigger - just to see, any change in behavior , or not. (then set back.)

2. try different speed of stress-test the ISR , to see: is it a timing problem (seems to be something like that).

Maybe you can see: CAN frames at a rate of 2ms working, but at 1ms then crash.

3. Also, if it is some timing-related problem, try compiler options: optimizer -O2, -Ofast or -O0; maybe some effect.

+ try changing the INT priority for CAN, maybe the problem is here.

4. Stop all RTOS action, no task switching, have just this task with CAN handling, same as a "normal" program would run; then check with your stress-test : happens also , or never...

5. Try to see, where is stack and what was there, last actions before going to NMI .

----

I had a strange problem in AzureRTOS, copied a working program there, to make it "better" (multitasking etc.).

But the jump to a (working before !) decoder always giving a hardfault, "imprecise address" or so, but could not find out, why - the error happened before the first instruction there . So what then ?

(With help and ideas from forum members here -> )

Just as a blind attempt i increased the task stack to strange 50kB or so - and no more error !

Then I reduced it... and found: it needs a 24 KB stack, then no problem any more. That's why I had "no. 1": make buffers and stack a lot bigger, just to see whether it helps or is useless.

turbofish · ‎2025-02-14

Thanks for the reply!

I have some more information which might be relevant (or not).

We've tried to REDUCE the heap in FreeRTOS so it would fit in the first 80 KB of the SRAM (it was much larger than needed). We saw that for some reason the stack pointer (PSP, not MSP) was fiddling around the boundary between the first bank of RAM and the second (in the G473 there is 80 KB in the first, 16 KB in the second and so on...) and we thought that maybe there was an issue with getting an ISR during a context switch in FreeRTOS if the switch was moving around data in that area. Lo and behold, the system didn't crash. So then we tried to move the heap in RAM so that it sits right across the boundary to trigger the crash, but it still didn't crash. So the placement of the heap seems to "fix" the issue. My gut feeling tells me that we are simply masking the issue and haven't resolved it.

We've tried different compiler optimization flags; -O0 doesn't trigger the bug, but -O1 and -O2 do. We've analyzed the assembler and we can see that with more aggressive optimization the compiler will inline some things, which will trigger the bug. The crash occurs quite a few instructions into memcpy() as seen in the picture below (using a custom in-house tool for debugging crashes). The registers are from the stack frame.

We've also tried to move around the various priorities, both for the FreeRTOS tasks and the various interrupts, but no luck there. The high water marks for the stack usage in the various tasks are quite far from the allocated stack sizes; and during context switching and the CAN ISR we should be using MSP and not PSP, correct? The system stack (for MSP) is 1 KB and we are using about 300 bytes, so there's quite a lot left.
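
On the MSP/PSP question, handlers themselves always run on MSP, but the exception frame is pushed onto whichever stack the interrupted code was using, and the EXC_RETURN value in LR records which one that was. A minimal decoding sketch, assuming a Cortex-M4 with FPU as in the G473, is given below. exc_return_info_t and decode_exc_return are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool from_thread_mode;  /* bit 3 set: the interrupted code ran in thread mode */
    bool frame_on_psp;      /* bit 2 set: the frame was pushed on the process stack */
    bool extended_fp_frame; /* bit 4 clear: FPU registers were stacked as well */
} exc_return_info_t;

/* Hypothetical helper: decode the EXC_RETURN value found in LR on handler entry. */
static exc_return_info_t decode_exc_return(uint32_t exc_return)
{
    /* On a Cortex-M4F every EXC_RETURN value has bits [31:5] set. */
    assert((exc_return & 0xFFFFFFE0u) == 0xFFFFFFE0u);

    exc_return_info_t info;
    info.from_thread_mode  = (exc_return & (1u << 3)) != 0u;
    info.frame_on_psp      = (exc_return & (1u << 2)) != 0u;
    info.extended_fp_frame = (exc_return & (1u << 4)) == 0u;
    return info;
}
```

An NMI that preempts another ISR would be expected to show 0xFFFFFFF1 (or 0xFFFFFFE1 with FPU context), meaning the frame is on MSP, while one that preempts a FreeRTOS task would show 0xFFFFFFFD or 0xFFFFFFED with the frame on that task's PSP stack.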

If we reduce the bus load on CAN, the bug will still trigger, but it will take longer time.

I still can't figure out why NMI is triggered and not Hardfault. Is it by design that if you do something silly and forbidden during an ISR you will go to NMI and not Hardfault? I've done several deep dives into the datasheets and can't figure out what the nature of the crash is; usually there is SOME bit set in some register telling me why the crash occurred, but everything is zeroed out.

[Image: Crash analysis.]

[Image: Call stack from the debugger.]

AScha.3 · ‎2025-02-14

So :

- INT prio - ok.

- O0 vs O2 : something happens - whatever is wrong , shows more on optimized code...

- heap position: changes behavior...

NMI can come from your RAM settings:

or flash:

or CSS:

So check your settings - or change them, to see...effect or not.

turbofish · ‎2025-02-18

Hi again,

We managed to solve the issue, but it still leaves some questions.

In the reset handler, where we zero out the BSS and copy variables from flash to RAM, we added a memory check (this was planned anyway) to see if there was any hardware fault. We write 0x55 and 0xAA (to test all bits) over the entire SRAM and then read back to verify that nothing funky was happening and voilà, no more NMI faults. Parity was already enabled in the option bytes. The check only adds a few ms to startup time.

From the datasheet:

It's only advised to do this, but it seems to be required in order to not have the NMI. Also, when the crash occurs the SRAM Parity Error Flag (SPF) in SYSCFG_CFGR2 is most definitely NOT set.

If we turn off the SRAM Parity setting in the option bytes, the system works as intended even without the RAM test.

So the questions are: is it required, or only advised, to initialize the entire SRAM during startup to avoid NMIs with parity turned on? And is there any way to check for a parity error other than the Parity Error Flag, or should this maybe be in the errata?

Thanks for all the feedback folks!
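
The startup check described in this post can be sketched roughly as follows. This is only an illustration of the 0x55/0xAA write and read-back pass over a memory range, not the code actually used in the thread, and ram_pattern_test is a hypothetical name. A plausible reason it helps is that the parity bits of a RAM location are only valid once that location has been written, so a full write pass leaves no uninitialized location to be read later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Write each pattern over the whole range, then read it back.
 * Returns true if every word held both patterns correctly. */
static bool ram_pattern_test(volatile uint32_t *start, size_t words)
{
    static const uint32_t patterns[] = { 0x55555555u, 0xAAAAAAAAu };

    assert(start != NULL);

    for (size_t p = 0; p < sizeof patterns / sizeof patterns[0]; p++) {
        /* Full-word writes touch every byte of the range. */
        for (size_t i = 0; i < words; i++) {
            start[i] = patterns[p];
        }
        for (size_t i = 0; i < words; i++) {
            if (start[i] != patterns[p]) {
                return false;
            }
        }
    }
    return true;
}
```

On the target this would have to run early in the reset handler, before .data and .bss are set up, with the start address and length taken from the linker script, and without the routine's own stack sitting inside the range being overwritten.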