Interrupt while in FreeRTOS PendSVHandler later causing NMI

AndreasLOPettersson · ‎2025-01-30

I am using an STM32G473 with FreeRTOS and am having issues with some interrupts sometimes causing an NMI, making the processor hang in the NMI_Handler (until the watchdog resets it if enabled).

The issue is intermittent, but I can always eventually trigger it by doing CAN block transfers, which produce many CAN Rx interrupts in quick succession.

From what I can see in the Call Stack, every single time I detect the issue, the interrupt seems to arrive/trigger when I am inside of a task switch. Example of the call stack:

So, here we have a task that has called vTaskDelay. The code pointed to in the delay function is portYIELD_WITHIN_API. My understanding is that this will cause a jump to xPortPendSVHandler. My port of xPortPendSVHandler looks like this:


void xPortPendSVHandler( void )
{
    /* This is a naked function. */

    __asm volatile
    (
        "	mrs r0, psp							\n"
        "	isb									\n"
        "										\n"
        "	ldr	r3, pxCurrentTCBConst			\n"/* Get the location of the current TCB. */
        "	ldr	r2, [r3]						\n"
        "										\n"
        "	tst r14, #0x10						\n"/* Is the task using the FPU context?  If so, push high vfp registers. */
        "	it eq								\n"
        "	vstmdbeq r0!, {s16-s31}				\n"
        "										\n"
        "	stmdb r0!, {r4-r11, r14}			\n"/* Save the core registers. */
        "	str r0, [r2]						\n"/* Save the new top of stack into the first member of the TCB. */
        "										\n"
        "	stmdb sp!, {r0, r3}					\n"
        "	mov r0, %0 							\n"
        "	msr basepri, r0						\n"
        "	dsb									\n"
        "	isb									\n"
        "	bl vTaskSwitchContext				\n"
        "	mov r0, #0							\n"
        "	msr basepri, r0						\n"
        "	ldmia sp!, {r0, r3}					\n"
        "										\n"
        "	ldr r1, [r3]						\n"/* The first item in pxCurrentTCB is the task top of stack. */
        "	ldr r0, [r1]						\n"
        "										\n"
        "	ldmia r0!, {r4-r11, r14}			\n"/* Pop the core registers. */
        "										\n"
        "	tst r14, #0x10						\n"/* Is the task using the FPU context?  If so, pop the high vfp registers too. */
        "	it eq								\n"
        "	vldmiaeq r0!, {s16-s31}				\n"
        "										\n"
        "	msr psp, r0							\n"
        "	isb									\n"
        "										\n"
        #ifdef WORKAROUND_PMU_CM001 /* XMC4000 specific errata workaround. */
            #if WORKAROUND_PMU_CM001 == 1
                "			push { r14 }				\n"
                "			pop { pc }					\n"
            #endif
        #endif
        "										\n"
        "	bx r14								\n"
        "										\n"
        "	.align 4							\n"
        "pxCurrentTCBConst: .word pxCurrentTCB	\n"
        ::"i" ( configMAX_SYSCALL_INTERRUPT_PRIORITY )
    );
}

Which asm instruction we are at changes from run to run, but in this case we are at

stmdb r0!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}

Here it looks like we get a CAN Rx interrupt. We end up in the handler for that interrupt and traverse down to where it wants to put some data into a queue. The queue is very large and therefore will not be full. The memcpy referred to in the call stack is the copy of the data into the write position of the queue.

The variables/pointers used in the memcpy look OK.

After the memcpy we seem to end up in the NMI handler. I do not know why. I have tried to modify the NMI handler to see what the issue is but without luck. My NMI handler:


void NMI_Handler(void)
{
    if (SYSCFG->CFGR2 & 0x100)
    {
        /* SRAM parity err */
        while (1) {
            __asm("bkpt 5");
        }
    }
    if (FLASH->ECCR & 0xf0000000)
    {
        /* FLASH ECC err */
        while (1) {
            __asm("bkpt 6");
        }
    }
    if (RCC->CIFR)
    {
        /* CSS err */
        while (1) {
            __asm("bkpt 7");
        }
    }

    while (1) {
        __asm("bkpt 8");
    }
}

I always end up in the "bkpt 8". The if-cases are designed from what I could read in the reference manual, but they might be incorrect.
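A possible tightening of the checks above, written as a plain function so it can be tried off-target. The bit positions are assumptions (SPF in SYSCFG_CFGR2 bit 8, ECCD/ECCD2 in FLASH_ECCR bits 31/29, CSSF in RCC_CIFR bit 8) and should be verified against the STM32G4 reference manual; `classify_nmi` and the `NMI_*` names are hypothetical. Compared with the handler above, it ignores the ECC correction flags and the clock-ready flags, which as far as I know do not raise an NMI.

```c
#include <stdint.h>

/* Assumed bit positions for STM32G4; verify against the reference manual. */
#define NMI_SYSCFG_CFGR2_SPF  (1UL << 8)   /* SRAM parity error flag */
#define NMI_FLASH_ECCR_ECCD   (1UL << 31)  /* ECC double-error detection, bank 1 */
#define NMI_FLASH_ECCR_ECCD2  (1UL << 29)  /* ECC double-error detection, bank 2 */
#define NMI_RCC_CIFR_CSSF     (1UL << 8)   /* HSE clock security system flag */

typedef enum {
    NMI_SRC_UNKNOWN,
    NMI_SRC_SRAM_PARITY,
    NMI_SRC_FLASH_ECC,
    NMI_SRC_CSS
} nmi_source_t;

/* Hypothetical helper: map raw register values to the first matching NMI source. */
static nmi_source_t classify_nmi(uint32_t cfgr2, uint32_t eccr, uint32_t cifr)
{
    if (cfgr2 & NMI_SYSCFG_CFGR2_SPF) {
        return NMI_SRC_SRAM_PARITY;
    }
    if (eccr & (NMI_FLASH_ECCR_ECCD | NMI_FLASH_ECCR_ECCD2)) {
        return NMI_SRC_FLASH_ECC;
    }
    if (cifr & NMI_RCC_CIFR_CSSF) {
        return NMI_SRC_CSS;
    }
    return NMI_SRC_UNKNOWN;
}
```

On target this would be called as `classify_nmi(SYSCFG->CFGR2, FLASH->ECCR, RCC->CIFR)` at the top of NMI_Handler.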

Contents of MSP and PSP when the "bkpt 8" has been hit:


x/128x $msp
0x2001fea8:	0x20004c70	0x2001fefc	0x00000008	0x00000000
0x2001feb8:	0x20004c6c	0x0800682b	0x08001560	0x21000025
0x2001fec8:	0x200049a8	0x00000000	0xffffffff	0x08006d95
0x2001fed8:	0x00000000	0x40006400	0x2221201f	0x2001ff3c
0x2001fee8:	0x00000000	0x4000a4b0	0x00000000	0x08008de1
0x2001fef8:	0x00000000	0x00000000	0x00000601	0x3c080000
0x2001ff08:	0x1f1e1d1c	0x00222120	0xffffffff	0x20000170
0x2001ff18:	0x00000000	0x00000101	0x00000000	0xa5a5a5a5
0x2001ff28:	0x2001b760	0x2001b820	0x2001b860	0x08008ec1
0x2001ff38:	0xffffffff	0x00000000	0x00000000	0xffffffe1
0x2001ff48:	0x200135ec	0x2001371c	0x20013718	0x2000034c
0x2001ff58:	0x200001c8	0xffffffed	0x08007c36	0x6100700e
0x2001ff68:	0x00000601	0x3b080000	0x18171615	0x081b1a19
0x2001ff78:	0x2000034c	0x20000178	0x200001b0	0x200005c0
0x2001ff88:	0x200001b0	0x00000040	0x00000000	0x200005c0
0x2001ff98:	0x200001b0	0x20000198	0x200001b4	0x08007c55
0x2001ffa8:	0x200025d4	0x2000034c	0x00000000	0x00f00000
0x2001ffb8:	0x00000000	0xc0000000	0x08013593	0x08007db5
0x2001ffc8:	0x08007b0c	0x61000000	0x00000000	0x00000000
0x2001ffd8:	0x00000000	0x00000000	0x00000000	0x00000000
0x2001ffe8:	0x00000000	0x00000000	0x00000000	0x00000000
0x2001fff8:	0x00000010	0x08000895


x/128x $psp
0x20013650 <ucHeap+78504>:	0x00000000	0x2001371c	0x10000000	0xe000e000
0x20013660 <ucHeap+78520>:	0x200001c8	0x08005a53	0x08005d80	0x61000000
0x20013670 <ucHeap+78536>:	0x00000000	0x00000000	0x00000000	0x00000000
0x20013680 <ucHeap+78552>:	0x00000000	0x00000000	0x00000000	0x00000000
0x20013690 <ucHeap+78568>:	0x00000000	0x00000000	0x00000000	0x00000000
0x200136a0 <ucHeap+78584>:	0x00000000	0x00000000	0x00200000	0x00600000
0x200136b0 <ucHeap+78600>:	0x00000000	0x08005dc7	0x2001b820	0x0801146f
0x200136c0 <ucHeap+78616>:	0xa5a5a5a5	0xa5a5a5a5	0xa5a5a5a5	0xa5a5a5a5
0x200136d0 <ucHeap+78632>:	0xa5a5a5a5	0xa5a5a5a5	0xa5a5a5a5	0xfffffffd
0x200136e0 <ucHeap+78648>:	0x00000000	0xa5a5a5a5	0xa5a5a5a5	0xa5a5a5a5
0x200136f0 <ucHeap+78664>:	0xa5a5a5a5	0xa5a5a5a5	0xa5a5a5a5	0x08007b29
0x20013700 <ucHeap+78680>:	0xa5a5a5a5	0xa5a5a5a5	0x00000000	0x00000000
0x20013710 <ucHeap+78696>:	0x00000000	0x80000068	0x200135ec	0x000047ad
0x20013720 <ucHeap+78712>:	0x20000200	0x20000200	0x20013718	0x200001f8
0x20013730 <ucHeap+78728>:	0x0000000b	0x00000000	0x00000000	0x20013718
0x20013740 <ucHeap+78744>:	0x00000000	0x00000005	0x20011708	0x45504d49
0x20013750 <ucHeap+78760>:	0x00000058	0x00000000	0x00000008	0x0000000d
0x20013760 <ucHeap+78776>:	0x00000005	0x00000000	0x00000708	0x00000000
0x20013770 <ucHeap+78792>:	0x00000000	0x00000000	0x200183a0	0x00004c28
0x20013780 <ucHeap+78808>:	0x00000000	0x00000000	0x00000000	0x00000000
0x20013790 <ucHeap+78824>:	0x00000000	0x00000000	0x00000000	0x00000000
0x200137a0 <ucHeap+78840>:	0x00000000	0x00000000	0x00000000	0x00000000
0x200137b0 <ucHeap+78856>:	0x00000000	0x00000000	0x00000000	0x00000000
0x200137c0 <ucHeap+78872>:	0x00000000	0x00000000	0x00000000	0x00000000
0x200137d0 <ucHeap+78888>:	0x00000000	0x00000000	0x00000000	0x00000000
0x200137e0 <ucHeap+78904>:	0x00000000	0x00000000	0x00000000	0x00000000
0x200137f0 <ucHeap+78920>:	0x00000000	0x00000000	0x00000000	0x00000000
0x20013800 <ucHeap+78936>:	0x00000000	0x00000000	0x00000000	0x00000000
0x20013810 <ucHeap+78952>:	0x00000000	0x00000000	0x00000000	0x00000000
0x20013820 <ucHeap+78968>:	0x00000000	0x00000000	0x00000000	0x00000000
0x20013830 <ucHeap+78984>:	0x00000000	0x00000000	0x00000000	0x00000000
0x20013840 <ucHeap+79000>:	0x00000000	0x00000000	0x00000000	0x00000000

The CAN Rx interrupt is the only interrupt enabled; I have even disabled the tick interrupt. I have also disabled CAN Tx interrupts, since Tx messages should be so infrequent that the FIFO/queue should never be full. I still have a few different tasks running in order to service the CAN messages.

Our FreeRTOSConfig.h is attached.

So, my question is split in two:

1. Is it normal to allow and service an interrupt while doing a task switch? Can this mess up the state of the processor?
2. What can cause the NMI in the CAN Rx interrupt? From my understanding the stacks look kind of OK, but I don't understand all of the contents.

TsEor59 · ‎2025-03-10

Hello Andreas,

Did you find a solution to this problem?

I currently have exactly the same issue on an STM32H573 using FreeRTOS.

My code ends up stuck in NMI_Handler, each time from a different function, but always one linked to FreeRTOS or an interrupt:

For debugging purposes, I print the contents of some SCB (System Control Block) registers to my console in order to understand the origin of the NMI.


void NMI_Handler(void)
{
  /* USER CODE BEGIN NonMaskableInt_IRQn 0 */
  printf("NMI Occurred!\n");
  printf("ICSR:  0x%08lX\n", SCB->ICSR);
  printf("CFSR:  0x%08lX\n", SCB->CFSR);
  printf("HFSR:  0x%08lX\n", SCB->HFSR);
  printf("BFAR:  0x%08lX\n", SCB->BFAR);
  printf("SHCSR: 0x%08lX\n", SCB->SHCSR);
  /* USER CODE END NonMaskableInt_IRQn 0 */
  /* USER CODE BEGIN NonMaskableInt_IRQn 1 */
   while (1)
  {
  }
  /* USER CODE END NonMaskableInt_IRQn 1 */
}

In my case, I get the following result:

I read SHCSR = 0x00000020 as meaning that the PendSV exception is active.
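A small decoding sketch for that SHCSR value, assuming the Armv8-M Mainline bit layout used by the Cortex-M33 in the STM32H573. As far as I can tell, in that layout bit 5 (0x20) is NMIACT and PENDSVACT is bit 10 (0x400), so the reading above is worth double-checking against the Cortex-M33 documentation. The macro and function names are hypothetical local ones, chosen so the sketch builds without device headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed Armv8-M Mainline SHCSR "active" bits; verify against the core documentation. */
#define SHCSR_NMIACT     (1UL << 5)
#define SHCSR_SVCALLACT  (1UL << 7)
#define SHCSR_PENDSVACT  (1UL << 10)
#define SHCSR_SYSTICKACT (1UL << 11)

/* True if the NMI exception is marked active in the given SHCSR value. */
static bool shcsr_nmi_active(uint32_t shcsr)
{
    return (shcsr & SHCSR_NMIACT) != 0U;
}

/* True if the PendSV exception is marked active in the given SHCSR value. */
static bool shcsr_pendsv_active(uint32_t shcsr)
{
    return (shcsr & SHCSR_PENDSVACT) != 0U;
}
```

Under these assumptions, 0x00000020 would report the NMI itself as active rather than PendSV, which is what one would expect when reading the register from inside NMI_Handler.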

The following articles are also interesting :

Debugging Hard Fault & Other Exceptions on ARM Cortex-M3 and ARM Cortex-M4 microcontrollers - FreeRTOS™

How to debug a HardFault on an ARM Cortex-M MCU | Interrupt

Thank you in advance for your reply.

Pavel A. · ‎2025-03-10

So, in the FreeRTOSConfig.h configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY = 4.

What is the interrupt priority of the CAN handler and of any other handlers that use queues etc.?
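The rule behind this question can be sketched as follows. On Cortex-M ports, an interrupt that calls FreeRTOS "FromISR" functions must have a priority value numerically equal to or higher than configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY (numerically higher means logically lower priority). The value 4 is the one quoted above; the 4 implemented priority bits are an assumption for the STM32G4, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PRIO_BITS                4U  /* assumption: implemented priority bits on the MCU */
#define LIBRARY_MAX_SYSCALL_PRIO 4U  /* value quoted from FreeRTOSConfig.h in this thread */

/* Value written to BASEPRI while the kernel masks interrupts:
   the library priority shifted into the implemented upper bits. */
static uint32_t max_syscall_basepri(void)
{
    return (LIBRARY_MAX_SYSCALL_PRIO << (8U - PRIO_BITS)) & 0xFFU;
}

/* True if an ISR at this (unshifted) NVIC priority may call FromISR APIs. */
static bool isr_may_call_from_isr_api(uint32_t irq_prio)
{
    return irq_prio >= LIBRARY_MAX_SYSCALL_PRIO;
}
```

For example, a CAN Rx interrupt at priority 5 would pass this check, while one at priority 3 could preempt kernel critical sections and must not touch queues.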

AndreasLOPettersson · ‎2025-03-15

So I did find a solution. The problem was that the RAM was not properly initialized, so the SRAM parity check caused an NMI exception. The weird thing was that no error bit was set, though, so it took forever to find.

The problem arose because we had some structs that were not fully packed, i.e. they had padding holes in them. In the CAN interrupt we malloc'ed some of these and assigned all the members, but the padding was never initialized. Later we memcpy'd the full struct, padding included, and that triggered the parity error.
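To illustrate the kind of hole meant here, a minimal sketch with a hypothetical struct (`can_msg_t` is not the real type from our code). Assigning every member leaves the padding bytes untouched, but a memcpy of `sizeof(can_msg_t)` bytes still reads them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message type: the compiler typically inserts padding after
   'dlc' so that 'id' is aligned to 4 bytes. */
typedef struct {
    uint8_t  dlc;
    uint32_t id;
    uint8_t  data[8];
} can_msg_t;

/* Number of bytes in the struct that belong to no member. */
static size_t can_msg_padding_bytes(void)
{
    return sizeof(can_msg_t) - (sizeof(uint8_t) + sizeof(uint32_t) + 8U);
}
```

With the usual 4-byte alignment of `uint32_t` this reports 3 padding bytes, which is exactly the memory that member-by-member assignment never writes.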

Why this occurred only in the CAN interrupt is mainly because that is where we had the struct with holes, and probably because it sometimes used heap memory that no other task had used before.

Several things could be done to solve the issue. One solution would be to always memset the entirety of a malloc'ed struct. What we chose to do was to set all of the SRAM to 0 at startup, so that we never have to think about initialization at multiple sites.
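Both options can be sketched roughly as below. This is an illustration, not our actual code: `alloc_zeroed` and `zero_region` are hypothetical names, standard `malloc` stands in for whatever allocator is used (pvPortMalloc in a FreeRTOS build), and the bounds passed to `zero_region` would come from linker symbols that are not shown here.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Option 1: zero every allocation, so padding bytes are written before
   anything can read them. */
static void *alloc_zeroed(size_t size)
{
    void *p = malloc(size);
    if (p != NULL) {
        memset(p, 0, size);
    }
    return p;
}

/* Option 2: word-wise fill of a memory region, as a startup routine might
   do for the whole SRAM. 'end' is one past the last word to clear. */
static void zero_region(volatile uint32_t *start, volatile uint32_t *end)
{
    while (start < end) {
        *start++ = 0U;
    }
}
```

The second option has to run early in startup and must not wipe memory that is already in use, such as the active stack, so where exactly it is called is left open here.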

Hope this can help someone else!