HardFault, possibly from a stack overflow, but the breakpoint in HardFault_Handler() is never hit

Thoufiel · ‎2024-11-18

I am using the STM32H562VGT6 and IAR Workbench.

I have set up a 4kHz interrupt with LPTIM3 and started transmitting 32 bytes of data using HAL_SPI_Transmit_DMA.

If I break and resume execution after starting the timer interrupt, the SP becomes 0xffff ffd8, causing a HardFault error. The call stack only contains (Exception frame), making it untraceable. The stack area is from 0x2000'0f90 to 0x2000'4f8f.

Although HardFault_Handler() is defined in stm32h5xx_it.c, a breakpoint placed there is not hit when the HardFault occurs.
Breakpoints in MemManage_Handler and BusFault_Handler are not hit either.

Commenting out the HAL_SPI_Transmit_DMA call in the 4 kHz interrupt prevents the HardFault. However, when I added the following macro inside the HAL API to check whether the SP is out of range, it never triggered.

#define CHECK_MSP() do { \
    uint32_t msp; \
    __asm volatile ("MRS %0, msp" : "=r" (msp)); \
    if (msp < 0x20000f80 || 0x20004f7f < msp) { \
        __BKPT(0); \
    } \
} while (0)

How can I capture the cause of this issue?

Tesla DeLorean · ‎2024-11-18

So CPP? Perhaps issues with stack or heap or constructors.

Not sure how SP goes into 0xFFFFxxxx space. Perhaps PSP/MSP initialization.

Or double faulting.

Perhaps MemManage Handler?

Can you fish in the primary stack for a context frame for a real PC/LR?
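A rough sketch of what that fishing could look like, assuming you have copied a block of stack memory out of the debugger. On exception entry a Cortex-M core pushes eight words (R0-R3, R12, LR, PC, xPSR), so a plausible frame has the Thumb bit set in xPSR and a halfword-aligned PC inside the code region. The names exc_frame_t, frame_plausible and find_frame are hypothetical, and the code range passed in is whatever matches your own link map.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Basic exception frame pushed on exception entry (no FPU state). */
typedef struct {
    uint32_t r0, r1, r2, r3, r12;
    uint32_t lr;    /* LR of the interrupted code */
    uint32_t pc;    /* address the core would return to */
    uint32_t xpsr;
} exc_frame_t;

/* Heuristic only: Thumb bit (xPSR bit 24) set and an even PC inside
   the code region [code_lo, code_hi). */
static bool frame_plausible(const exc_frame_t *f,
                            uint32_t code_lo, uint32_t code_hi)
{
    bool thumb = (f->xpsr & (1u << 24)) != 0u;
    bool pc_ok = (f->pc >= code_lo) && (f->pc < code_hi) &&
                 ((f->pc & 1u) == 0u);
    return thumb && pc_ok;
}

/* Scan a copy of the stack word by word. Returns the word index of the
   first plausible frame, or -1 if none is found. */
static ptrdiff_t find_frame(const uint32_t *words, size_t n,
                            uint32_t code_lo, uint32_t code_hi)
{
    for (size_t i = 0; i + 8u <= n; ++i) {
        exc_frame_t f;
        memcpy(&f, &words[i], sizeof f);
        if (frame_plausible(&f, code_lo, code_hi)) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}
```

A hit is only a candidate: ordinary data can look like a frame, so check the PC it gives against the disassembly.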

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

unsigned_char_array · ‎2024-11-19

Have you checked alignment requirements for the DMA? Sometimes source/destination arrays have to be aligned at more than 4 bytes.
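A minimal sketch of how you could check this at run time. The helper is_aligned and the buffer spi_tx_buf are hypothetical names, and the 32-byte alignment is only an illustration; the real requirement depends on the DMA configuration and should be taken from the reference manual.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when p is aligned to `align` bytes; align must be a power of two. */
static bool is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (uintptr_t)(align - 1u)) == 0u;
}

/* Hypothetical 32-byte transmit buffer, forced to 32-byte alignment (C11). */
static _Alignas(32) uint8_t spi_tx_buf[32];
```

An assert on is_aligned(buffer, n) just before the DMA call would at least rule this cause in or out.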

I don't know what happens in

m_motors[id].proceed()

Are you sure it doesn't crash there?
Have you tried toggling spare GPIO pins at certain points in the code, so you can check with a logic analyzer when and where the code crashes?

(Also please use stdint.h int32_t instead of "int" if you intend a specific size integer (such as for clkTbl), it makes code more readable and portable.)

Kudo posts if you have the same problem and kudo replies if the solution works.
Click "Accept as Solution" if a reply solved your problem. If no solution was posted please answer with your own.

Thoufiel · ‎2024-11-19

-Step Execution

I have tried stepping through HAL_SPI_Transmit_DMA at the assembler level using the disassembler, and it exits the function successfully.
After pausing and resuming, I stepped through HAL_SPI_Transmit_DMA this way several times, and each time it exited the function without error.
(If I remove the breakpoint and resume debugging as is, the error occurs.)
If I stopped in the interrupt dozens of times, I might be able to find the code that causes the problem, but I have not tried that.

-primary stack for a context frame for a real PC/LR

Does "primary stack for a context frame for a real PC/LR" here mean the PC/LR values immediately before the error?
Unfortunately, I have not identified the direct cause of the error, so I have not been able to check those values.

The register values after the error are in a previous post in this thread.

-m_motors[id].proceed()

I added an assert that fires when the value of step is out of range for clkTbl, but it never triggered, so I do not think this part is the problem.

I have not verified using GPIO, so I will consider it.

Thoufiel · ‎2024-11-19

As it turns out, no progress has been made...

--------------------------------------------------------------------------------------------
I enabled the fault handlers at the beginning of the program,

int main(void)
{
    SCB->SHCSR |= SCB_SHCSR_USGFAULTENA_Msk
               |  SCB_SHCSR_BUSFAULTENA_Msk
               |  SCB_SHCSR_MEMFAULTENA_Msk;

and set a breakpoint in each handler -> no hit

Attached are the CFSR results when the error occurs.
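For reading a raw CFSR value, a small decoder can help. This is a minimal sketch and not part of my project: cfsr_flags and cfsr_decode are hypothetical names, and the bit positions follow the Armv8-M fault status layout as I understand it (STKOF exists on Armv8-M Mainline cores such as the Cortex-M33), so they should be checked against the Arm documentation.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t    mask;
    const char *name;
} cfsr_flag_t;

/* CFSR = MMFSR (bits 0-7) | BFSR (bits 8-15) | UFSR (bits 16-31). */
static const cfsr_flag_t cfsr_flags[] = {
    { 1u << 0,  "IACCVIOL"    }, { 1u << 1,  "DACCVIOL"   },
    { 1u << 3,  "MUNSTKERR"   }, { 1u << 4,  "MSTKERR"    },
    { 1u << 5,  "MLSPERR"     }, { 1u << 7,  "MMARVALID"  },
    { 1u << 8,  "IBUSERR"     }, { 1u << 9,  "PRECISERR"  },
    { 1u << 10, "IMPRECISERR" }, { 1u << 11, "UNSTKERR"   },
    { 1u << 12, "STKERR"      }, { 1u << 13, "LSPERR"     },
    { 1u << 15, "BFARVALID"   },
    { 1u << 16, "UNDEFINSTR"  }, { 1u << 17, "INVSTATE"   },
    { 1u << 18, "INVPC"       }, { 1u << 19, "NOCP"       },
    { 1u << 20, "STKOF"       }, { 1u << 24, "UNALIGNED"  },
    { 1u << 25, "DIVBYZERO"   },
};

/* Store the names of all set flags in out[] (at most max entries),
   lowest bit first, and return how many were stored. */
static size_t cfsr_decode(uint32_t cfsr, const char *out[], size_t max)
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof cfsr_flags / sizeof cfsr_flags[0]; ++i) {
        if ((cfsr & cfsr_flags[i].mask) != 0u && n < max) {
            out[n++] = cfsr_flags[i].name;
        }
    }
    return n;
}
```

The value can be read from SCB->CFSR in the debugger's register view and passed to the function on the host side.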

--------------------------------------------------------------------------------------------

unsigned_char_array · ‎2024-11-20

@Thoufiel wrote:
-Step Execution
I have tried stepping through HAL_SPI_Transmit_DMA at the assembler level using the disassembler, and it exits the function successfully.
After pausing and resuming, I stepped through HAL_SPI_Transmit_DMA this way several times, and each time it exited the function without error.
(If I remove the breakpoint and resume debugging as is, the error occurs.)

Tip for this type of problem: increase a counter (uint32_t counter) in the interrupt and when the error occurs read this count value. You can repeat this to see if it consistently fails at the same count. You can then set a counting breakpoint (a breakpoint that will trigger after it has hit that breakpoint x times). If a counting breakpoint doesn't work you can make your own using an if statement with a breakable line of code in that block.
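A minimal sketch of such a hand-made counting breakpoint. The names isr_count, trap_count, trap_at and isr_checkpoint are hypothetical; call isr_checkpoint() at the top of your interrupt handler and set trap_at to the count you read after a failure.

```c
#include <stdint.h>

static volatile uint32_t isr_count;   /* total ISR entries so far */
static volatile uint32_t trap_count;  /* times the breakable line ran */
static uint32_t trap_at;              /* 0 = disabled */

/* Call once per interrupt. */
static void isr_checkpoint(void)
{
    isr_count++;
    if (trap_at != 0u && isr_count == trap_at) {
        trap_count++;   /* put the breakpoint on this line */
    }
}
```

The trap_count increment is only there to give the debugger a line with real code on it that the optimizer cannot remove.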

Thoufiel · ‎2024-11-20

Progress on the problem:

This time the program was set to start from 0x0803 0000 (= Area B).
After placing the same program in 0x0800 0000 (=Area A), I started the program from Area B and repeated the pause and resume in the debugger.

After resuming, the message “CPU status reset” sometimes appears.
When I then pause the program again, the PC is below 0x0803 0000, and no breakpoint in the IDE has been hit.

At the time of my first report, Area A was filled with 0xff, because I had started the program after erasing the entire Flash.
I now think that a reset caused execution to start from Area A, which led to the exception error because of abnormal register values.
I believe this also explains why the breakpoint in HardFault_Handler, which is located in Area B, was never hit.

The breakpoint in ResetHandler is hit when the program starts in Area B, but not when the CPU reset occurs.
(I assume it probably jumps to the ResetHandler in Area A.)

I would like to know if you have any ideas on a better way to find out why the CPU resets, or why it does not jump to the ResetHandler in Area B when it resets.

unsigned_char_array · ‎2024-11-20

@Thoufiel wrote:
I would like to know if you have any ideas on a better way to find out why the CPU resets, or why it does not jump to the ResetHandler in Area B when it resets.

You can read the reset cause from various registers in the CPU. You can use __HAL_RCC_GET_FLAG to check various flags (you have to check them in a certain order since multiple flags can be set.)
The reset jump address is inside the interrupt vector table. If you erase flash it has an invalid address.
You can flash different areas without erasing all flash and you can debug without flashing (just make sure the flashed binary matches the ELF).
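As a sketch of the "certain order" part: the function below decodes a raw copy of the reset status register, checking the specific causes first and the pin flag last, since an internal reset typically sets the pin flag as well. The macro names and reset_cause are hypothetical, and the bit positions are my assumption of the STM32H5 RCC_RSR layout, so verify them against the reference manual. On the target you would normally use __HAL_RCC_GET_FLAG instead of raw masks.

```c
#include <stdint.h>

/* Assumed RCC_RSR flag positions; check the reference manual. */
#define RSR_PINRSTF   (1u << 26)
#define RSR_BORRSTF   (1u << 27)
#define RSR_SFTRSTF   (1u << 28)
#define RSR_IWDGRSTF  (1u << 29)
#define RSR_WWDGRSTF  (1u << 30)
#define RSR_LPWRRSTF  (1u << 31)

/* Return the most specific reset cause found in rsr. */
static const char *reset_cause(uint32_t rsr)
{
    if (rsr & RSR_LPWRRSTF) return "low-power";
    if (rsr & RSR_WWDGRSTF) return "window watchdog";
    if (rsr & RSR_IWDGRSTF) return "independent watchdog";
    if (rsr & RSR_SFTRSTF)  return "software";
    if (rsr & RSR_BORRSTF)  return "brown-out";
    if (rsr & RSR_PINRSTF)  return "NRST pin";
    return "unknown";
}
```

Read the register as early as possible after reset and clear the flags afterwards, otherwise they accumulate across resets.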

Thoufiel · ‎2024-11-21

Thank you for your advice.

Debugging a program placed at 0x0800 0000 (Area A) also caused a CPU reset, so I checked the RCC flags in ResetHandler.

After the CPU reset, bit 29 (IWDGRSTF) and bit 26 (PINRSTF) of RSR were set.

I probably missed this earlier because execution never passed the breakpoint I had set for the IWDG reset before the CPU reset occurred.

In my initial investigation I concluded that WatchDog was irrelevant, but I was wrong...

This would explain why the CPU reset occurs only when pausing and resuming in the debugger, without any problem during normal operation (since I had not enabled DBG_IWDG_STOP).

It seems that enabling DBG_IWDG_STOP requires enabling the TrustZone setting, but considering the impact on operation, I would like to take other steps if possible.

Any ideas?

unsigned_char_array · ‎2024-11-21

@Thoufiel wrote:
It seems that enabling DBG_IWDG_STOP requires enabling the TrustZone setting, but considering the impact on operation, I would like to take other steps if possible.

Any ideas?

Kicking/feeding the dog so it doesn't sleep/starve.

Tesla DeLorean · ‎2024-11-21

An invalid SP is going to cause near-immediate failure and will make the cause very difficult to pinpoint. You might need checkpoint values or an output character stream to review the dynamic flow.

SP is most likely to break at a context change for an RTOS or at a Handler return.

Errant DMA can trash whole environment and can't be breakpointed. Always range check the configuration on memory writes.
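A minimal sketch of such a range check, to be called before a DMA transfer is started. The name dma_range_ok is hypothetical and the region bounds are whatever your linker map says is safe to touch, not fixed values for this part.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when [addr, addr + len) lies entirely inside [region_lo, region_hi).
   Rejects zero length and is written so the arithmetic cannot wrap. */
static bool dma_range_ok(uint32_t addr, size_t len,
                         uint32_t region_lo, uint32_t region_hi)
{
    if (len == 0u || addr < region_lo || addr >= region_hi) {
        return false;
    }
    return len <= (size_t)(region_hi - addr);
}
```

Trap or log when it returns false, before the transfer is armed; once the DMA has written to the wrong place there is nothing left to breakpoint.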
