Hard faults, but not really == impossible to debug

YvanR · 2023-05-08

In a current PCB design, SWD debugging is very erratic. Stepping through the code line-by-line works fine, but executing multiple lines will sometimes result in an apparent crash with a call stack like

0x0806130

0xfffffff9

which doesn't make any sense for a few reasons:

If this were actually a hard fault, the hard fault handler would execute.
The location of the hard fault handler is intact in the vector table, and the handler itself is in memory. There has not been any corruption; I can see it in the disassembler.
The location that is pointed to by the last stack entry is outside the range of the code footprint in RAM/FLASH.

So I thought maybe SWD comms are sketchy, but I tried slowing them right down to 140 kb/s and they aren't any more reliable.

Has anyone seen this behaviour?

Bob S · 2023-05-08

Which CPU? RTOS?

How do you know it is (supposed to be) a hard fault? There are lots of other fault types. The 0xfffffff9 is indeed an exception return (EXC_RETURN) value. If that is the stack contents, then what is the PC? And what do the fault registers tell you? Either dump them or use the CubeIDE fault analyzer (presuming you are using CubeIDE).
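To make the EXC_RETURN point concrete, here is a minimal sketch that decodes such a value on a host PC. It assumes the ARMv7-M encoding used by the Cortex-M4 (bit 4 clear means an extended FPU frame, bit 3 set means return to Thread mode, bit 2 set means the process stack). The names `exc_return_info`, `decode_exc_return` and `exc_return_stack_name` are made up for illustration and are not part of CMSIS or any ST library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: the fields of an ARMv7-M EXC_RETURN value. */
typedef struct {
    bool valid;        /* bits [31:5] are all ones */
    bool thread_mode;  /* bit 3: 1 = return to Thread mode, 0 = Handler mode */
    bool uses_psp;     /* bit 2: 1 = frame is on PSP, 0 = frame is on MSP */
    bool fpu_frame;    /* bit 4: 0 = extended frame with FPU state */
} exc_return_info;

/* Split an LR value seen inside an exception handler into its fields. */
exc_return_info decode_exc_return(uint32_t lr)
{
    exc_return_info info;
    info.valid       = (lr & 0xFFFFFFE0u) == 0xFFFFFFE0u;
    info.thread_mode = (lr & (1u << 3)) != 0;
    info.uses_psp    = (lr & (1u << 2)) != 0;
    info.fpu_frame   = (lr & (1u << 4)) == 0;
    return info;
}

/* Name the stack that holds the exception frame. */
const char *exc_return_stack_name(const exc_return_info *info)
{
    assert(info != NULL);
    if (!info->valid) {
        return "not an EXC_RETURN value";
    }
    return info->uses_psp ? "PSP" : "MSP";
}
```

Under these assumptions, 0xfffffff9 decodes as a return to Thread mode with the frame on the main stack and no FPU state, whereas 0x0806130 is not an EXC_RETURN value at all and has to be read as an ordinary address.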

If 0x0806130 is outside your (current) code area what is there? And getting there was probably due to corrupted stack or pointer.

If this is a custom board, double check the power supplies and decoupling caps. And, depending on the CPU, check VCAP_1 and VCAP_2, PDR_ON, BYPASS_REG, BOOT0, etc.

YvanR · 2023-05-09

Thanks for the guidance, Bob.

STM32F411CEU6

FreeRTOS

The Fault Analyzer is reporting a (FORCED) and (IBUSERR) with PC/PSP at 0x0.
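For readers who want to cross-check the Fault Analyzer against raw register dumps, the following is a rough host-side sketch that turns HFSR and CFSR values into flag names. The bit positions are the ARMv7-M ones as I understand them (FORCED is HFSR bit 30, IBUSERR is CFSR bit 8); the `fault_bit` type and the `describe_faults` function are hypothetical names for this example, and only a subset of flags is listed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical lookup entry: one status bit and its architectural name. */
typedef struct {
    uint32_t mask;
    const char *name;
} fault_bit;

/* HardFault Status Register (SCB->HFSR) flags. */
static const fault_bit HFSR_BITS[] = {
    {1u << 1,  "VECTTBL"},
    {1u << 30, "FORCED"},
    {1u << 31, "DEBUGEVT"},
};

/* Configurable Fault Status Register (SCB->CFSR) flags, subset only. */
static const fault_bit CFSR_BITS[] = {
    {1u << 0,  "IACCVIOL"},    /* MemManage: instruction access violation */
    {1u << 1,  "DACCVIOL"},    /* MemManage: data access violation */
    {1u << 8,  "IBUSERR"},     /* BusFault: instruction bus error */
    {1u << 9,  "PRECISERR"},   /* BusFault: precise data bus error */
    {1u << 10, "IMPRECISERR"}, /* BusFault: imprecise data bus error */
    {1u << 11, "UNSTKERR"},    /* BusFault: unstacking error */
    {1u << 12, "STKERR"},      /* BusFault: stacking error */
    {1u << 16, "UNDEFINSTR"},  /* UsageFault: undefined instruction */
    {1u << 17, "INVSTATE"},    /* UsageFault: invalid EPSR state */
    {1u << 18, "INVPC"},       /* UsageFault: invalid PC load */
    {1u << 25, "DIVBYZERO"},   /* UsageFault: divide by zero */
};

/* Append the name of every set bit to buf, separated by spaces. */
static void append_names(uint32_t value, const fault_bit *bits, size_t count,
                         char *buf, size_t len)
{
    for (size_t i = 0; i < count; ++i) {
        if ((value & bits[i].mask) == 0) {
            continue;
        }
        if (buf[0] != '\0') {
            strncat(buf, " ", len - strlen(buf) - 1);
        }
        strncat(buf, bits[i].name, len - strlen(buf) - 1);
    }
}

/* Write the names of all set HFSR and CFSR flags into buf (truncating). */
void describe_faults(uint32_t hfsr, uint32_t cfsr, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    append_names(hfsr, HFSR_BITS, sizeof HFSR_BITS / sizeof HFSR_BITS[0], buf, len);
    append_names(cfsr, CFSR_BITS, sizeof CFSR_BITS / sizeof CFSR_BITS[0], buf, len);
}
```

With `hfsr = 0x40000000` and `cfsr = 0x00000100`, the buffer ends up holding `FORCED IBUSERR`, which matches the pair of flags reported here.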

Unfortunately the fault is non-deterministic, which is the debugging challenge.

I can step up until osKernelStart(); but the breakpoints on the first lines of my tasks never get used.

The power supply, whether from the SWD debugger or the buck converter, is a very clean 3V3 with no discernible ripple. There are 4 decoupling caps, one on each side 2 mm from the chip, and VCAP is 4.7uF, 4 mm from the chip. BOOT0 is pulled low with 15k.

I did add a handful of testpoints for bed-of-nails testing, and I'm a bit concerned that these are adding parasitics and/or acting as antennae.

I'll do some more reading on ARM exceptions. This stuff is new to me, as in the past my exception handlers have been called reliably, easing debugging.

Thanks for the ideas.

Bob S · 2023-05-09

The "FORCED" flag means the original cause might have been a usage, bus or memory management fault. Those three faults are disabled by default on the F4 series and have to be explicitly enabled. Enable those and see if you get any better information.

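// Warning: hand-typed from memory, actual register names may be different
// Let any outstanding memory accesses complete before changing the fault setup.
__DSB();
__ISB();
// Enable the usage, bus and memory management fault handlers so these
// faults are no longer escalated to a (FORCED) hard fault.
SCB->SHCSR |= (SCB_SHCSR_USGFAULTENA_Msk | SCB_SHCSR_BUSFAULTENA_Msk | SCB_SHCSR_MEMFAULTENA_Msk);
// Make sure the write has taken effect before execution continues.
__DSB();
__ISB();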
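Once a fault handler does run, the registers the core pushed on exception entry give the faulting PC. As a hedged sketch, the function below reads that basic eight-word frame (R0 to R3, R12, LR, PC, xPSR, in that order on ARMv7-M without FPU state) from a stack pointer. `stacked_frame` and `read_stacked_frame` are names invented for this example. On the target, the pointer would have to come from MSP or PSP depending on bit 2 of the EXC_RETURN value, which needs a small assembly shim that is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the basic exception frame pushed by the core. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12;
    uint32_t lr;    /* LR of the interrupted code, not EXC_RETURN */
    uint32_t pc;    /* address the interrupted code would have resumed at */
    uint32_t xpsr;
} stacked_frame;

/* Copy the eight stacked words starting at sp into a named structure. */
stacked_frame read_stacked_frame(const uint32_t *sp)
{
    assert(sp != NULL);
    stacked_frame f = {
        .r0 = sp[0], .r1 = sp[1], .r2 = sp[2], .r3 = sp[3],
        .r12 = sp[4], .lr = sp[5], .pc = sp[6], .xpsr = sp[7],
    };
    return f;
}
```

A stacked PC that falls outside the flash and RAM footprint, like the address in the original call stack, would suggest that control flow was already lost before the fault was taken.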

YvanR · 2023-05-18

I spent a couple of hours digging into this, but the types and number of faults reported just didn't make sense. Through trial and error I determined that my original hunch was correct: the debugger simply was not acting reliably. When the code executed without SWD debugging, it behaved sanely. I can usually debug at 125 kb/s, but going any faster results in stops that look like faults caused by a corrupted vector table.

Had this firmware not already been running reliably on 50+ units, this probably would have driven me crazier than it did, chasing null pointers and hard faults. As it is, I've been able to get the code 75% ported to the new board layout (the rev was primarily a reassignment of peripheral pins to accommodate a new device), and these faults were not "real".

Many thanks to @Bob S for his ideas and support.

Pavel A. · 2023-05-19

Now is the time to ask: what is the debugger? An ST-LINKv3, a J-Link, or some old ST-LINKv2 dongle of unclear origin?