STM32H747 (Portenta H7) and HardFault Crash Report

tjaekel · 2024-05-08

My project crashes at random while handling network traffic, typically after an hour or more of normal operation. I want to debug this without keeping a debugger attached and waiting for the crash. So I want to record some data when the HardFault_Handler is called, to find out where in the code the fault happened.
Here is my approach:

If the HardFault_Handler is called, it stores the important info (e.g. LR, PC, XPSR, CFSR, ABFSR) in the RTC backup registers so that the data survives a reset. I am not sure whether SRAM contents, even in a section marked NOLOAD in the linker script, would survive: if the SRAMs are power cycled or reset, the data is lost. The RTC backup registers are not touched by a reset.
If the FW crashes (all dead, indicated by a flashing LED), I press reset, and once the FW is running again I print the Crash Report with the command "cr" via UART.

All works so far, just:

A HardFault can be synchronous (precise) or asynchronous (imprecise). If an asynchronous HardFault happens (and it does for me):

the reported PC is not correct (a PC "much" later is recorded)
and the recorded fault status (CFSR/ABFSR) tells me it was an IMPRECISERR (imprecise, i.e. "asynchronous")

More details can be found here:

https://interrupt.memfault.com/blog/cortex-m-hardfault-debug

So, my question:
How do I debug an "asynchronous" Hard Fault? How can I turn it into a "synchronous" event?
(I tried disabling the ICache, but it made no real difference: the PC reported as the crash location is still a bit later. It can already be inside another function call, and I have no idea where that call was made from.)

Here are my implementation details:

1. Cause a Hard Fault (on purpose):

#if 1
	//force a Hard Fault to check our "cr" command
	//volatile, so the compiler cannot drop or reorder the faulting store
	volatile unsigned long *addr = (volatile unsigned long *)(0x08000000 + 0x02000000);
	*addr = 0x11223344;					//write to invalid address
#if 1
	{
		int i;
		for (i = 0; i < 100; i++)
			__NOP();					//the Hard Fault Handler comes here, delayed, imprecise!
	}
#endif
#endif

2. Add a Hard Fault Handler and forward to the function recording the "stack frame":

    .section	.text.Default_Handler,"ax",%progbits
Default_Handler:
	TST 	LR, #4
  	ITE 	EQ
  	MRSEQ 	R0, MSP
  	MRSNE 	R0, PSP
  	B		HardFault_Handler_C
Infinite_Loop:
	b	Infinite_Loop
	.size	Default_Handler, .-Default_Handler

Remark: all handlers defined as "weak" share this Default_Handler, so it also runs on a HardFault.
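Because every weak handler lands in the same Default_Handler, the crash record alone does not say which exception actually fired. A minimal sketch of how that could be decoded, assuming the handler also saves SCB->ICSR (address 0xE000ED04), whose low 9 bits (VECTACTIVE) hold the number of the exception currently being handled. The function name active_exception_name is hypothetical and not part of the code in this post.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: turn a saved SCB->ICSR value into a readable name.
 * VECTACTIVE (bits 8:0) is the number of the active exception.
 * buf is only used for external interrupts (IRQn). */
const char *active_exception_name(uint32_t icsr, char *buf, size_t len)
{
    uint32_t num = icsr & 0x1FFu;

    switch (num) {
    case 0:  return "Thread mode (no exception)";
    case 2:  return "NMI";
    case 3:  return "HardFault";
    case 4:  return "MemManage";
    case 5:  return "BusFault";
    case 6:  return "UsageFault";
    case 11: return "SVCall";
    case 12: return "DebugMonitor";
    case 14: return "PendSV";
    case 15: return "SysTick";
    default: break;
    }
    if (num >= 16u) {
        /* Exception numbers 16 and up are device interrupts: IRQ0, IRQ1, ... */
        snprintf(buf, len, "IRQ%lu", (unsigned long)(num - 16u));
        return buf;
    }
    return "reserved";
}
```

With this, the "cr" output could distinguish a real HardFault from, say, an unhandled peripheral interrupt that merely ended up in the shared handler.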

3. The HardFault_Handler_C:

void __USED HardFault_Handler_C(unsigned long *hardfault_args)
{
	uint32_t *rtcBkpReg = (uint32_t *)&RTC_START_BKP_REG;
	rtcBkpReg += 15;			//skip the syscfg
	//stacked_r0 	= hardfault_args[0];
	//stacked_r1 	= hardfault_args[1];
	//stacked_r2 	= hardfault_args[2];
	//stacked_r3 	= hardfault_args[3];
	//stacked_r12 	= hardfault_args[4];

	*rtcBkpReg++ = hardfault_args[5];						//LR
	*rtcBkpReg++ = hardfault_args[6];						//PC
	*rtcBkpReg++ = hardfault_args[7];						//XPSR
	*rtcBkpReg++ = *((volatile unsigned long *)0xE000ED28);	//CFSR
	*rtcBkpReg++ = *((volatile unsigned long *)0xE000EFA8);	//ABFSR
	//*rtcBkpReg = *((volatile unsigned long *)0xE000ED2C);	//HFSR ?

	////NVIC_SystemReset();

	while (1) {
		__NOP();
	}
}

Remark: I store some data from the stack frame in the RTC backup registers so that it survives a press of the Reset button (but not a power cycle: the RTC backup registers are cleared if power is interrupted for long enough).
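One gap in this scheme is that, after a reset, the "cr" command cannot tell a fresh crash record from stale or never-written register contents. A sketch of one way to handle that, assuming one extra backup register is spent on a marker word. CRASH_MAGIC, crash_record_t and both function names are hypothetical; regs stands in for the backup-register pointer used above, and frame for hardfault_args.

```c
#include <stdbool.h>
#include <stdint.h>

#define CRASH_MAGIC 0xC0DEFA17u   /* arbitrary marker value (assumption) */

/* Same fields, in the same order, as stored by HardFault_Handler_C. */
typedef struct {
    uint32_t lr, pc, xpsr, cfsr, abfsr;
} crash_record_t;

/* Save the record. The marker is written last, so a record that was only
 * partly written is never reported as valid. */
void crash_record_save(volatile uint32_t *regs, const uint32_t *frame,
                       uint32_t cfsr, uint32_t abfsr)
{
    regs[1] = frame[5];   /* stacked LR   */
    regs[2] = frame[6];   /* stacked PC   */
    regs[3] = frame[7];   /* stacked xPSR */
    regs[4] = cfsr;
    regs[5] = abfsr;
    regs[0] = CRASH_MAGIC;
}

/* Returns true and fills *out if a record is present. The marker is cleared
 * afterwards, so each crash is reported only once. */
bool crash_record_take(volatile uint32_t *regs, crash_record_t *out)
{
    if (regs[0] != CRASH_MAGIC)
        return false;

    out->lr    = regs[1];
    out->pc    = regs[2];
    out->xpsr  = regs[3];
    out->cfsr  = regs[4];
    out->abfsr = regs[5];
    regs[0] = 0;
    return true;
}
```

Clearing the marker on read is a design choice; keeping it until an explicit "clear" command would work just as well if the record should stay visible across several resets.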

So, after Reset I just print what was recorded (via command "cr"):

void SYSCFG_printCrashInfo(EResultOut out)
{
	uint32_t *rtcBkpReg = (uint32_t *)&RTC_START_BKP_REG;
	rtcBkpReg += 15;			//skip the syscfg
	/* we print the RTC backup registers */
	print_log(out, " 0 : 0x%08lx\r\n", *rtcBkpReg++);
	print_log(out, " 1 : 0x%08lx\r\n", *rtcBkpReg++);
	print_log(out, " 2 : 0x%08lx\r\n", *rtcBkpReg++);
	print_log(out, " 3 : 0x%08lx\r\n", *rtcBkpReg++);
	print_log(out, " 4 : 0x%08lx\r\n", *rtcBkpReg++);
}
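The raw hex values are hard to read at a glance, so the CFSR entry could be decoded into flag names before printing. A minimal sketch, using the CFSR bit positions from the ARMv7-M architecture (MMFSR in bits 7:0, BFSR in bits 15:8, UFSR in bits 31:16). The names cfsr_flag_t, cfsr_flags and cfsr_decode are hypothetical helpers, not part of the firmware shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t mask;
    const char *name;
} cfsr_flag_t;

static const cfsr_flag_t cfsr_flags[] = {
    /* MemManage fault status (bits 7:0) */
    { 1u << 0,  "IACCVIOL" },   { 1u << 1,  "DACCVIOL" },
    { 1u << 3,  "MUNSTKERR" },  { 1u << 4,  "MSTKERR" },
    { 1u << 5,  "MLSPERR" },    { 1u << 7,  "MMARVALID" },
    /* BusFault status (bits 15:8) */
    { 1u << 8,  "IBUSERR" },    { 1u << 9,  "PRECISERR" },
    { 1u << 10, "IMPRECISERR" },{ 1u << 11, "UNSTKERR" },
    { 1u << 12, "STKERR" },     { 1u << 13, "LSPERR" },
    { 1u << 15, "BFARVALID" },
    /* UsageFault status (bits 31:16) */
    { 1u << 16, "UNDEFINSTR" }, { 1u << 17, "INVSTATE" },
    { 1u << 18, "INVPC" },      { 1u << 19, "NOCP" },
    { 1u << 24, "UNALIGNED" },  { 1u << 25, "DIVBYZERO" },
};

/* Writes the names of all set flags, space separated, into buf.
 * Returns the number of names written in full. */
int cfsr_decode(uint32_t cfsr, char *buf, size_t len)
{
    int count = 0;
    size_t used = 0;

    if (len > 0)
        buf[0] = '\0';

    for (size_t i = 0; i < sizeof cfsr_flags / sizeof cfsr_flags[0]; i++) {
        if ((cfsr & cfsr_flags[i].mask) == 0)
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         count ? " " : "", cfsr_flags[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;              /* buffer full: stop, output is truncated */
        used += (size_t)n;
        count++;
    }
    return count;
}
```

For example, a recorded CFSR of 0x00000400 has only bit 10 set, which decodes to IMPRECISERR, while 0x00008200 decodes to PRECISERR plus BFARVALID (in which case BFAR at 0xE000ED38 holds the faulting address and would be worth recording too).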

All works fine in the debugger: when I set a breakpoint on the faulting code and step through it, everything looks reasonable and correct.

But when I run at full speed (without debugger), the recorded PC is completely different and does not help me find the faulting location.
I can make it "more reasonable" with the __NOP() loop right after the faulting instruction: the PC is then reported inside the __NOP() loop. This is a clear indication that the HardFault is "asynchronous": it arrives "much" later, and the recorded PC is not the one that caused it.
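A possible way to narrow this down, given as an untested sketch rather than a confirmed fix: a DSB instruction does not complete until all earlier memory accesses have completed, so placing one directly after a suspect store should cause the bus error to be raised at or right after the barrier instead of many instructions later. The fault would most likely still be flagged as imprecise, but the stacked PC should then point close to the store. The helper name store32_sync and the macro SYNC_BARRIER are hypothetical, and the inline assembly assumes GCC or Clang.

```c
#include <stdint.h>

/* On ARM targets emit a real DSB; on other targets (e.g. a host build for
 * unit tests) fall back to a plain compiler barrier. */
#if defined(__arm__) || defined(__thumb__)
#define SYNC_BARRIER() __asm volatile ("dsb 0xF" ::: "memory")
#else
#define SYNC_BARRIER() __asm volatile ("" ::: "memory")
#endif

/* Hypothetical debug helper: store a word, then wait until the write has
 * actually left the write buffer before executing anything else. */
static inline void store32_sync(volatile uint32_t *addr, uint32_t value)
{
    *addr = value;
    SYNC_BARRIER();
}
```

In the test case above, replacing `*addr = 0x11223344;` with `store32_sync(...)` on the same address would show whether the recorded PC moves from the __NOP() loop back to just after the store. For the real crash, the same helper could be applied temporarily to the stores in the suspected network code paths.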

How can I make the HardFault a synchronous event?