STM32F7 Bus Fault with address in nowhere-land

LCE · 2022-10-04

Heyho,

After having found the PHY problem, I can now stream some audio via TCP.

But when I don't get the data out fast enough (e.g. TCP ACK takes too long so that there would be a buffer overflow), I stop the transfer and send some UART messages.

Without having changed anything significant, I suddenly get a Hard Fault in the midst of these UART messages.

I started the CubeIDE debugger and found that it is a Bus Fault at address 0x43817a3b, error PRECISERR:

CFSR = 0x8200	-> PRECISERR, BFAR valid
BFAR = 0x43817a3b
 
ARM:
A data bus error has occurred, and the PC value stacked for the exception return points to the instruction that caused the fault.
When the processor sets this bit to 1, it writes the faulting address to the BFAR.
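
For reference, the CFSR value can be decoded bit by bit. A minimal sketch, assuming the ARMv7-M layout where the bus fault status byte sits in CFSR bits 8..15; the macro and function names are made up for this example (CMSIS ships its own SCB_CFSR_* masks):

```c
#include <stdint.h>
#include <stdbool.h>

/* Bus fault status bits inside CFSR (BFSR occupies CFSR bits 8..15).
 * Names are hypothetical, chosen for this example only. */
#define CFSR_IBUSERR      (1u << 8)   /* instruction bus error */
#define CFSR_PRECISERR    (1u << 9)   /* precise data bus error */
#define CFSR_IMPRECISERR  (1u << 10)  /* imprecise data bus error */
#define CFSR_UNSTKERR     (1u << 11)  /* fault on exception return unstacking */
#define CFSR_STKERR       (1u << 12)  /* fault on exception entry stacking */
#define CFSR_BFARVALID    (1u << 15)  /* BFAR holds the faulting address */

/* True if the fault is a precise data bus error and BFAR can be trusted. */
static inline bool bfar_is_meaningful(uint32_t cfsr)
{
    return (cfsr & CFSR_PRECISERR) != 0 && (cfsr & CFSR_BFARVALID) != 0;
}
```

For CFSR = 0x8200 both bits 9 and 15 are set, so the load or store at the stacked PC really did try to access the address found in BFAR.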

STM32F767, custom board and Nucleo-144 same behavior

bare metal SAI DMAs -> ETH-DMA via lwIP TCP

streaming works with e.g. 25.6 Mbit/s

BIG DMA SAI buffers are used: 240 x 1440 B = 337.5 kB

The Hard Fault occurs somewhere after line 13 of the following code in main():

	if( u8a2ipActive == 1 &&
		((u32a2ipTxErrCnt > 100) || (u32ErrCntPktSkipped > 0)) )
	{
		uart_printf("# ERR: TCP Streaming\n\r");
		uart_printf("u32a2ipBnWrErrCnt   = %ld\n\r", u32a2ipBnWrErrCnt);
		uart_printf("u32a2ipBnFreeErrCnt = %ld\n\r", u32a2ipBnFreeErrCnt);
		uart_printf("u32ErrCntPktSkipped = %ld\n\r", u32ErrCntPktSkipped);
		if( psGlobTcpStreamState->pPcb != NULL )
		{
			uart_printf("pPcb->snd_queuelen = %d\n\r", (uint16_t)psGlobTcpStreamState->pPcb->snd_queuelen);
			uart_printf("pPcb->snd_buf = %d\n\r", (uint16_t)psGlobTcpStreamState->pPcb->snd_buf);
			uart_printf("pPcb->dupacks = %d\n\r", (uint16_t)psGlobTcpStreamState->pPcb->dupacks);
			uart_printf("pPcb->unsent->p->ref = %d\n\r", (uint16_t)psGlobTcpStreamState->pPcb->unsent->p->ref);
			uart_printf("pPcb->unacked->p->ref = %d\n\r", (uint16_t)psGlobTcpStreamState->pPcb->unacked->p->ref);
		}
		uart_printf("SAI/DMA stop\n\r");
		if( HAL_OK != SAI_RX_stop_DMA() ) uart_printf("# ERR: SAI_RX_stop_DMA()\n\r");
...

So, I actually do not understand ARM's error description,

and I don't know where that address is: according to the STM32F767 memory map, the addr 0x43817a3b is somewhere in "Reserved" between AHB1 and AHB2.

Maybe some SRAM borders are crossed due to the big DMA buffers? I guess that shouldn't be the cause, because then the hard fault would occur sooner and not after the buffers have been used thousands of times.
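
One possibility worth ruling out (an assumption, nothing in this thread confirms it): in lwIP, pPcb->unsent and pPcb->unacked are NULL whenever the respective queue is empty, and the two printouts of ->p->ref follow them without a check. A read through a NULL pointer does not necessarily fault on a Cortex-M, but it may return garbage that is then used as the next pointer, which could explain a changing, meaningless BFAR. A minimal sketch of a guarded access; the demo_* structs are hypothetical stand-ins for the few lwIP fields involved:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the lwIP fields used in the printouts; real
 * code would use struct tcp_pcb, struct tcp_seg and struct pbuf. */
struct demo_pbuf { uint8_t ref; };
struct demo_seg  { struct demo_pbuf *p; };
struct demo_pcb  { struct demo_seg *unsent; struct demo_seg *unacked; };

/* Returns the pbuf reference count of a queued segment, or -1 if the
 * queue is empty (NULL segment) or the segment carries no pbuf. */
static inline int seg_ref_or_minus1(const struct demo_seg *seg)
{
    if (seg == NULL || seg->p == NULL) {
        return -1;
    }
    return (int)seg->p->ref;
}
```

The printout would then pass seg_ref_or_minus1(pPcb->unsent) instead of dereferencing pPcb->unsent->p->ref directly, and a value of -1 in the log would show that the queue was empty at that moment.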

Anybody any ideas, please?

SMacl.1 · 2022-10-04

Can you supply a more complete register dump? Seeing LR and how it relates to PC is often useful. R7 ditto, the frame pointer. It is surprisingly easy to put a bad/confusing value in PC via simple stack corruption involving pushed LR and R7.

I have done some work in this area, if you are interested:

https://github.com/tobermory/faultHandling-cortex-m

LCE · 2022-10-04

Thanks for your reply!

Okay, now I still get:

CFSR = 0x8200	-> PRECISERR, BFAR valid
with a different address, still in "Reserved"
BFAR = 0x4bd95cc
 
The CPU registers:
LR = -23d / 0xffffffe9
SP = 0x2007ff20
R7 = 0x2007ff88
PC = 0x800e328 <HardFault_Handler>

... which doesn't tell me anything! Yet...

Never stop learning - so onto the ARM CPU stuff...
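
As a starting point: an LR of 0xffffffe9 inside a handler is not a code address but an EXC_RETURN value, and its low bits say which stack holds the faulting context. A minimal sketch, assuming the ARMv7-M bit assignments (bit 2 = stack used, bit 3 = return mode, bit 4 = frame type); the function names are made up for this example:

```c
#include <stdint.h>
#include <stdbool.h>

/* EXC_RETURN decoding (value found in LR while inside an exception handler). */

/* Bit 2 set: the interrupted code was using PSP; clear: MSP. */
static inline bool exc_return_used_psp(uint32_t lr)
{
    return (lr & (1u << 2)) != 0;
}

/* Bit 3 set: return goes to Thread mode; clear: to another handler. */
static inline bool exc_return_to_thread(uint32_t lr)
{
    return (lr & (1u << 3)) != 0;
}

/* Bit 4 set: basic 8-word frame; clear: extended frame with FPU state. */
static inline bool exc_return_basic_frame(uint32_t lr)
{
    return (lr & (1u << 4)) != 0;
}
```

For 0xffffffe9 this gives: Thread mode, main stack (MSP), extended frame with floating-point state. The faulting PC is therefore among the words the hardware pushed onto the main stack, not in the PC register seen inside the handler.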

Edit:

now it hangs in

BFAR = 0x12a1324b

another "Reserved" memory area

LCE · 2022-10-06

So... I think I found the mistake I made:

When changing the number of ETH descriptors, I needed to increase the reserved memory area in the linker script.

And I probably forgot to update / reduce the size of the "normal" SRAM:

STM32F767xITx_FLASH.ld
 
/* Specify the memory areas */
MEMORY
{
	FLASH (rx)		: ORIGIN = 0x08000000, LENGTH = 2048K
	RAM (xrw)		: ORIGIN = 0x20000000, LENGTH = 0x7BD00		/* ~500K */ <<<=== FORGOT to update / reduce
	Memory_B0(xrw)	: ORIGIN = 0x2007BD00, LENGTH = 0x00100		/* SRAM for no init vars */
	Memory_B1(xrw)	: ORIGIN = 0x2007BE00, LENGTH = 0x00600		/* SRAM for ETH RX descriptors */
	Memory_B2(xrw)	: ORIGIN = 0x2007C400, LENGTH = 0x02C00		/* SRAM for ETH TX descriptors */
}

Could this have been the reason for the hard fault?
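
Whether the regions collide can be checked with a few additions. A minimal sketch that takes the origins and lengths from the MEMORY block above and tests each region against the next one; the struct and function names are made up for this example:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

struct mem_region { const char *name; uint32_t origin; uint32_t length; };

/* Values copied from the MEMORY block above, in ascending address order. */
static const struct mem_region regions[] = {
    { "RAM",       0x20000000u, 0x7BD00u },
    { "Memory_B0", 0x2007BD00u, 0x00100u },
    { "Memory_B1", 0x2007BE00u, 0x00600u },
    { "Memory_B2", 0x2007C400u, 0x02C00u },
};
#define NUM_REGIONS (sizeof regions / sizeof regions[0])

/* First address past the end of a region. */
static inline uint32_t region_end(const struct mem_region *r)
{
    return r->origin + r->length;
}

/* True if any region runs into the one that follows it. */
static inline bool regions_overlap(const struct mem_region *r, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        if (region_end(&r[i]) > r[i + 1].origin) {
            return true;
        }
    }
    return false;
}
```

With the lengths exactly as listed, the four regions are contiguous and end at 0x2007F000. A larger RAM length than 0x7BD00 would make RAM reach into the no-init and descriptor areas, so ordinary variables and the ETH DMA descriptors could overwrite each other; that would be one plausible source of wild pointers, though it is not proven here.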

LCE · 2022-10-06

Anyway, I learned a lot again. Here's how I did some hard fault debugging without the debugger.

As suggested above by @SMacl.1, I save some registers to a no-init SRAM area at hard fault, then print them after reset.

parts of linker script:
/* Specify the memory areas */
MEMORY
{
	FLASH (rx)		: ORIGIN = 0x08000000, LENGTH = 2048K
	RAM (xrw)		: ORIGIN = 0x20000000, LENGTH = 0x7BD00		/* ~500K */
	Memory_B0(xrw)	: ORIGIN = 0x2007BD00, LENGTH = 0x00100		/* SRAM for no init vars */
	Memory_B1(xrw)	: ORIGIN = 0x2007BE00, LENGTH = 0x00600		/* SRAM for ETH RX descriptors */
	Memory_B2(xrw)	: ORIGIN = 0x2007C400, LENGTH = 0x02C00		/* SRAM for ETH TX descriptors */
}
 
SECTIONS
{
...
	.ARM.attributes 0 : { *(.ARM.attributes) }
	/* no init section */
	.NoInitSection (NOLOAD) : { *(.VarNoInitSection) } >Memory_B0
}

hard fault handler:

void HardFault_Handler(void)
{
	volatile uint32_t u32ResetDelay = 0;	/* volatile, so the delay loop below is not optimized away */
 
	/* save some registers */
 
	register uint32_t temp0 asm("r0");
	register uint32_t temp1 asm("r1");
	register uint32_t temp2 asm("r2");
	register uint32_t temp3 asm("r3");
	register uint32_t temp4 asm("r4");
	register uint32_t temp5 asm("r5");
	register uint32_t temp6 asm("r6");
	register uint32_t temp7 asm("r7");
 
	register uint32_t tempSP asm("sp");		/* stack pointer = R13 */
	register uint32_t tempLR asm("lr");		/* link register = R14 */
 
	/* program counter PC = R15 is a little harder to get;
	 * note: this reads the PC inside this handler, not the faulting PC */
	register uint32_t tempPC;
	asm volatile("mov %0, pc" : "=r"(tempPC));
 
	u32FaultCFSR 	= SCB->CFSR;
	u32FaultMMFAR 	= SCB->MMFAR;
	u32FaultBFAR 	= SCB->BFAR;
 
	u32FaultCpuR[0] = temp0;
	u32FaultCpuR[1] = temp1;
	u32FaultCpuR[2] = temp2;
	u32FaultCpuR[3] = temp3;
	u32FaultCpuR[4] = temp4;
	u32FaultCpuR[5] = temp5;
	u32FaultCpuR[6] = temp6;
	u32FaultCpuR[7] = temp7;
 
	u32FaultCpuSP = tempSP;
	u32FaultCpuLR = tempLR;
	u32FaultCpuPC = tempPC;
 
	u32FaultEthTxErr[0] = u32ErrCntTcpSndBuf;
	u32FaultEthTxErr[1] = u32ErrCntTcpWrite;
	u32FaultEthTxErr[2] = u32ErrCntTcpOut;
	u32FaultEthTxErr[3] = u32ErrCntTcpTxAct;
	u32FaultEthTxErr[4] = u32ErrCntTxTimeout;
	u32FaultEthTxErr[5] = u32ErrCntTxRetry;
	u32FaultEthTxErr[6] = u32ErrCntPktSkipped;
	u32FaultEthTxErr[7]++;
 
	u32FaultA2IpPktTxd = u32a2ipPktNum;
	u32FaultA2IpTcpWr  = u32A2IpTcpWrCnt;
 
	u32FaultEthTxECode = u32EthTxErrorCode;
	u32FaultEthTxELtst = u32EthErrTxLatest;
 
	/* output, try, might hang... */
	HAL_UART_Transmit_DMA(&huart3, (uint8_t *)szHardFault, strlen(szHardFault));
 
/* RESET */
	while( u32ResetDelay++ < (uint32_t)100E6 );
	NVIC_SystemReset();
 
	while( 1 )
	{
	}
}

and the global variables declaration:

/* FAULT handling stuff
 *	-> do NOT initialize!
 */
uint32_t u32FaultCFSR	__attribute__((section(".NoInitSection")));
uint32_t u32FaultHFSR 	__attribute__((section(".NoInitSection")));
uint32_t u32FaultMMFAR 	__attribute__((section(".NoInitSection")));
uint32_t u32FaultBFAR 	__attribute__((section(".NoInitSection")));
 
uint32_t u32FaultCpuLR 	__attribute__((section(".NoInitSection")));
uint32_t u32FaultCpuSP 	__attribute__((section(".NoInitSection")));
uint32_t u32FaultCpuPC 	__attribute__((section(".NoInitSection")));
 
uint32_t u32FaultCpuR[DEBUG_FAULT_R_NUM] 	__attribute__((section(".NoInitSection")));
 
uint32_t u32FaultEthTxErr[DEBUG_FAULT_TX_ERR_NUM] __attribute__((section(".NoInitSection")));
 
uint32_t u32FaultA2IpPktTxd	__attribute__((section(".NoInitSection")));
uint32_t u32FaultA2IpTcpWr	__attribute__((section(".NoInitSection")));
 
uint32_t u32FaultEthTxECode	__attribute__((section(".NoInitSection")));
uint32_t u32FaultEthTxELtst	__attribute__((section(".NoInitSection")));
 
const char szHardFault[] = "\n\r#### ERR ####\n\rHard Fault - Deadlock - Reset...\n\r\t\"cr Se\" after restart\n\r";
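
Since no-init variables hold random data after a power-on, a marker value can help to tell a real fault record from garbage. A minimal sketch of that idea, not part of the code above; FAULT_MAGIC, struct fault_record and both functions are hypothetical names:

```c
#include <stdint.h>
#include <stdbool.h>

#define FAULT_MAGIC 0xFA17DA7Au   /* arbitrary marker value, hypothetical */

/* On target, one instance of this would be placed in the no-init section. */
struct fault_record {
    uint32_t magic;
    uint32_t cfsr;
    uint32_t bfar;
    uint32_t pc;
};

/* Called from the fault handler: fill the record and mark it valid. */
static inline void fault_record_store(struct fault_record *rec,
                                      uint32_t cfsr, uint32_t bfar, uint32_t pc)
{
    rec->cfsr  = cfsr;
    rec->bfar  = bfar;
    rec->pc    = pc;
    rec->magic = FAULT_MAGIC;
}

/* Called after reset: copy out a valid record once, then invalidate it. */
static inline bool fault_record_take(struct fault_record *rec,
                                     struct fault_record *out)
{
    if (rec->magic != FAULT_MAGIC) {
        return false;           /* power-on garbage or already reported */
    }
    *out = *rec;
    rec->magic = 0;
    return true;
}
```

After reset, the startup code would call fault_record_take() and print the record only if it returns true, so stale or random values are never reported as a fault.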

SMacl.1 · 2022-10-07

Your fault handler won't give you the register values you want. For example, you are recording the PC value within the fault handler itself. The PC at actual fault time is pushed onto the stack, one register of eight, by the hardware, just before the fault handler is called. You just have to work out WHICH stack, msp or psp. If you look at my fault handler code, the asm part works out the correct stack, via inspection of LR, and calls the C part, which locates faulting PC from that stack.
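
The idea can be sketched as follows. This is a simplified illustration, not the code from my repo: it assumes GCC on an ARMv7-M target for the asm part, and the names stacked_frame, g_fault_frame and fault_handler_c are made up for this example.

```c
#include <stdint.h>

/* Register order pushed by the hardware on exception entry (ARMv7-M). */
struct stacked_frame {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

/* Hypothetical record; on target it could live in the no-init section. */
static struct stacked_frame g_fault_frame;

void fault_handler_c(const uint32_t *frame);

/* C part: 'frame' points at the stack that was active when the fault hit.
 * On target this should end in a reset instead of returning, because a
 * return would re-execute the faulting instruction. */
void fault_handler_c(const uint32_t *frame)
{
    g_fault_frame.r0   = frame[0];
    g_fault_frame.r1   = frame[1];
    g_fault_frame.r2   = frame[2];
    g_fault_frame.r3   = frame[3];
    g_fault_frame.r12  = frame[4];
    g_fault_frame.lr   = frame[5];   /* LR of the faulting code */
    g_fault_frame.pc   = frame[6];   /* address of the faulting instruction */
    g_fault_frame.xpsr = frame[7];
}

#if defined(__arm__) && defined(__thumb__)
/* Asm part: bit 2 of LR (EXC_RETURN) selects MSP or PSP; the chosen stack
 * pointer is passed to the C part in r0. Only built for ARM targets. */
__attribute__((naked)) void HardFault_Handler(void)
{
    __asm volatile(
        "tst   lr, #4          \n"
        "ite   eq              \n"
        "mrseq r0, msp         \n"
        "mrsne r0, psp         \n"
        "b     fault_handler_c \n");
}
#endif
```

The PC stored this way can be looked up in the map file or disassembly to find the exact load or store that produced the BFAR address.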

FYI I am adding a 'quiz' to my repo, where there'll be various fault dumps and various C snippets that produced them, and you have to match them.

SMacl.1 · 2022-10-07

To be more specific, faulting PC is located at line 277 of

https://github.com/tobermory/faultHandling-cortex-m/blob/main/src/main/c/faultHandling.c

LCE · 2022-10-10

@SMacl.1

Thanks again for your reply!

At least I had found that one myself: I only got the current PC, i.e. the hard fault handler.

I'll check your code again.