
How to debug faults which only occur with the debugger disconnected. STM32H7B3I-DK

Garnett.Robert
Senior III

Hi,

I had a difficult problem where my project would run faultlessly when started with the debugger, but would fail when the power was cycled.

I thought that a peripheral was not initializing because of a delay in the power-up phase. Adding a HAL_Delay of 2000 ms did not fix this, and in any event the supply was reaching 3.3 V in under 300 µs.

I suspected a hard fault, so I added a GPIO output to the hard fault handler code:

void HardFault_Handler(void)
{
	#if DEBUG_NMI_FAULTS == 1

	/* Hold the LED pin high long enough for the fault to be visible. */
	for(uint32_t i = 1000000; i > 0 ; i--)
	{
		HAL_GPIO_WritePin(ImpulseLED_GPIO_Port, ImpulseLED_Pin, GPIO_PIN_SET);
	}

	HAL_GPIO_WritePin(ImpulseLED_GPIO_Port, ImpulseLED_Pin, GPIO_PIN_RESET);

	/* Restart the system once the fault has been signalled. */
	NVIC_SystemReset();
	#endif
}

Power cycling caused the GPIO to go high, confirming the hard fault.

The problem was finding where the hard fault was occurring, and why.

Following application note AN4989 (STM32 microcontroller debug toolbox), I set up UART 1, which is routed to the ST-LINK microcontroller, to send data to the ST-LINK Virtual COM port.

I then modified the startup file to include separate handlers for the different types of faults.

See the snippet below:

/*****************************************************************************/
  .section  .text.Reset_Handler
 
/*****************************************************************************/
	.weak  HardFault_Handler
	.type  HardFault_Handler, %function
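	HardFault_Handler:
	  movs r0,#4             /* mask for bit 2 of EXC_RETURN */
	  movs r1, lr            /* on exception entry LR holds EXC_RETURN */
	  tst r0, r1             /* bit 2 clear: frame is on MSP; set: on PSP */
	  beq _MSP1
	  mrs r0, psp            /* r0 = pointer to the stacked exception frame */
	  b _HALT1
	_MSP1:
	  mrs r0, msp
	_HALT1:
	  ldr r1,[r0,#20]        /* r1 = stacked LR (word 5 of the frame) */
	  b HardFault_Handler_C  /* r0 is passed as the first C argument */
	  bkpt #0                /* not reached: the branch above does not return */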
	.size  HardFault_Handler, .-HardFault_Handler

This code replaces the standard fault handlers in the interrupt file, which must be commented out or removed so that the assembly versions get included in the link. You must also force the linker to use these by making the following modifications to another part of the startup file:

   .weak      HardFault_Handler
   .thumb_set HardFault_Handler,HardFault_Handler
 
   .weak      MemManage_Handler
   .thumb_set MemManage_Handler,MemManage_Handler
 
   .weak      BusFault_Handler
   .thumb_set BusFault_Handler,BusFault_Handler
 
   .weak      UsageFault_Handler
   .thumb_set UsageFault_Handler,UsageFault_Handler
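The stack selection done at the top of the assembly HardFault_Handler can also be written out in C, which may make it easier to follow. This is only a sketch: the helper names `frame_is_on_psp` and `select_frame` are hypothetical and not part of the attached files. It relies on bit 2 of EXC_RETURN (the value in LR on exception entry) being 0 when the frame was stacked on MSP and 1 when it was stacked on PSP.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 2 of EXC_RETURN selects the stack that holds the exception frame:
 * 0 = main stack (MSP), 1 = process stack (PSP). */
#define EXC_RETURN_SPSEL_MASK 0x4u

/* Hypothetical helper: true if the faulting context was using PSP. */
bool frame_is_on_psp(uint32_t exc_return)
{
    return (exc_return & EXC_RETURN_SPSEL_MASK) != 0u;
}

/* Hypothetical helper: pick the frame pointer from the two stack pointers,
 * the same decision the tst/beq pair makes in the assembly handler. */
const uint32_t *select_frame(uint32_t exc_return,
                             const uint32_t *msp,
                             const uint32_t *psp)
{
    return frame_is_on_psp(exc_return) ? psp : msp;
}
```

For example, an EXC_RETURN of 0xFFFFFFF9 (return to thread mode using MSP) selects the MSP frame, while 0xFFFFFFFD selects the PSP frame.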

The assembly code calls a C handler for each type of fault that has occurred, for example:

void HardFault_Handler_C(unsigned int *hardfault_args)
{
	_BFAR = SCB->BFAR;
	_MMAR = SCB->MMFAR;
	_CFSR = SCB->CFSR;

	stacked_r0  = ((unsigned int) hardfault_args[0]);
	stacked_r1  = ((unsigned int) hardfault_args[1]);
	stacked_r2  = ((unsigned int) hardfault_args[2]);
	stacked_r3  = ((unsigned int) hardfault_args[3]);
	stacked_r12 = ((unsigned int) hardfault_args[4]);
	stacked_lr  = ((unsigned int) hardfault_args[5]);
	stacked_pc  = ((unsigned int) hardfault_args[6]);
	stacked_psr = ((unsigned int) hardfault_args[7]);

#if(PRINT_HARD_FAULTS == 1)
	printf("\n\r==== [HardFault_Handler] ====\n\r");
	printOutFault();
#endif

#if(LOOP_AT_COMPLETION == 1)
	__asm("BKPT #0\n") ; // Break into the debugger
#else
	for(uint32_t i = 1000000; i > 0 ; i--)
	{
		HAL_GPIO_WritePin(ImpulseLED_GPIO_Port, ImpulseLED_Pin, GPIO_PIN_SET);
	}

	HAL_GPIO_WritePin(ImpulseLED_GPIO_Port, ImpulseLED_Pin, GPIO_PIN_RESET);
	NVIC_SystemReset();
#endif
}

The hard fault handler prints out details of the fault to a terminal program such as Tera Term, connected to the ST-LINK Virtual COM port, which appears in the COM port list in Windows Device Manager.

In my case the fault occurred only in a FIR filter, and only when the FIR filter's circular buffer was placed in the DTCM. I found the location from the program counter value that the fault handler printed to the terminal.

Once I inspected the code it was obvious where the problem was. When the buffer was placed in the DTCM RAM it wasn't zero-initialised, so the pointer into the buffer held an invalid value. When the same buffer pointer was left in the default AXI RAM it got initialised. I am unsure why it didn't fail when started from the debugger, but I suspect the DTCM RAM must be initialised in the startup phase. Does anyone have a theory on this?
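The value captured from SCB->CFSR is easier to read on the terminal if the set bits are turned into names. The following is a minimal sketch of how that could be done. It is not the `printOutFault()` from the attached files, and `decode_cfsr` and `cfsr_flags` are hypothetical names. The bit positions are the ARMv7-M Configurable Fault Status Register fields.

```c
#include <stddef.h>
#include <stdint.h>

struct cfsr_flag { uint32_t mask; const char *name; };

/* CFSR = MemManage status (bits 0-7), BusFault status (bits 8-15),
 * UsageFault status (bits 16-31). */
static const struct cfsr_flag cfsr_flags[] = {
    { 1u << 0,  "IACCVIOL"    }, { 1u << 1,  "DACCVIOL"   },
    { 1u << 3,  "MUNSTKERR"   }, { 1u << 4,  "MSTKERR"    },
    { 1u << 5,  "MLSPERR"     }, { 1u << 7,  "MMARVALID"  },
    { 1u << 8,  "IBUSERR"     }, { 1u << 9,  "PRECISERR"  },
    { 1u << 10, "IMPRECISERR" }, { 1u << 11, "UNSTKERR"   },
    { 1u << 12, "STKERR"      }, { 1u << 13, "LSPERR"     },
    { 1u << 15, "BFARVALID"   }, { 1u << 16, "UNDEFINSTR" },
    { 1u << 17, "INVSTATE"    }, { 1u << 18, "INVPC"      },
    { 1u << 19, "NOCP"        }, { 1u << 24, "UNALIGNED"  },
    { 1u << 25, "DIVBYZERO"   },
};

/* Hypothetical helper: store the names of the set CFSR flags in names[]
 * (at most max_names entries) and return how many were stored. */
size_t decode_cfsr(uint32_t cfsr, const char *names[], size_t max_names)
{
    size_t count = 0;
    for (size_t i = 0; i < sizeof cfsr_flags / sizeof cfsr_flags[0]; i++) {
        if ((cfsr & cfsr_flags[i].mask) != 0u && count < max_names) {
            names[count++] = cfsr_flags[i].name;
        }
    }
    return count;
}
```

A handler could call `decode_cfsr(_CFSR, names, 8)` and print each returned string. When BFARVALID or MMARVALID is among them, the saved BFAR or MMFAR value holds the faulting address.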

Initializing the buffer pointer to zero fixed the problem.
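As an illustration of the fix, here is a small sketch of a circular-buffer FIR filter whose state is initialised explicitly instead of relying on the startup code to zero it. This is not the filter from my project: `fir_t`, `fir_init`, `fir_step` and `FIR_TAPS` are hypothetical names, and it uses an index where the original used a pointer. The point is only that the position variable is set in code, so it no longer matters which RAM region the object is linked into.

```c
#include <stddef.h>
#include <stdint.h>

#define FIR_TAPS 4u

typedef struct {
    float    state[FIR_TAPS]; /* circular buffer of past samples */
    uint32_t head;            /* slot where the next sample is written */
} fir_t;

/* Explicitly initialise everything; do not rely on .bss zeroing. */
void fir_init(fir_t *f)
{
    for (size_t i = 0; i < FIR_TAPS; i++) {
        f->state[i] = 0.0f;
    }
    f->head = 0u;
}

/* Push one sample and return the filter output.
 * coeffs[0] weights the newest sample. */
float fir_step(fir_t *f, const float coeffs[FIR_TAPS], float sample)
{
    float    acc = 0.0f;
    uint32_t idx;

    f->state[f->head] = sample;
    idx = f->head;
    for (size_t k = 0; k < FIR_TAPS; k++) {
        acc += coeffs[k] * f->state[idx];
        idx = (idx == 0u) ? (FIR_TAPS - 1u) : (idx - 1u); /* walk backwards */
    }
    f->head = (f->head + 1u) % FIR_TAPS;
    return acc;
}
```

Calling `fir_init()` once before the first `fir_step()` gives the same behaviour after a power cycle as after a debugger start, because nothing depends on what the RAM happened to contain.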

I have attached the relevant files as a zip to assist others. I have included a flowchart (PDF) of the problem I experienced to assist in understanding.

I didn't produce all the code from scratch; there is a lot of other people's work in this. I just pulled it together to find a particular bug.

4 REPLIES

>>How to debug faults which only occur with the debugger disconnected..

Have a complete disassembly listing and map files available so you can pinpoint the faulting code, or a parallel system on which you can navigate the code in the debugger.

Have a proper fault handler that outputs actionable data/telemetry so you can diagnose issues.

Instrument your code so you can get flow/dynamics information. Ideally selectable verbosity.

Enable asserts, and output file / line number data for these and for Error_Handler().

Don't die silently in while(1) loops; a unit that "just dies" in the field is impossible to debug.

I have used this type of fault handler on Cortex-M, which I've worked with for a decade plus:
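As a rough illustration of the assert advice, the sketch below reports the file and line of a failed check instead of stopping silently. The names `APP_ASSERT`, `app_assert_failed` and the three globals are hypothetical. On an STM32 HAL project the usual hook is `assert_failed(uint8_t *file, uint32_t line)`, which is called when `USE_FULL_ASSERT` is defined, and the same idea applies to `Error_Handler()`. It assumes `printf` has been retargeted to a UART.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Record of the most recent failure, kept so it can also be inspected
 * from a debugger or written to non-volatile storage. */
const char *last_assert_file = NULL;
uint32_t    last_assert_line = 0u;
uint32_t    assert_fail_count = 0u;

/* Hypothetical reporting hook: emit actionable data, then let the
 * caller decide whether to reset, halt or carry on. */
void app_assert_failed(const char *file, uint32_t line)
{
    last_assert_file = file;
    last_assert_line = line;
    assert_fail_count++;
    printf("ASSERT FAILED %s:%lu\r\n", file, (unsigned long)line);
}

/* Evaluate expr; on failure report where it happened. */
#define APP_ASSERT(expr) \
    ((expr) ? (void)0 : app_assert_failed(__FILE__, (uint32_t)__LINE__))
```

A check such as `APP_ASSERT(ptr != NULL);` then leaves a file name and line number on the terminal when it fails.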

https://github.com/cturvey/RandomNinjaChef/blob/main/KeilHardFault.c

The startup.s code in GNU/GCC is less than ideal at clearing memories. ST's implementation is poor, and external memories typically are handled properly by SystemInit().

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..
Garnett.Robert
Senior III

So, wasn't my method any good?

It did work.

It didn't take me very long, once I implemented the HFH after a lot of googling. My googling didn't turn up your favourite version of the HFH, or if it did I missed it.

I posted this to give a detailed way of finding this type of fault, to help hackers like myself. The devil is in the detail.

In my opinion the C "standard" of zeroing out variables is a trap. I reckon the programmer should take each variable in turn and initialise it with zero or any other appropriate number, and not rely on the peculiarities of embedded C. In many cases variables don't require initialisation at all. I didn't write the FIR filter; someone else did. If I had written it I would have ensured all variables were initialised correctly and not relied on start-up code. They zeroed out the state arrays, but left it to the compiler to zero the pointer. Not explicitly zeroing the pointer to a circular buffer is surprising.

I like https://github.com/cturvey/RandomNinjaChef/blob/main/KeilHardFault.c

I will incorporate this into my debugging activities in future.

Have you any idea why the system worked when started from the debugger, but failed when it was power cycled? I thought that the start-up code would be the same whether the debugger (in my case gdb) or a power-on reset booted the system.

Thanks for your response.

Best regards

Rob

The debugger brings up clocks, GPIO, DWT, ITM, DBGMCU and other settings it needs to operate, so the system is not in a normal "reset" condition.

It can also ignore BOOT pins.

Stacks might be better filled with stress patterns; the statics are supposed to be initialized properly. Contractually some should be zero.

Garnett.Robert
Senior III

Hi Tesla,

Debuggers are a bit of a black art to me.

I have filled up my stacks with stress patterns. I always do this with a complex project, as it is a quick and easy way to see whether stacks will be a problem. I don't do this in assembler; I used C, as ARM assembly language is not my strong suit, although I started off hand-coding 6800 D2 kits using the hex keypad, so I might have to make an effort and tackle the ARM.
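For anyone who has not done this before, the idea can be sketched in C as below. This is not my actual code: `stack_paint`, `stack_unused_words` and the 0xDEADBEEF pattern are hypothetical choices, and the sketch assumes a full-descending stack, so the lowest addresses are the last to be used. On a real target the painting must stop below the current stack pointer, and the base address and size would come from the linker script symbols.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_PATTERN 0xDEADBEEFu

/* Fill a stack region with a known pattern before it is used. */
void stack_paint(uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = STACK_FILL_PATTERN;
    }
}

/* Count the words, starting at the lowest address, that still hold the
 * pattern. On a descending stack this is the headroom never touched. */
size_t stack_unused_words(const uint32_t *base, size_t words)
{
    size_t n = 0;
    while (n < words && base[n] == STACK_FILL_PATTERN) {
        n++;
    }
    return n;
}
```

Calling `stack_unused_words()` after the application has run through its worst-case paths gives a measured figure for how much stack was never touched.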

It turned out I had far more stack space than I needed. With stacks I look at the locals, think of a number multiply by ten, take away the number I first thought of and the answer is? Probably incorrect.