Help with debugging an INVSTATE hard fault on STM32F405 microcontroller

balkers · 2018-10-18

I'm having a hard time tracking down the source of a hard fault occurring in my program, written in C++ for the STM32F405 microcontroller.

I only use three sources of interrupts in this project: two timer interrupts (TIM2 overflow at 36 kHz with intermediate priority, and TIM3 overflow at 12 kHz with the lowest priority) and a DMA1 transfer complete interrupt with the highest priority.

The hardest part of tracking that error comes from the fact that it seems completely unpredictable - it can happen a few seconds into the program, or the whole program can run for half an hour without problems and then crash.

Using the built-in fault analyzer of Atollic TrueStudio I've been able to find out that the actual type of fault is INVSTATE. I did some reading on this issue, and apparently this error can be caused by trying to branch to an instruction at an even address (LSB equal to 0). Fortunately, the fault analyzer does all the stack-related calculations for the user and lets you jump to the faulting C or assembly line.
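
For reference, the INVSTATE flag belongs to the UsageFault Status Register, which on Cortex-M3/M4 is the upper halfword of the CFSR at 0xE000ED28. Below is a minimal host-side sketch of checking a captured CFSR value without the IDE; the names `UsageFault` and `usage_fault_set` are made up for illustration, and on the target the value would be read from `SCB->CFSR`.

```cpp
#include <cstdint>

// Bit positions inside the UsageFault Status Register (UFSR).
// UFSR is the upper halfword of CFSR, so these are offsets above bit 16.
// Illustrative enum, not taken from any ST or CMSIS header.
enum class UsageFault : std::uint32_t {
    UNDEFINSTR = 0,
    INVSTATE   = 1,
    INVPC      = 2,
    NOCP       = 3,
    UNALIGNED  = 8,
    DIVBYZERO  = 9
};

// Returns true if the given usage-fault flag is set in a captured CFSR value.
constexpr bool usage_fault_set(std::uint32_t cfsr, UsageFault f) {
    return (((cfsr >> 16) >> static_cast<std::uint32_t>(f)) & 1u) != 0u;
}
```

With a captured value of 0x00020000, `usage_fault_set(value, UsageFault::INVSTATE)` is true and every other flag is clear.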

I jumped to the assembly line, and it turns out this is the faulting line:

080022fe:  cmp   r4, r8

I was confused at first, because there's no branching involved in this line, but I think that it may be the first line AFTER the fault has occurred, so I looked at the previous line:

  080022fa:  bl   0x8001dd4 <Generator::getOutput()>

So there's our branch. I'm assuming that the program jumps to the Generator::getOutput() function and crashes on its return. This function call happens in the TIM3 overflow ISR.

Looking at the fault analyzer's snapshot of the register state just before the fault, I noticed a weird thing: the link register pointed to a location in RAM (0x20001e70), not in FLASH memory. The function that supposedly crashes ends with a `bx lr` instruction, so that would explain the INVSTATE fault (the address is even). Also, that address points to some field of a statically allocated object, not to the stack.

This value appears in the LR register only when the fault occurs. As I mentioned, the program can run for minutes without crashing, which means that this function gets called millions of times, and when I stepped through this function many times the LR register always contained the proper return value (`0x080022ff`).
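
To make the arithmetic behind those two LR values explicit, here is a small sketch. The helper names `lr_after_bl` and `bx_target_keeps_thumb` are hypothetical and only mirror the reasoning above: a 32-bit Thumb `bl` at 0x080022fa returns to 0x080022fe, and LR carries that address with bit 0 set.

```cpp
#include <cstdint>

// A 32-bit Thumb BL occupies 4 bytes, so execution resumes at bl_address + 4.
// BL stores that address in LR with bit 0 set to mark Thumb state.
constexpr std::uint32_t lr_after_bl(std::uint32_t bl_address) {
    return (bl_address + 4u) | 1u;
}

// BX copies bit 0 of its target into the T bit. If bit 0 is clear, T is
// cleared and the next instruction the core tries to execute raises INVSTATE.
constexpr bool bx_target_keeps_thumb(std::uint32_t target) {
    return (target & 1u) != 0u;
}
```

Here `lr_after_bl(0x080022fa)` gives 0x080022ff, which matches the value seen while stepping, whereas `bx_target_keeps_thumb(0x20001e70)` is false for the value captured at the fault.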

So the thing that is bugging me: where could this value come from? There are no further function calls or branches in the `getOutput()` function, so the LR register *does not* get updated there. The only interrupt with a higher priority than the currently serviced interrupt is the DMA Transfer Complete interrupt. Here's the DMA ISR:

  void DMA1_Stream4_IRQHandler(void) {
  	if(DMA_GetITStatus(DMA1_Stream4, DMA_IT_TCIF4))
  	{
  		DMA_ClearITPendingBit(DMA1_Stream4, DMA_IT_TCIF4);
  		if (channel_to_send < 4) {
  			dac.DmaSend(channel_to_send++);
  		}
  	}
  }

As you can see, the ISR calls the DmaSend function. Here's its code:

  void Dac::DmaSend(uint8_t channel) {
  	uint32_t value = offset_[channel] + register_[channel]*multiplier_[channel];
  	if (value > 65535) {
  		value = 0;
  	} else {
  		value = 65535 - value;
  	}
  	GPIOC->BSRRL |= GPIO_Pin_5;						
  	output_buffer_[0] = (1 << 5) | (channel << 1);
  	output_buffer_[1] = (uint8_t)(value >> 8);
  	output_buffer_[2] = (uint8_t) value;
  	GPIOC->BSRRH |= GPIO_Pin_5;
  	DMA_Cmd(DMA1_Stream4, ENABLE);	
  }

Is it possible that the DMA interrupt service routine somehow corrupts the link register? The only real memory write (outside of writing the SFRs) happening in the ISR is the write to the output_buffer_ array, and it does not overflow (the array is declared as uint8_t output_buffer_[3]). If not, then what could possibly corrupt the LR value? Or am I missing something here and the value is completely fine?

waclawek.jan · 2018-10-18

I'd say you may have taken your conclusions a bit too far. Post the content of the registers and the stack when the fault hits. BL can't throw a Usage Fault (if that's what you get), and BX LR with invalid LR content wouldn't return to the caller, i.e. you wouldn't see PC pointing into the calling function.

JW

Tesla DeLorean · 2018-10-18

What would actually be helpful is a dump of the registers, and a more complete disassembly around the fault and the routines being called.

Also look at the code at 0x8001dd4 <Generator::getOutput()>

Look at the code the processor is executing, perhaps walk a listing file.

See my recommended Hard Fault Handler code, https://community.st.com/s/question/0D50X00009bNCRgSAO/hi-im-using-stm32f103zd-and-iwm-getting-after-some-hours-hard-faulthow-can-i-detect-the-reason-and-solve-the-problem

balkers · 2018-10-19

Thanks to both of you, I'll post the registers as soon as the hard fault occurs again. I've changed the interrupt priorities (TIM3 now has a higher priority than DMA), and the fault hasn't happened in 24 hours of continuous operation, so it may take a while.

I wonder whether changing the priorities has cured this issue, or whether it's just kicking the can further down the road and the crash will eventually occur, only after a much longer period of time.

balkers · 2018-10-25

OK, so I've made some further changes to my program and I get the fault at the same line as before.

Here's the register dump from TrueStudio's fault analyzer:

sp (MSP) 0x2001fefc
r0       0x2
r1       0xe740
r2       0x8047ec8
r3       0xff37
r12      0x20001700
lr       0x0
pc       0x8002697
xpsr     0x20001ab4

Interestingly, the LR is not pointing to the RAM anymore, but its value is 0x0.

Here's the disassembly from around the faulty line:

08002682:   vldr    s14, [r5, #12]
08002686:   vldr    s15, [r6]
0800268a:   vfma.f32        s15, s14, s16
0800268e:   vstr    s15, [r6]
08002692:   bl      0x800215c <Generator::getOutput()>
 29       	for (int i = 0; i < 4; ++i) {
08002696:   cmp     r4, r8
08002698:   add.w   r5, r5, #56     ; 0x38
0800269c:   strh.w  r0, [r7], #2
080026a0:   add.w   r6, r6, #4
080026a4:   bne.n   0x800267a <GeneratorWrapper::output()+34>

waclawek.jan · 2018-10-25

> Here's the register dump from TrueStudio's fault analyzer:

Odd PC value is unexpected. Can you please show us the content of stack, immediately after the fault?
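
For context, on exception entry a Cortex-M core pushes eight words (r0-r3, r12, lr, pc, xPSR) at the active stack pointer, which is the same set of registers the analyzer lists. A rough host-side sketch of labelling such a raw stack dump and checking the stacked PC follows. `StackedFrame`, `frame_from_words` and `pc_is_halfword_aligned` are illustrative names, and the sketch assumes the basic frame starts at the dumped SP (with the FPU in use, additional floating-point words follow the first eight).

```cpp
#include <cstdint>

// Basic exception stack frame, in the order the core pushes it
// (lowest address first). Illustrative type, not from any ST or ARM header.
struct StackedFrame {
    std::uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

// Label eight consecutive words read from the stack pointer at fault time.
constexpr StackedFrame frame_from_words(const std::uint32_t* w) {
    return StackedFrame{w[0], w[1], w[2], w[3], w[4], w[5], w[6], w[7]};
}

// The stacked return address is expected to be halfword aligned (bit 0 clear);
// Thumb state is carried by the T bit of the stacked xPSR instead.
constexpr bool pc_is_halfword_aligned(const StackedFrame& f) {
    return (f.pc & 1u) == 0u;
}
```

Fed with the eight values from the dump, the check fails for the stacked PC of 0x8002697, so looking at the raw words around SP is a reasonable next step.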

Do you have some fault handler installed? If so, can you please replace it with a dumb loop?

Try to avoid any "facility" imposed by Eclipse.

JW

PSali · 2018-11-27

I had similar problems with the STM32F401. I had several different fault types occurring at widely varying intervals across hundreds of systems. I implemented the Keil AN209 hard fault handler and bought a J-LINK. After weeks of analysis and tweaking I was not able to find a definitive root cause. Turning off the Prefetch Buffer solved it.

waclawek.jan · 2019-05-30

> Turning off the Prefetch Buffer solved it.

That to me sounds much like inadequate power supply/grounding/VCAP.

JW

turboscrew · 2019-05-30

Does the xPSR value look odd to anyone else?

As if there had been a return from an exception and the stack had been corrupted...

The T-bit should be '1', and I don't think there is an exception with the number in the last 9 bits.

If the T-bit is zero, it'll cause the INVSTATE fault.
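
The two xPSR fields in question can be pulled out with a couple of shifts. This is only a sketch with made-up helper names, using the ARMv7-M layout where T is bit 24 and the exception number occupies bits [8:0].

```cpp
#include <cstdint>

// T (Thumb state) flag: bit 24 of xPSR. It must be 1 on a Cortex-M core.
constexpr bool xpsr_thumb_bit(std::uint32_t xpsr) {
    return ((xpsr >> 24) & 1u) != 0u;
}

// Exception number (the IPSR field): bits [8:0] of xPSR.
constexpr std::uint32_t xpsr_exception_number(std::uint32_t xpsr) {
    return xpsr & 0x1FFu;
}
```

For the dumped value 0x20001ab4 this gives a T bit of 0 and an exception number of 0xB4 (180), consistent with the reading above.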