STM32F103, hardfault problem

gnu · 2014-10-02

Posted on October 02, 2014 at 14:23

Hello everyone,

I've had this hardfault problem for quite a long time now. I worked around it by putting a software reset in the hardfault loop, but that is no longer acceptable.

I suspect it has something to do with sending and receiving messages over the power line. It happens once a day more or less so it's difficult to predict when it occurs.

I am using the Joseph Yiu hardfault handler code to find the cause of it.

I attached a screenshot with the values I get on the right side. I have also included the map file and the memory.

When I analyse the PC I don't see anything wrong.

Could you please give me a hand finding the source of the problem?

Thank you for your attention!

frankmeyer9 · 2014-10-02

Posted on October 02, 2014 at 14:44

I suspect it has something to do with sending and receiving messages over the power line. It happens once a day more or less so it's difficult to predict when it occurs.

Runtime errors are, indeed, sometimes difficult to find.

A proven method is code instrumentation, i.e. adding debug output and logging it.

(This could also be done via GPIO pin toggling and a scope with an appropriate trigger condition set.)
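As a minimal sketch of such instrumentation, a small in-RAM ring buffer can record the last few events so they can be read out with the debugger after the fault. All names here (trace_log, trace_recent, TRACE_DEPTH) are hypothetical, and the code as written is not interrupt-safe; it only illustrates the idea.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trace log; names and sizes are illustrative only. */
#define TRACE_DEPTH 16u  /* must be a power of two */

typedef struct {
    uint16_t id;     /* which code location logged the event */
    uint32_t value;  /* one word of context, e.g. a length or a pointer */
} trace_entry_t;

static trace_entry_t trace_buf[TRACE_DEPTH];
static volatile uint32_t trace_head;  /* total number of events logged */

/* Record one event; the oldest entry is overwritten once the buffer is full.
   Not interrupt-safe: guard it if both ISRs and main code call it. */
void trace_log(uint16_t id, uint32_t value)
{
    uint32_t slot = trace_head & (TRACE_DEPTH - 1u);
    trace_buf[slot].id = id;
    trace_buf[slot].value = value;
    trace_head++;
}

/* Return the n-th most recent entry (0 = newest), or NULL if it is gone. */
const trace_entry_t *trace_recent(uint32_t n)
{
    if (n >= trace_head || n >= TRACE_DEPTH) {
        return NULL;
    }
    return &trace_buf[(trace_head - 1u - n) & (TRACE_DEPTH - 1u)];
}
```

Calling trace_log() at the entry of each message handler, with the message length as the value, would show what the firmware was doing just before the fault.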

Assuming it is not caused by an untested code path, two causes come to my mind.

The first is a stack overflow, which would be easy to test for and correct, and which should get worse with additional instrumentation.
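One common way to test for this is stack painting: fill the stack region with a known pattern at startup and later check how much of the pattern survives. The sketch below assumes the stack base and size are taken from the linker or map file; the function names and the fill pattern are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xDEADBEEFu  /* arbitrary pattern, assumed unlikely on a real stack */

/* Paint a stack region with the fill pattern. Call once at startup, and
   only on the part of the region that is not yet in use. */
void stack_paint(uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = STACK_FILL;
    }
}

/* Count the words still holding the pattern, scanning up from the lowest
   address. Cortex-M stacks grow downward, so this is the unused headroom.
   A result of 0 means the stack reached (or passed) its lower limit. */
size_t stack_free_words(const uint32_t *base, size_t words)
{
    size_t n = 0;
    while (n < words && base[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

If the free count drops to zero or near zero after a day of running, the stack is too small for the worst-case call and interrupt nesting.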

A second common cause is access to runtime-calculated addresses, for instance an array index calculation that goes out of bounds.

A non-software related cause could be EMI issues - powerline communication suggests mains voltage somewhere on your board. You could try galvanic isolation for all channels to your MCU, or test with simulated PLC input in an EMI-free environment.

stm322399 · 2014-10-02

Posted on October 02, 2014 at 15:38

Well, there is a write to 0x20005007, which is certainly the cause of the hardfault.
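To confirm which access faulted, the bus-fault bits of the Cortex-M3 Configurable Fault Status Register (CFSR, at 0xE000ED28) can be checked in the handler. The following is only a sketch of the decoding step; the type and function names are hypothetical, and reading the register itself is left to the target code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bus-fault bits of the Cortex-M3 CFSR. */
#define CFSR_PRECISERR   (1u << 9)   /* precise data bus error */
#define CFSR_IMPRECISERR (1u << 10)  /* imprecise data bus error */
#define CFSR_BFARVALID   (1u << 15)  /* BFAR holds the faulting address */

typedef enum {
    BUS_FAULT_NONE,
    BUS_FAULT_PRECISE,
    BUS_FAULT_IMPRECISE
} bus_fault_kind_t;

/* Classify the data bus fault recorded in a CFSR value. */
bus_fault_kind_t bus_fault_kind(uint32_t cfsr)
{
    if (cfsr & CFSR_PRECISERR) {
        return BUS_FAULT_PRECISE;
    }
    if (cfsr & CFSR_IMPRECISERR) {
        return BUS_FAULT_IMPRECISE;
    }
    return BUS_FAULT_NONE;
}

/* True if the Bus Fault Address Register (0xE000ED38) can be trusted. */
bool bfar_is_valid(uint32_t cfsr)
{
    return (cfsr & CFSR_BFARVALID) != 0u;
}
```

With an imprecise fault, the stacked PC may already be past the store that caused it, so the address has to be inferred from the registers rather than read from BFAR.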

How did we reach that point? The running code is IAR memcpy (which is supposed to work). The register values are alarming because R2 should be the byte count and it is negative. I guess this is the first iteration of the copy loop, otherwise the BCS would have stopped the loop. The target pointer is unaligned, which should not happen when executing this loop (it is made for aligned pointers).

This can happen when an interrupt handler does not correctly restore the registers (hard to do with Cortex-M3 and C-like interrupt handlers).

Another explanation is that memcpy was given 0x20005000 as its destination argument (is this the end of SRAM?) due to another error.
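To catch this kind of error before the copy runs, the destination range could be checked against the SRAM window at the call site. This is a sketch under an assumption: it takes SRAM to be 20 KB starting at 0x20000000, which would make 0x20005000 the first address past the end. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed SRAM window: 20 KB starting at 0x20000000. Check the datasheet
   for the exact part before relying on these values. */
#define SRAM_START 0x20000000u
#define SRAM_END   0x20005000u  /* exclusive */

/* True if [addr, addr + len) lies entirely inside [start, end). */
bool range_in_window(uint32_t addr, uint32_t len, uint32_t start, uint32_t end)
{
    if (addr < start || addr > end) {
        return false;
    }
    /* Compare against the remaining space so that addr + len cannot wrap. */
    return len <= end - addr;
}

/* Check a copy destination against the assumed SRAM window. */
bool dest_in_sram(uint32_t addr, uint32_t len)
{
    return range_in_window(addr, len, SRAM_START, SRAM_END);
}
```

A negative byte count converted to an unsigned length becomes a huge value and is rejected by the same check, so logging every failed check would point at the caller that passed the bad arguments.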

It would be good to expose the full call stack at the moment of the hardfault (put a breakpoint at HardFault_Handler). This could help to better understand what's going on.
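Alongside the call stack, the eight words the Cortex-M3 pushes on exception entry give the register state at the fault. A minimal sketch of decoding that frame follows; the struct and function names are hypothetical, and the assembly shim that picks MSP or PSP and passes it in is toolchain-specific and not shown.

```c
#include <stdint.h>

/* Registers the Cortex-M3 pushes automatically on exception entry,
   in ascending address order starting at the stacked SP. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12;
    uint32_t lr;   /* link register of the interrupted code */
    uint32_t pc;   /* where execution was interrupted */
    uint32_t psr;  /* program status register */
} fault_frame_t;

/* Copy the eight stacked words into a struct that is easy to inspect
   in a debugger or to write to a log. */
fault_frame_t fault_frame_decode(const uint32_t *stacked_sp)
{
    fault_frame_t f;
    f.r0  = stacked_sp[0];
    f.r1  = stacked_sp[1];
    f.r2  = stacked_sp[2];
    f.r3  = stacked_sp[3];
    f.r12 = stacked_sp[4];
    f.lr  = stacked_sp[5];
    f.pc  = stacked_sp[6];
    f.psr = stacked_sp[7];
    return f;
}
```

The stacked LR identifies the caller of the interrupted function, which is the first step in walking back to whoever called memcpy with bad arguments.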