Random hardfault bug with STM32F730

Jukka Lamminmäki · 2019-06-06

We are struggling with a mysterious bug in our STM32F730 project; even the bug itself is not easy to explain.

We use FreeRTOS and our code runs XIP from an SST26VF016B QSPI flash.

IDE used is Atollic version: 9.1.0.

The problem is that the application crashes to hardfault vector, but not always.

Sometimes we can build a binary which runs perfectly for hours or "forever". But if we add a line of code at some random place, the binary crashes after a random time: seconds, minutes or hours - and not even near the line where the code was added.

When debugging the problem, one build always crashes at about the same place and another build in totally different place. There seems to be nothing common in between these places.

With the debugger we can get the LR, SP and PC values for the place where the hardfault occurs, but that is not helpful: they point to code in which there's nothing wrong and which has already run thousands of times before suddenly crashing.

The debug trace usually shows "signal handler called" as the last entry, at address 0xFFFFFFF1 or 0xFFFFFFFD. The problem seems to be somehow asynchronous, maybe related to interrupts.
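For reference, 0xFFFFFFF1 and 0xFFFFFFFD are not code addresses but Cortex-M EXC_RETURN values, which the core places in LR on exception entry. Below is a minimal sketch in C of decoding them, runnable on a PC. The names `exc_return_target` and `exc_return_has_fpu_frame` are made up for this illustration and are not part of any library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
typedef enum {
    EXC_NOT_EXC_RETURN,   /* value is an ordinary address           */
    EXC_TO_HANDLER_MSP,   /* fault preempted another handler        */
    EXC_TO_THREAD_MSP,    /* fault hit thread code on the main stack */
    EXC_TO_THREAD_PSP,    /* fault hit thread code on the process stack */
    EXC_RESERVED
} exc_target_t;

/* Decode the low nibble of an ARMv7-M EXC_RETURN value. */
static exc_target_t exc_return_target(uint32_t lr)
{
    if ((lr & 0xFFFFFFE0u) != 0xFFFFFFE0u)
        return EXC_NOT_EXC_RETURN;
    switch (lr & 0xFu) {
    case 0x1u: return EXC_TO_HANDLER_MSP;
    case 0x9u: return EXC_TO_THREAD_MSP;
    case 0xDu: return EXC_TO_THREAD_PSP;
    default:   return EXC_RESERVED;
    }
}

/* Bit 4 clear means an extended frame with FPU registers was stacked.
 * Only meaningful when exc_return_target() recognised the value. */
static int exc_return_has_fpu_frame(uint32_t lr)
{
    assert(exc_return_target(lr) != EXC_NOT_EXC_RETURN);
    return (lr & 0x10u) == 0u;
}
```

Read this way, 0xFFFFFFFD would mean the fault interrupted thread code running on the process stack (where FreeRTOS tasks run), and 0xFFFFFFF1 would mean it interrupted another exception handler.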

We do have a simple test software built with the same drivers and it has never crashed, although it uses the same interfaces: SPI, three UARTs, ADC, floating point calculations.

We have tried SW and HW floating points, different QSPI speeds and different compiler options.

What we would need is some trace of the code flow just before the hardfault occurs, but we can't put a breakpoint anywhere in the code.

Does anybody have any hints how to detect the bug? Has anybody encountered similar problems?

waclawek.jan · ‎2019-06-11

> how to catch it when we do not even have a hardfault occurring!?

Oh, bugs are a live thing and not all of them fail in a neat reproducible way clearly indicating the point of failure. The "random bug" is the worthy one to catch, and how to hunt them down, is the art of this trade.

So you have to have at hand a whole repertoire of tools - hardware like probes, an oscilloscope you have a good grip on, a logic analyzer, even the humble multimeter. "But I am a software engineer" is a whine to be left outside the door. Of course you have to have documentation at hand, and consult it often. And software - have a good grip on the toolchain and its darker corners, know thy mapfile. It's good to have a sleeveful of tricks in the mcu's software, too - like knowing how to toggle pins to be observed by a logic analyzer, how to output or store relevant debug info/files without relying on printf() and/or any "semihosting" automagic, how to use otherwise unused resources of the mcu like memories embedded in the peripherals, etc. But, when it comes to bugs, the disassembler is your biggest friend, together with the on-chip debugging utilities (but those have to be known and understood well, too). And remember - software, or the box on the table, may be called a debugger, but the real debugger sits on your chair.

So be innovative. The first thing is to get the bug to reproduce, at least in some way. Enhance reproduction by stressing the application - churn on it communication at maximum speed, connect pushbuttons to random pulse generators, feed the audio processor with noise or some 10-hour youtube ***, deliberately overdriving the input. Carefully observe symptoms - even if they may be unrelated to your initial complaint or any theory you might have formulated, deviation of any form from what is "normal" *must* be explained.

Try to catch the bug. It may be that the symptoms indicate the bug has happened long before the symptoms occur, but try to halt execution as soon as observable symptoms occur. For example, if stack overflow is suspected to be the root cause, then using the DEADBEEF method this theory can be safely disproved even at a late catch if some DEADBEEF remains (it can't be safely confirmed, but suspicion certainly increases in that direction if there's no DEADBEEF left).
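The DEADBEEF method mentioned here can be sketched roughly as follows: fill the stack area with a known pattern before it is used, then later count how many pattern words survive at the far end. This is only an illustration on a plain array; `stack_paint` and `stack_unused_words` are hypothetical names, and it assumes a stack that grows downward (as on Cortex-M), so the lowest addresses are touched last.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xDEADBEEFu

/* Paint the whole stack region with the fill pattern before first use. */
static void stack_paint(uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    for (size_t i = 0; i < words; i++)
        stack[i] = STACK_FILL;
}

/* Count untouched words starting from the lowest address.
 * A result of 0 does not prove an overflow, but makes it a strong suspect. */
static size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;
    assert(stack != NULL);
    while (n < words && stack[n] == STACK_FILL)
        n++;
    return n;
}
```

On the target, the array would be the real stack region from the linker script or the task's stack buffer. FreeRTOS offers a comparable built-in check through uxTaskGetStackHighWaterMark(), which relies on its own fill byte.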

Then try carefully crafted changes, observe the behaviour. Understand, what the changes really mean. Try to make the changes very local - this is hard to do I know. Make sure you can always return to the reproducible state. It may help to take notes at this point.

Formulate theories, then devise experiments to prove or disprove them. You suspect stack overflow? How could an overflow be caught? It's a write to an address beyond the last address allocated to the stack, so what about using the on-chip debugger's data breakpoint facility; or, if desperate, the MPU? Or, maybe a simpler

You may also try the divide et impera method I mentioned - omitting whole blocks of program. Sometimes it gets obvious what's wrong. Not often; but hey, this is the Real Bug, so there's little to lose. Still make sure you can return to the reproducible state.

Sleep well. Discuss the problem with your colleagues, with the cleaning lady, with your partner at home, with your teddy bear.

Be persistent. When the Real Bug creeps in, deadlines are void. Getting a product out of the door knowing there's a bug - well, that gives "deadline" its real meaning. If a manager pops in demanding progress, explain the problem to him in great technical detail (and don't let him slip out), and then ask him to hold an oscilloscope probe - yes, ON the PIN 57 of LQFP176, NOT on the pad! - while you finally make your software experiments with some comfort. An hour or two should suffice.

All this takes years to get good at. And then, one day, comes a Real Real Bug, and I - with all those years and knowledge and experience - then look as a complete a****ole anyway... ;)

JW

turboscrew · ‎2019-06-11

That text should be put in the beginning of every book about learning embedded development.

turboscrew · ‎2019-06-11

I still think it's a good idea to check the fault registers and the exception stack frame, if it is a synchronous fault.
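As a rough illustration of what reading the exception stack frame involves: on exception entry an ARMv7-M core pushes eight words (R0-R3, R12, LR, PC, xPSR) onto the stack that was active, and bit 2 of the EXC_RETURN value in LR tells which stack that was. The sketch below works on a plain word array so it can be tried on a PC; the names `exc_frame_t`, `read_exc_frame` and `frame_is_on_psp` are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Basic 8-word frame pushed by the core on exception entry (ARMv7-M). */
typedef struct {
    uint32_t r0, r1, r2, r3, r12;
    uint32_t lr;    /* LR of the interrupted code                         */
    uint32_t pc;    /* return address; for a synchronous fault this is
                       typically the faulting instruction                 */
    uint32_t xpsr;
} exc_frame_t;

/* Bit 2 of EXC_RETURN: 1 = frame is on the process stack (PSP), 0 = MSP. */
static int frame_is_on_psp(uint32_t exc_return)
{
    return (exc_return & 0x4u) != 0u;
}

/* Copy the stacked registers out of memory starting at the stacked SP. */
static exc_frame_t read_exc_frame(const uint32_t *sp)
{
    exc_frame_t f;
    assert(sp != NULL);
    f.r0 = sp[0];  f.r1 = sp[1];  f.r2 = sp[2];  f.r3 = sp[3];
    f.r12 = sp[4]; f.lr = sp[5];  f.pc = sp[6];  f.xpsr = sp[7];
    return f;
}
```

In a real fault handler the pointer would come from MSP or PSP, selected with `frame_is_on_psp(lr)`, before any further stacking by C code; that part needs a few lines of assembly and is left out here.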

Jukka Lamminmäki · 2019-06-12

It is somewhat relieving for us to learn that someone else has exactly the same problems. If you @pprovencher ever find out the reason for your problems, please post it here.

Jukka Lamminmäki · 2019-06-12

We spent yesterday examining the faulty build whose exception fault registers I already posted on June 10, 2019.

First we thought that the odd value in LR was the reason for the exception, and it took a while for us to understand that some library functions use the register for their own purposes and the actual return address is popped from the stack.

We also learned that the stack had not overflowed - as it never had in the earlier exceptions either. There were no interrupts active at the exception point.

What we learned - with this specific build - is that the line where the exception occurred is the following:

90001ed6: blx r7 // atof call

90001ed8: ldr r6, [pc, #64] ; (0x90001f1c <MC60E_Gnss::ParseGGA(char const*, unsigned long)+228>)

90001eda: vmov r0, r1, d0

90001ede: blx r6

and the usage fault reason for the exception is: Attempt to execute a coprocessor instruction (NOCP)

and this is the line - which indeed seems to be a coprocessor instruction - that has already been run several times before the exception occurs. It has also been calculating the correct floating point values. Stack and register values seem to be correct at the exception point.

We also have a very different failing build, which we continue to examine today.

Jukka Lamminmäki · 2019-06-12

Thanks JW for the supportive thoughts. I've been in the business for thirty years and still occasionally encounter problems that seem to challenge everything one has ever learnt. But all the bugs have been solved - at least as far as I remember - or they could be attributed to the hardware.

pprovencher · ‎2019-06-13

Thanks JW! With your insight and other forum posts, we managed to find what our hardfault problem is. It is related to a register getting a wrong RAM or stack address which is not in the range of the memory, so the hardfault is thrown! Now we need to find why it is happening... I feel that the problem that is NOT causing a hardfault is caused by a variable being overwritten on the stack. I'll try the DEADBEEF method to find out! Thanks again

waclawek.jan · ‎2019-06-13

> and the usage fault reason for the exception is: Attempt to execute a coprocessor instruction (NOCP)

Can you reproduce this fault? If yes, can you then read out the content of SCB_CPACR?

JW

turboscrew · ‎2019-06-13

That's why I pushed to check the CFSR.
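To make the CFSR check concrete: on ARMv7-M the Configurable Fault Status Register sits at 0xE000ED28 and packs the MemManage, BusFault and UsageFault status bits into one word, with NOCP at bit 19. The following is a small decoding sketch, runnable on a PC; the table and the function name `cfsr_describe` are made up for this example and are not taken from CMSIS or any vendor library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { uint32_t mask; const char *name; } cfsr_flag_t;

/* ARMv7-M CFSR bits: MMFSR [7:0], BFSR [15:8], UFSR [31:16]. */
static const cfsr_flag_t CFSR_FLAGS[] = {
    { 1u << 0,  "IACCVIOL" },  { 1u << 1,  "DACCVIOL" },
    { 1u << 3,  "MUNSTKERR" }, { 1u << 4,  "MSTKERR" },
    { 1u << 5,  "MLSPERR" },   { 1u << 7,  "MMARVALID" },
    { 1u << 8,  "IBUSERR" },   { 1u << 9,  "PRECISERR" },
    { 1u << 10, "IMPRECISERR" }, { 1u << 11, "UNSTKERR" },
    { 1u << 12, "STKERR" },    { 1u << 13, "LSPERR" },
    { 1u << 15, "BFARVALID" },
    { 1u << 16, "UNDEFINSTR" }, { 1u << 17, "INVSTATE" },
    { 1u << 18, "INVPC" },     { 1u << 19, "NOCP" },
    { 1u << 24, "UNALIGNED" }, { 1u << 25, "DIVBYZERO" },
};

/* Write the space-separated names of all set flags into buf.
 * Returns the number of flags that were set. */
static unsigned cfsr_describe(uint32_t cfsr, char *buf, size_t len)
{
    unsigned count = 0;
    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof CFSR_FLAGS / sizeof CFSR_FLAGS[0]; i++) {
        if (cfsr & CFSR_FLAGS[i].mask) {
            size_t used = strlen(buf);
            snprintf(buf + used, len - used, "%s%s",
                     count ? " " : "", CFSR_FLAGS[i].name);
            count++;
        }
    }
    return count;
}
```

Fed with a value read in the debugger, a CFSR of 0x00080000 would come out as "NOCP", which matches the usage fault reason reported above.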

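The CPACR that Jan Waclawek mentioned has a two-bit field per coprocessor controlling access to it. The FPU is implemented as coprocessors 10 and 11, and if access to them is denied, a floating point instruction raises the NOCP usage fault.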
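A quick way to interpret a CPACR value read out in the debugger (the register is at 0xE000ED88 on ARMv7-M) is sketched below. Each coprocessor has a two-bit access field; CP10 occupies bits 21:20 and CP11 bits 23:22. The helper names `cpacr_field` and `fpu_fully_enabled` are hypothetical, and the code runs on a PC with the value passed in as a plain number.

```c
#include <assert.h>
#include <stdint.h>

/* Access field for coprocessor cp:
 * 0 = access denied (an FPU instruction then raises NOCP),
 * 1 = privileged access only, 2 = reserved, 3 = full access.
 * Only CP10/CP11 matter for the FPU; other fields may be reserved. */
static unsigned cpacr_field(uint32_t cpacr, unsigned cp)
{
    assert(cp < 16u);
    return (unsigned)((cpacr >> (2u * cp)) & 0x3u);
}

/* True when both FPU coprocessors have full access. */
static int fpu_fully_enabled(uint32_t cpacr)
{
    return cpacr_field(cpacr, 10u) == 3u && cpacr_field(cpacr, 11u) == 3u;
}
```

With the usual startup setting of 0x00F00000 the check returns true. If a value captured at the crash point showed anything else in bits 23:20, that would suggest something had rewritten CPACR at run time, which would be one candidate explanation for a NOCP fault on an instruction that worked before.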

You might also want to check FPCCR.

BTW, when you have problems with the core (as opposed to peripherals), the Arm manuals are the place to look. In your case: https://static.docs.arm.com/ddi0403/eb/DDI0403E_B_armv7m_arm.pdf

Jukka Lamminmäki · 2019-06-14

Here are some register values from the crashing point:

To us they seem to be quite normal - though our understanding of this area is not at a very high (if any) level.

In our compiler settings the FPU is FPv5-SP-D16 and - as said before - the calculations are performed successfully and correctly until this crash occurs.