2019-06-06 06:06 AM
We are struggling with a mysterious bug in our STM32F730 project; even the bug itself is not easy to explain.
We use FreeRTOS, and our code runs XIP from a QSPI flash (SST26VF016B).
The IDE is Atollic, version 9.1.0.
The problem is that the application crashes into the hard fault handler, but not always.
We can sometimes build a binary which runs perfectly for hours, or "forever". But if we add a line of code somewhere, the binary crashes after a random time - seconds, minutes or hours - and not even near the line where the code was added.
When debugging the problem, one build always crashes at about the same place, while another build crashes somewhere completely different. There seems to be nothing in common between these places.
With the debugger we can read the LR, SP and PC values at the point where the hard fault occurs, but that is not helpful: they point to code in which there is nothing wrong and which had already run thousands of times before suddenly crashing.
The debug trace usually shows "Signal handler called" at address 0xFFFFFFF1 or 0xFFFFFFFD. The problem seems to be somehow asynchronous, perhaps related to interrupts.
We do have a simple test build using the same drivers, and it has never crashed, although it uses the same interfaces: SPI, three UARTs, ADC, and floating-point calculations.
We have tried software and hardware floating point, different QSPI speeds and different compiler options.
What we would need is some trace of the code flow just before the hard fault occurs, but we can't put a breakpoint anywhere in the code.
Does anybody have hints on how to track down the bug? Has anybody encountered similar problems?
2019-06-06 07:45 AM
You'd have to pull the processor registers off the stack, and also look at the ones that aren't stacked.
If the faulting address is consistent, you need to look very carefully at the instructions immediately prior, in the context of the register values.
64-bit reads (LDRD), or load-multiple instructions (LDM), will fault on unaligned addresses.
Have your code use a hard fault handler that outputs diagnostic information to the console; I've posted code examples for this dozens of times.
Have your other code output telemetry and flow information, perhaps enabled by a memory variable so you can turn it on/off.
The hard fault will typically have a magic return address (i.e. LR = 0xFFFFFFFD), which causes the processor to unwind the context it pushed onto the stack for the event/fault. Review a manual on the core if this concept is unfamiliar to you.
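As a concrete illustration (a minimal sketch, not the exact code from my earlier posts), the function below decodes the exception frame the core stacks automatically. The function name and the console output are my own choices; adapt them to your board:

```c
#include <stdint.h>
#include <stdio.h>

/* Offsets of the registers a Cortex-M core pushes automatically on
   exception entry: r0, r1, r2, r3, r12, lr, pc, xPSR. */
enum { FRAME_R0, FRAME_R1, FRAME_R2, FRAME_R3,
       FRAME_R12, FRAME_LR, FRAME_PC, FRAME_XPSR };

/* Dump the stacked context. On the target this is called from a naked
   HardFault_Handler trampoline that passes the active stack pointer:
       tst   lr, #4        ; bit 2 of EXC_RETURN selects MSP vs PSP
       ite   eq
       mrseq r0, msp
       mrsne r0, psp
       b     hard_fault_report
*/
void hard_fault_report(const uint32_t *frame)
{
    printf("HardFault at pc=%08lx lr=%08lx xpsr=%08lx\n",
           (unsigned long)frame[FRAME_PC],
           (unsigned long)frame[FRAME_LR],
           (unsigned long)frame[FRAME_XPSR]);
    printf("r0=%08lx r1=%08lx r2=%08lx r3=%08lx r12=%08lx\n",
           (unsigned long)frame[FRAME_R0],
           (unsigned long)frame[FRAME_R1],
           (unsigned long)frame[FRAME_R2],
           (unsigned long)frame[FRAME_R3],
           (unsigned long)frame[FRAME_R12]);
}
```

The stacked PC is usually the faulting instruction (or a later one for imprecise bus faults); on the M7 also print SCB->CFSR and SCB->BFAR to classify the fault.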
2019-06-06 11:09 AM
> We do have the simple test software build with the same drivers and it has never crashed
Was this running from the QSPI too?
> We use FreeRTOS
Have you analyzed the stacks for overflows?
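For reference, FreeRTOS paints each task stack with a fill pattern at creation, and uxTaskGetStackHighWaterMark() reports how much of that paint was never touched (requires INCLUDE_uxTaskGetStackHighWaterMark set to 1; also consider configCHECK_FOR_STACK_OVERFLOW set to 2 with a vApplicationStackOverflowHook). A host-side sketch of the same idea - the 0xA5 fill byte is an assumption matching the FreeRTOS default:

```c
#include <stdint.h>
#include <stddef.h>

#define STACK_FILL_BYTE 0xA5u  /* assumed FreeRTOS stack-paint byte */

/* Count untouched bytes from the far end of a descending task stack and
   return the headroom in whole 32-bit words - the same idea as
   uxTaskGetStackHighWaterMark(). A result near zero means the task came
   very close to overflowing its stack. */
size_t stack_headroom_words(const uint8_t *stack_low, size_t stack_bytes)
{
    size_t n = 0;
    while (n < stack_bytes && stack_low[n] == STACK_FILL_BYTE)
        n++;                 /* still painted, never written by the task */
    return n / 4;
}
```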
JW
2019-06-06 10:12 PM
The test software also runs from the QSPI flash, and the stacks are never overflowed.
2019-06-07 01:31 AM
For one thing, using the FPU in interrupts has certain implications (lazy stacking ?).
Have you tried a heat gun to check for a relation to temperature, or reducing the clock frequency?
If reducing the clock helps, it might be the flash settings.
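Lazy stacking can also be ruled out experimentally: with it disabled, the core saves the FP registers eagerly on every exception entry instead of deferring the save. A sketch of the relevant FPCCR bits; the helper function is only an illustration, on the target you would simply write FPU->FPCCR:

```c
#include <stdint.h>

/* Cortex-M4/M7 FPCCR bits controlling FP context saving on exceptions */
#define FPCCR_ASPEN (1u << 31)  /* automatic FP state preservation */
#define FPCCR_LSPEN (1u << 30)  /* lazy stacking of the FP context */

/* Return the FPCCR value with lazy stacking disabled, e.g. for the
   experiment: FPU->FPCCR = fpccr_disable_lazy_stacking(FPU->FPCCR); */
uint32_t fpccr_disable_lazy_stacking(uint32_t fpccr)
{
    return fpccr & ~FPCCR_LSPEN;
}
```

If the crashes change character with lazy stacking off, an interrupt path touching FP state is a good suspect.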
2019-06-07 02:19 AM
We have kept the interrupts simple; no FPU is used in them.
External conditions seem not to be the issue: a bad build crashes at the same spot on different devices. One build may last e.g. 15 minutes, whether at room temperature or in Finnish spring temperatures near zero degrees C - though I must say that today it is about +30 C outside.
We have tried different QSPI prescalers with no effect.
2019-06-07 02:36 AM
Reduced clock speed ?
Reduced optimization settings ?
Could a "stable build" be provoked into a hard fault at elevated temperature?
Does elevated temperature have an effect on the "MTB-hardfaults"?
Just some ideas, since a systematic approach seems to lead nowhere...
2019-06-07 03:04 AM
Is the D-cache used? If so, try disabling it and see if that changes anything. Note that this also requires not calling the relevant clean/invalidate functions, if those are used in the code.
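If you re-enable the D-cache later, the classic failure mode is a DMA buffer sharing a cache line with other data, so that clean/invalidate corrupts the neighbours. A small host-testable check of the alignment rule (32 bytes is the Cortex-M7 line size); on target you would pair this with SCB_CleanDCache_by_Addr() before DMA reads a buffer and SCB_InvalidateDCache_by_Addr() after DMA writes one:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define DCACHE_LINE 32u  /* Cortex-M7 data-cache line size in bytes */

/* A buffer touched by DMA with the D-cache enabled should start on a
   cache-line boundary and occupy whole lines; otherwise clean/invalidate
   operations also hit unrelated data sharing the first or last line. */
bool dma_buffer_cache_safe(uintptr_t addr, size_t len)
{
    return (addr % DCACHE_LINE == 0) && (len % DCACHE_LINE == 0);
}
```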
2019-06-07 05:33 AM
Compiler optimization definitely affects the behaviour, but it doesn't remove the problem: changing the optimization e.g. from size to speed may produce a good, stable build, but then adding a line of code somewhere can break the build again.
Environmental effects are worth trying, though.
2019-06-07 05:34 AM
We have only enabled the I-cache; we never got the D-cache working properly.