Random hardfault bug with STM32F730

Jukka Lamminmäki · 2019-06-06

We are struggling with a mysterious bug in our STM32F730 project; even the bug itself is not easy to explain.

We use FreeRTOS, and our code runs XIP (execute in place) from an SST26VF016B QSPI flash.

The IDE is Atollic, version 9.1.0.

The problem is that the application ends up in the hardfault handler, but not always.

Sometimes we get a build that runs perfectly for hours or "forever". But if we add a line of code at some random place, the binary crashes after a random time (seconds, minutes or hours), and not even near the line where the code was added.

When debugging the problem, one build always crashes at about the same place, and another build crashes in a totally different place. There seems to be nothing in common between these places.

With the debugger we can get the LR, SP and PC values for the place where the hardfault occurs, but that does not help: they point to code in which nothing is wrong and which has already run thousands of times before suddenly crashing.

The debug trace usually shows the last "signal handler called" entry at address 0xFFFFFFF1 or 0xFFFFFFFD. The problem seems to be somehow asynchronous, maybe related to interrupts.
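The two addresses in the trace are not code addresses but Cortex-M EXC_RETURN values, which the core places in LR on exception entry. A minimal sketch of how they decode, assuming the ARMv7-M bit layout (bit 2 selects the stack pointer, bit 3 the mode, bit 4 the frame type); the type and function names are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: what an EXC_RETURN value says about the
 * context that was interrupted. */
typedef struct {
    bool valid;        /* value has the EXC_RETURN bit pattern        */
    bool thread_mode;  /* true: Thread mode, false: Handler mode      */
    bool uses_psp;     /* true: frame is on PSP, false: on MSP        */
    bool fp_frame;     /* true: extended frame with FPU registers     */
} exc_return_info_t;

exc_return_info_t decode_exc_return(uint32_t lr)
{
    exc_return_info_t info = { false, false, false, false };

    /* Bits [31:5] of an EXC_RETURN value are all ones. */
    if ((lr & 0xFFFFFFE0u) != 0xFFFFFFE0u) {
        return info;
    }
    info.valid       = true;
    info.thread_mode = (lr & (1u << 3)) != 0u;
    info.uses_psp    = (lr & (1u << 2)) != 0u;
    info.fp_frame    = (lr & (1u << 4)) == 0u;  /* bit 4 clear = FP state stacked */
    return info;
}
```

Under that layout, 0xFFFFFFFD means the fault interrupted Thread mode running on PSP (with the usual FreeRTOS Cortex-M ports, a task), while 0xFFFFFFF1 means it interrupted Handler mode on MSP, that is, another exception or interrupt handler.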

We do have a simple test program built with the same drivers, and it has never crashed, although it uses the same interfaces: SPI, three UARTs, A/D, floating-point calculations.

We have tried software and hardware floating point, different QSPI speeds and different compiler options.

What we need is some trace of the code flow just before the hardfault occurs, but there is no obvious place in the code to put a breakpoint.
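One breakpoint-free way to get such a trace is a small ring buffer in RAM that tasks and interrupt handlers write to, and that is inspected with the debugger after the fault. The following is only a sketch: `trace_log`, `trace_get`, `TRACE_DEPTH` and the event layout are hypothetical names, and on the real target the index update would need to be made safe against interrupts (for example with a short critical section).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_DEPTH 64u  /* number of events kept; must be a power of two */

typedef struct {
    uint32_t id;   /* which trace point was hit (hypothetical numbering) */
    uint32_t arg;  /* free-form payload, e.g. an IRQ number or a pointer */
} trace_event_t;

static volatile trace_event_t trace_buf[TRACE_DEPTH];
static volatile uint32_t trace_head;  /* total number of events logged */

/* Record one event, overwriting the oldest one when the buffer is full.
 * NOTE: not interrupt-safe as written; this sketch assumes one writer. */
void trace_log(uint32_t id, uint32_t arg)
{
    uint32_t slot = trace_head & (TRACE_DEPTH - 1u);
    trace_buf[slot].id  = id;
    trace_buf[slot].arg = arg;
    trace_head = trace_head + 1u;
}

/* Fetch an event by age: 0 is the newest, 1 the one before it, and so on.
 * Returns 1 on success, 0 if no event of that age is stored.
 * (Ignores wrap-around of trace_head after 2^32 events.) */
int trace_get(uint32_t age, trace_event_t *out)
{
    assert(out != NULL);
    uint32_t count = trace_head;
    if (age >= TRACE_DEPTH || age >= count) {
        return 0;
    }
    uint32_t slot = (count - 1u - age) & (TRACE_DEPTH - 1u);
    out->id  = trace_buf[slot].id;
    out->arg = trace_buf[slot].arg;
    return 1;
}
```

Calling `trace_log()` at the entry of each interrupt handler and at a few points in each task would leave the last 64 events in `trace_buf` when the hardfault handler is reached.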

Does anybody have any hints on how to track down the bug? Has anybody encountered similar problems?

pprovencher · ‎2019-06-14

Thanks to DEADBEEF, I found the problem (not a hardfault problem, though). It was obviously really simple once you find it! I passed a 16-bit variable by reference to a FatFs function that expects a 32-bit one!!! FatFs was initializing the variable with 0 and, in the process, overwriting another 16-bit variable right next to it on the stack.
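The effect described above can be reproduced on a host without undefined behaviour by placing the two 16-bit variables in one struct. This is a sketch only: `report_count_u32` is a hypothetical stand-in for the library call, not a FatFs function, and `locals_t` models two neighbouring stack variables.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Models two 16-bit locals that the compiler happened to place side by side. */
typedef struct {
    uint16_t bytes_read;  /* the variable passed by reference */
    uint16_t neighbour;   /* the innocent variable next to it */
} locals_t;

static_assert(sizeof(locals_t) == 4, "sketch assumes no padding");

/* Hypothetical stand-in for a library call that writes a 32-bit result. */
static void report_count_u32(void *out, uint32_t count)
{
    memcpy(out, &count, sizeof count);  /* always writes 4 bytes */
}

/* Wrong: only 2 bytes were meant to receive the result, so the other
 * 2 bytes of the write land in the neighbour. */
uint16_t neighbour_after_bad_call(void)
{
    locals_t locals = { .bytes_read = 0xAAAAu, .neighbour = 0x1234u };
    report_count_u32(&locals, 0u);  /* &locals is also the address of bytes_read */
    return locals.neighbour;
}

/* Right: the receiving variable is as wide as the callee expects. */
uint16_t neighbour_after_good_call(void)
{
    uint32_t count = 0xAAAAAAAAu;
    uint16_t neighbour = 0x1234u;
    report_count_u32(&count, 0u);
    return neighbour;
}
```

In real code the mismatch usually hides behind a pointer cast, and which variable gets hit depends on how the compiler lays out the stack, which is why unrelated code changes can move the symptom.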

This whole adventure forced me to better understand my debugging tools and the STM32 universe. It also showed me how much a small change in the code can alter the way the code executes and where variables are placed on the stack, for example. Overall, I'm happy that I hit this bug at this stage of my project, so I can be quicker to nail the next ones.

Thanks guys for your support.

Sorry @Jukka Lamminmäki, I don't think our bugs are related after all! Hopefully you will nail it soon!

Piranha · ‎2019-06-14

PC points to the line:

float hdop = atof(data);

---

The question is still valid - are data and other string variables in all places guaranteed to always be zero-terminated, when passing them to strchr and other string processing functions? :)
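One way to make that guarantee explicit is to check for the terminator inside the buffer bounds before handing the buffer to `atof` or similar. A minimal sketch, assuming the field lives in a buffer of known size; `parse_float_field` is a hypothetical helper, and `strtof` is used instead of `atof` so that a failed conversion can be detected:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Convert a text field to float only if it is provably NUL-terminated
 * within its buffer. Returns false without touching *out otherwise. */
bool parse_float_field(const char *buf, size_t buf_len, float *out)
{
    if (buf == NULL || out == NULL) {
        return false;
    }
    /* No terminator inside the buffer: string functions would run off the end. */
    if (memchr(buf, '\0', buf_len) == NULL) {
        return false;
    }
    char *end = NULL;
    float value = strtof(buf, &end);
    if (end == buf) {
        return false;  /* no digits were converted */
    }
    *out = value;
    return true;
}
```

With this in place, a line such as `float hdop = atof(data);` would become a call that can fail visibly instead of reading past the buffer.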

waclawek.jan · ‎2019-06-15

I see no reason for the NOCP fault, as both CP10 and CP11 are enabled.

At this point, I'd start to suspect hardware. Does the fault occur on multiple instances of the hardware? Is the power supply rock solid, as observed directly on the power pins? What's the voltage, and can it be changed to slightly lower/higher? Are all supply and ground pins connected properly? Have all pins been checked for bad solder joints? Have the decoupling capacitors, especially VCAP, been scrutinized?

JW

turboscrew · ‎2019-06-15

Also, it might reveal something if you disassembled some code and searched for coprocessor accesses (CDP or CDP2). Maybe some other coprocessor gets accessed. Shouldn't, but then again this fault shouldn't happen in the first place. Maybe - just maybe - there is a bug in the compiler or compiler setup.

Maybe the execution gets into some literal pool or something...
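For reference, the coprocessor instruction space in 32-bit Thumb encodings can be recognized from the first halfword, and the coprocessor number sits in bits [11:8] of the second. The sketch below assumes the ARMv7-M encoding layout; the function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if hw1 is the first halfword of a 32-bit Thumb instruction in the
 * coprocessor space (CDP, MCR/MRC, MCRR/MRRC, LDC/STC and the FPU
 * instructions that share it): bits 15..13 = 111, bits 11..10 = 11. */
bool thumb2_is_coproc(uint16_t hw1)
{
    return (hw1 & 0xEC00u) == 0xEC00u;
}

/* Coprocessor number encoded in the second halfword (bits [11:8]). */
unsigned thumb2_coproc_num(uint16_t hw2)
{
    return (unsigned)((hw2 >> 8) & 0xFu);
}
```

Applied to the encodings quoted elsewhere in this thread, `ec51 0b10` (`vmov r0, r1, d0`) falls in the coprocessor space with number 11, one of the two FPU coprocessors, while `f000 80c8` (`beq.w`) does not. A scan over a raw binary can also hit literal pools, so matches are worth checking against the disassembly.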

Piranha · ‎2019-06-15

One more idea. Maybe unaligned stack like here:

https://community.st.com/s/question/0D50X0000AldaPzSQI/cubeide-sprintf-does-not-work-with-f

turboscrew · ‎2019-06-16

It's hard to see that causing a NOCP usage fault, however.

Jukka Lamminmäki · 2019-06-16

We don't believe it is the hardware; to us it seems to be all about the builds: a badly behaving build crashes on all the devices, and a good build is stable on all the devices.

Jukka Lamminmäki · 2019-06-16

Truly glad that you solved your problem, thanks for posting the solution here. We may have a similar fault deep down, but we have been unable to spot it yet.

Jukka Lamminmäki · 2019-06-17

We may have found something: we are now investigating two builds that crash differently and show different hardfault reasons.

build1: 90002704: ec51 0b10 vmov r0, r1, d0 ; this shows the NOCP reason

build2: 900025be: f000 80c8 beq.w 90002752 ; this shows the reason "attempt to perform an unaligned access (UNALIGNED)"

Neither of them makes sense, since both lines have run successfully before, and the jump address in the latter case is perfectly correct.

What both lines have in common, however, is that the Thumb instructions involved need two clock cycles to run.

Could there be some rare occurrence of a read failure from the flash?

However, there are hundreds of instructions of this kind in the code, so why does this always happen in exactly the same place in a given build?
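For readers following along, the reasons named above (NOCP, UNALIGNED) are bits in the Cortex-M Configurable Fault Status Register. A minimal decoding sketch, assuming the ARMv7-M CFSR bit positions and covering only a subset of the flags; `decode_cfsr` and the table are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t    mask;
    const char *name;
} cfsr_flag_t;

/* Subset of CFSR flags (MemManage bits 0..7, BusFault 8..15, UsageFault 16..31). */
static const cfsr_flag_t cfsr_flags[] = {
    { 1u << 0,  "IACCVIOL"    },
    { 1u << 1,  "DACCVIOL"    },
    { 1u << 8,  "IBUSERR"     },
    { 1u << 9,  "PRECISERR"   },
    { 1u << 10, "IMPRECISERR" },
    { 1u << 16, "UNDEFINSTR"  },
    { 1u << 17, "INVSTATE"    },
    { 1u << 18, "INVPC"       },
    { 1u << 19, "NOCP"        },
    { 1u << 24, "UNALIGNED"   },
    { 1u << 25, "DIVBYZERO"   },
};

/* Write the names of the flags set in cfsr into names[] (at most max_names)
 * and return how many were written. */
size_t decode_cfsr(uint32_t cfsr, const char **names, size_t max_names)
{
    assert(names != NULL);
    size_t n = 0;
    for (size_t i = 0; i < sizeof cfsr_flags / sizeof cfsr_flags[0] && n < max_names; i++) {
        if ((cfsr & cfsr_flags[i].mask) != 0u) {
            names[n++] = cfsr_flags[i].name;
        }
    }
    return n;
}
```

On the target, the value would typically be read from `SCB->CFSR` (CMSIS) inside the fault handler and stored somewhere the debugger can see it, together with the stacked PC.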

Jukka Lamminmäki · 2019-06-17

Just rechecked the data and string variables used in the crashing thread: they are always NUL-terminated; the collecting functions can't exit without setting the terminating NUL.