Random hardfault bug with STM32F730

Jukka Lamminmäki · 2019-06-06

We are struggling with a mysterious bug in our STM32F730 project; even the bug itself is not easy to explain.

We use FreeRTOS, and our code runs XIP (execute in place) from an SST26VF016B QSPI flash.

The IDE is Atollic, version 9.1.0.

The problem is that the application crashes into the hardfault vector, but not always.

Sometimes we can build a binary that runs perfectly for hours or "forever". But if we add a line of code at some random place, the binary crashes after a random time (seconds, minutes or hours), and nowhere near the line where the code was added.

When debugging the problem, a given build always crashes at about the same place, while another build crashes in a totally different place. There seems to be nothing in common between these places.

With the debugger we can get the LR, SP and PC values at the point where the hardfault occurs, but that is not helpful: they point to code in which there is nothing wrong and which has already run thousands of times before suddenly crashing.

The debug trace usually shows the last "signal handler called" frame at address 0xFFFFFFF1 or 0xFFFFFFFD. The problem seems to be somehow asynchronous, maybe related to interrupts.
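The two addresses above are not code addresses but Cortex-M EXC_RETURN values, and they can be decoded. The following is a minimal sketch, assuming the ARMv7-M bit layout (bit 2 selects the stack, bit 3 the return mode, bit 4 the frame type); the struct and function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of a Cortex-M (ARMv7-M) EXC_RETURN value, i.e. the special
 * LR value present while an exception handler runs.
 * Type and function names are illustrative, not from any vendor library. */
typedef struct {
    bool valid;          /* bits [31:5] are all ones, as EXC_RETURN requires */
    bool to_thread_mode; /* true: returns to Thread mode; false: Handler mode */
    bool uses_psp;       /* true: frame stacked on PSP; false: on MSP */
    bool fpu_frame;      /* true: extended frame with FPU registers */
} exc_return_info;

exc_return_info decode_exc_return(uint32_t lr)
{
    exc_return_info info = { false, false, false, false };

    if ((lr & 0xFFFFFFE0u) != 0xFFFFFFE0u) {
        return info; /* ordinary return address, not an EXC_RETURN value */
    }
    info.valid = true;
    info.to_thread_mode = (lr & (1u << 3)) != 0u;
    info.uses_psp = (lr & (1u << 2)) != 0u;
    info.fpu_frame = (lr & (1u << 4)) == 0u; /* bit 4 clear means FPU frame */
    return info;
}
```

Under these assumptions, 0xFFFFFFFD means the fault interrupted Thread mode code running on the process stack (a FreeRTOS task), while 0xFFFFFFF1 means the fault was taken while another handler (an ISR, for example) was already running on the main stack.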

We do have a simple test program built with the same drivers, and it has never crashed, although it uses the same interfaces: SPI, three UARTs, ADC, floating-point calculations.

We have tried software and hardware floating point, different QSPI speeds and different compiler options.

What we need is a trace of the code flow just before the hardfault occurs, but we can't put a breakpoint anywhere in the code.
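Short of a full instruction trace, one common first step is to record the fault status at the moment of the crash. The sketch below is host-runnable and only decodes a value that, on the target, would be read from `SCB->CFSR` inside the fault handler. The bit positions follow the ARMv7-M Configurable Fault Status Register; the table and the function name `cfsr_report` are illustrative assumptions, not part of any HAL.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One CFSR flag: bit mask plus a short human-readable description. */
typedef struct {
    uint32_t mask;
    const char *name;
} cfsr_flag;

/* Subset of the ARMv7-M CFSR bits (MemManage, BusFault, UsageFault). */
static const cfsr_flag cfsr_flags[] = {
    { 1u << 0,  "IACCVIOL: instruction access violation" },
    { 1u << 1,  "DACCVIOL: data access violation" },
    { 1u << 8,  "IBUSERR: bus error on instruction fetch" },
    { 1u << 9,  "PRECISERR: precise data bus error" },
    { 1u << 10, "IMPRECISERR: imprecise data bus error" },
    { 1u << 11, "UNSTKERR: bus error while unstacking" },
    { 1u << 12, "STKERR: bus error while stacking" },
    { 1u << 16, "UNDEFINSTR: undefined instruction" },
    { 1u << 17, "INVSTATE: invalid execution state" },
    { 1u << 18, "INVPC: invalid PC load / EXC_RETURN" },
    { 1u << 19, "NOCP: coprocessor access not permitted" },
    { 1u << 24, "UNALIGNED: unaligned access" },
    { 1u << 25, "DIVBYZERO: divide by zero" },
};

/* Prints every flag set in cfsr and returns how many were recognised. */
unsigned cfsr_report(uint32_t cfsr, FILE *out)
{
    unsigned hits = 0u;
    for (size_t i = 0; i < sizeof cfsr_flags / sizeof cfsr_flags[0]; ++i) {
        if ((cfsr & cfsr_flags[i].mask) != 0u) {
            fprintf(out, "%s\n", cfsr_flags[i].name);
            ++hits;
        }
    }
    return hits;
}
```

Logging this value together with the stacked PC for several crashes of the same build may show whether the faults are of one kind (for example undefined instructions) or unrelated to each other.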

Does anybody have any hints on how to track down the bug? Has anybody encountered similar problems?

Jukka Lamminmäki · 2019-06-18

Some news, good ones maybe:

A month or two ago we tried reducing the QSPI speed by increasing the QSPIHandle.Init.ClockPrescaler value from 2 to 8, with no effect on the bug.

Yesterday's findings made us think that the problem must be somehow HW-related, so we increased the prescaler value to 16.

And after that the application stopped crashing.

The original QSPI speed was 72 MHz; with prescaler value 16 it is 12.7 MHz. The memory chip is specified for operation at up to 104 MHz.
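The figures above follow from the QUADSPI clock formula, clock = kernel clock / (ClockPrescaler + 1). A minimal sketch of that arithmetic is given below; the 216 MHz kernel clock is an assumption inferred from 72 MHz at prescaler 2, and both function names are hypothetical helpers, not HAL calls.

```c
#include <stdint.h>

/* QUADSPI clock for a given kernel clock and ClockPrescaler value:
 * f_qspi = f_kernel / (prescaler + 1). */
uint32_t qspi_clock_hz(uint32_t kernel_clock_hz, uint32_t clock_prescaler)
{
    return kernel_clock_hz / (clock_prescaler + 1u);
}

/* Smallest prescaler (0..255) that keeps the QSPI clock at or below
 * max_hz. Returns 255 if max_hz is 0 or cannot be met. */
uint32_t qspi_min_prescaler(uint32_t kernel_clock_hz, uint32_t max_hz)
{
    if (max_hz == 0u) {
        return 255u;
    }
    if (kernel_clock_hz <= max_hz) {
        return 0u;
    }
    /* Ceiling division gives the divider; the register holds divider - 1. */
    uint32_t divider = kernel_clock_hz / max_hz
                     + ((kernel_clock_hz % max_hz) != 0u ? 1u : 0u);
    uint32_t prescaler = divider - 1u;
    return prescaler > 255u ? 255u : prescaler;
}
```

With the assumed 216 MHz kernel clock, prescaler 2 gives 72 MHz, prescaler 8 gives 24 MHz and prescaler 16 gives about 12.7 MHz, so the crashes persisted at 24 MHz and stopped only at roughly an eighth of the memory's rated 104 MHz.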

We don't know if the actual problem has now been solved, or if the timing simply has now changed so that it doesn't occur any more.

Also, the little brain of a simple engineer (referring to me, not my colleagues) can't figure out how a speed-related HW problem can always crash the software at exactly the same place per build. One would expect it to be more random.

We'll continue testing and also pass the ball to the HW department.

If anything new comes up, we'll post it to this thread.

Big thanks to all of you who have responded in this thread. All the answers have been valuable and have helped us get this far.

waclawek.jan · ‎2019-06-26

This, together with the fact that the faults don't make sense for the given instructions, indeed appears to indicate incorrectly read data (instructions) from the QSPI FLASH. I have no explanation for the reproducibility.

JW

JLiri.1 · ‎2022-07-21

Hello! The last post in this thread was 3 years ago, BUT I had the exact same issue as described in this thread with the STMG0B1RCT part number. It is my hope that someone with a similar problem reads all the way down here and that this helps them.

In short, I solved this problem by setting the following in stm32g0xx_hal_conf.h:

#define PREFETCH_ENABLE 0U

#define INSTRUCTION_CACHE_ENABLE 0U

This solved the wacky and random hard fault problems I was having. They would trigger sporadically without rhyme or reason. Sometimes the code would run for several minutes and at other times it would crash on boot.

The cause seems to be my clock rate: I am running at the maximum speed of 64 MHz for the part. The code correctly set the flash latency to FLASH_LATENCY_2 (the longest delay).

It seems that my chip was running slightly over 64 MHz (or Vcore was slightly under 1.2 V); there is a tolerance on these values. Page 72 of the manual indicates the correct flash latency settings based on clock frequency and core voltage.
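The latency lookup described above can be sketched as a small table search. The thresholds below (0 wait states up to 24 MHz, 1 up to 48 MHz, 2 up to 64 MHz at 1.2 V core voltage) are an assumption recalled for the STM32G0 family and should be checked against the table in the manual; the names are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* One row of a flash latency table: highest HCLK allowed for a given
 * number of wait states. */
typedef struct {
    uint32_t max_hclk_hz;
    int wait_states;
} latency_step;

/* ASSUMED thresholds for STM32G0 at Vcore = 1.2 V; verify in the manual. */
static const latency_step g0_range1_steps[] = {
    { 24000000u, 0 },
    { 48000000u, 1 },
    { 64000000u, 2 },
};

/* Returns the minimum number of wait states for hclk_hz,
 * or -1 if hclk_hz is above the highest entry in the table. */
int flash_wait_states(uint32_t hclk_hz)
{
    for (size_t i = 0; i < sizeof g0_range1_steps / sizeof g0_range1_steps[0]; ++i) {
        if (hclk_hz <= g0_range1_steps[i].max_hclk_hz) {
            return g0_range1_steps[i].wait_states;
        }
    }
    return -1; /* no valid setting: the clock is out of specification */
}
```

At exactly 64 MHz the lookup returns 2 with no margin left, so any clock that runs even slightly fast falls off the end of the table, which matches the marginal behaviour described here.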

The flash accesses were marginally successful, until they were not.

I hope this helps!

Piranha · ‎2022-07-21

Most likely the issue is still there and just doesn't happen, or happens less often. The HAL/Cube broken bloatware is full of bugs and we don't know your code.

Or, especially if it's a custom board, it could be a hardware problem.

Tesla DeLorean · ‎2022-07-21

Bit of a dissimilar part, but yes, the ART contraption has been a source of "critical paths" in the past, and the FLASH still likely has access times in the 35 ns realm, so perhaps 28 MHz. I've generally tried to keep the access in the 24-27 MHz range.

In other CM0 designs, a device running slower (25 MHz) with zero wait states (i.e. 1 + 0 cycles) could outpace a device running at 32 MHz with one wait state (i.e. 1 + 1 cycles).
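The arithmetic behind both remarks can be written out as follows. This is a rough sketch that assumes every flash access costs 1 + wait-states core cycles and ignores any prefetch or cache; the function names are hypothetical.

```c
/* Highest clock (MHz) at which one access fits in a single cycle,
 * given the flash access time in nanoseconds. */
double max_zero_wait_mhz(double access_time_ns)
{
    return 1000.0 / access_time_ns;
}

/* Effective flash access rate (million accesses per second) when each
 * access takes 1 + wait_states core clock cycles. */
double effective_access_mhz(double core_mhz, unsigned wait_states)
{
    return core_mhz / (1.0 + (double)wait_states);
}
```

A 35 ns access time gives a zero-wait limit of about 28.6 MHz. At 25 MHz with zero wait states the effective rate is 25 million accesses per second, while 32 MHz with one wait state yields only 16 million, which is why the slower device can come out ahead on uncached code.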

Jukka Lamminmäki · 2022-07-28

Brief update: on the original problematic hardware, the supply voltage for the MCU and the memory is 2.8 V.

In a parallel design with the same MCU and memory, the supply voltage is 3.3 V and we can run the memory with a QSPI clock prescaler value of 3, compared to 16 in the original design.

The layout of the memory lines is probably not exactly the same, which can also have an effect, but we have a feeling that the increase in supply voltage alone makes the difference.

Piranha · ‎2022-07-28

By the way...

> Never got D-cache working properly

With the help of @Pavel A. and someone in the ARM forum, this was finally solved. It turned out to be incorrect documentation from... absolutely all manufacturers involved! I recommend reading this topic, including the comments:

https://community.st.com/s/question/0D53W00001Z9K9TSAV/maintaining-cpu-data-cache-coherence-for-dma-buffers