Random hardfault bug with STM32F730

Jukka Lamminmäki · 2019-06-06

We are struggling with a mysterious bug in our STM32F730 project; even the bug itself is not easy to explain.

We use FreeRTOS, and our code runs XIP (execute in place) from an SST26VF016B QSPI flash.

The IDE is Atollic, version 9.1.0.

The problem is that the application ends up in the hardfault handler, but not always.

Sometimes we can build a binary that runs perfectly for hours or "forever". But if we add a line of code at some random place, the binary crashes after a random time (seconds, minutes or hours), and nowhere near the line where the code was added.

When debugging the problem, one build always crashes at about the same place and another build at a totally different place. There seems to be nothing in common between these places.

With the debugger we can get the LR, SP and PC values for the place where the hardfault occurs, but that is not helpful: they point to code in which there is nothing wrong and which has already run thousands of times before suddenly crashing.

The debug trace usually shows the last "signal handler called" entry at address 0xFFFFFFF1 or 0xFFFFFFFD. The problem seems to be somehow asynchronous, maybe related to interrupts.

We do have a simple test program built with the same drivers, and it has never crashed, although it uses the same interfaces: SPI, three UARTs, ADC, floating point calculations.

We have tried SW and HW floating point, different QSPI speeds and different compiler options.

What we would need is some trace of the code flow just before the hardfault occurs, but there is no obvious place in the code to put a breakpoint.

Does anybody have any hints on how to track down the bug? Has anybody encountered similar problems?

Piranha · ‎2019-06-09

Are the interrupt priorities correct with respect to the FreeRTOS requirements described here?

https://community.st.com/s/question/0D50X0000AurNZJSQ2/real-time-isr-no-interruptions-using-the-rtos-middlewares-how

It really has to be asked: which drivers are you using? HAL, LL or other?

Jukka Lamminmäki · 2019-06-09

The interrupt priorities are the following:

#define configPRIO_BITS 4 /* 15 priority levels */
#define configLIBRARY_LOWEST_INTERRUPT_PRIORITY 15
#define configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY 5

/* The lowest priority. */
#define configKERNEL_INTERRUPT_PRIORITY ( configLIBRARY_LOWEST_INTERRUPT_PRIORITY << (8 - configPRIO_BITS) )

/* Priority 5, or 95 as only the top four bits are implemented. */
#define configMAX_SYSCALL_INTERRUPT_PRIORITY ( configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY << (8 - configPRIO_BITS) )

and all the interrupt priorities we use are related to configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY:

configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY + n, where n is 1 ... 4
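For reference, the rule behind these numbers can be sketched as a small check. On Cortex-M a numerically larger priority value means a logically lower priority, and FreeRTOS only allows its "FromISR" API to be called from interrupts whose numeric priority is at or above configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY. The helper names below are hypothetical and only illustrate the arithmetic; they are not part of FreeRTOS or HAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the configuration quoted above. */
#define configPRIO_BITS 4
#define configLIBRARY_LOWEST_INTERRUPT_PRIORITY 15
#define configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY 5

/* Hypothetical helper: true if an ISR at this unshifted NVIC priority is
 * allowed to call FreeRTOS "...FromISR" functions. A numerically larger
 * value is a logically lower priority on Cortex-M. */
static bool prio_may_use_freertos_api(uint32_t prio)
{
    return prio >= configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY &&
           prio <= configLIBRARY_LOWEST_INTERRUPT_PRIORITY;
}

/* Hypothetical helper: the value of the shift expression used by the
 * config macros, i.e. the priority moved into the implemented top bits. */
static uint8_t prio_to_register(uint32_t prio)
{
    assert(prio < (1u << configPRIO_BITS));
    return (uint8_t)(prio << (8 - configPRIO_BITS));
}
```

With these values, priorities 6 to 9 (MAX_SYSCALL + 1 ... 4) pass the check, while anything from 0 to 4 would not. The check only makes sense if all four priority bits are used as preemption priority, which is worth confirming in the NVIC priority grouping.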

We use three UARTs (DMA-based) and one SPI (interrupt-based).

We are using the HAL libraries, revision 1.2.5, 02-February-2018.

Jukka Lamminmäki · 2019-06-10

Here is one example we just caught.

The debugger's fault analyzer shows that, on entry to the hardfault handler, LR points to the second if statement in the following function:

static bool ValidateNMEAChecksum(const char* nmea)
{
    if (nmea[0] != '$')
    {
        return false;
    }

    const char* checksumPointer = strchr(nmea, '*');
    if (!checksumPointer)
    {
        return false;
    }

***

and PC points to the while statement in the following method:

const char* xxxx_Gnss::GetNextData(const char* data) const
{
    if (!data)
    {
        return nullptr;
    }

    uint32_t i = 0;
    while (data[i] != ',')
    {
        if (data[i] == 0)
        {
            return nullptr;
        }
        i++;
    }

***

This makes no sense, since neither of the functions is called from inside the other. Both are used from the same thread: first the ValidateNMEAChecksum() function is called, and after that the GetNextData() method.

Piranha · ‎2019-06-10

Are the nmea and data buffers guaranteed to always be zero-terminated?

F7 HAL has two newer versions. 1.2.6 has updates related to the F730, and 1.2.7 has a significant number of improvements, especially for UART.

By the way, since you seem to care about quality: there is a practice of always comparing pointers explicitly with == / != NULL.

https://stackoverflow.com/a/1284067

Jukka Lamminmäki · 2019-06-10

The previous example happened only once. The following two examples, from a different build with modified compiler optimization settings, seem to occur more often, though not on every run:

PC points to the line:

float hdop = atof(data);

Our code is linked starting at 0x9000 0000, and the RAM used is the MCU's internal SRAM at 0x2000 0000.

The stack is not overrun.

This build runs normally for minutes and then suddenly crashes with the results shown above.

turboscrew · ‎2019-06-10

The return code in LR also tells you which stack is used for pulling the frame at return and which mode it then returns to (thread/handler).
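As a rough sketch of how that return code (EXC_RETURN) can be read on a Cortex-M7: bit 2 selects the stack the frame was pushed on, bit 3 the mode being returned to, and bit 4 tells whether the frame also contains FPU registers. The struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool thread_mode; /* bit 3 set: return to Thread mode, clear: Handler mode */
    bool uses_psp;    /* bit 2 set: frame was pushed on PSP, clear: on MSP */
    bool fp_frame;    /* bit 4 clear: extended frame including FPU registers */
} exc_return_info;

/* Hypothetical helper. Returns false if lr is an ordinary return address
 * rather than an EXC_RETURN value. */
static bool decode_exc_return(uint32_t lr, exc_return_info *out)
{
    assert(out != NULL);
    if ((lr & 0xFFFFFFE0u) != 0xFFFFFFE0u) {
        return false;
    }
    out->thread_mode = (lr & (1u << 3)) != 0;
    out->uses_psp    = (lr & (1u << 2)) != 0;
    out->fp_frame    = (lr & (1u << 4)) == 0;
    return true;
}
```

Applied to the two values reported earlier in the thread: 0xFFFFFFF1 decodes as a return to Handler mode with the frame on MSP, so the fault hit while another exception handler was running, and 0xFFFFFFFD decodes as a return to Thread mode with the frame on PSP, which under FreeRTOS normally means task code.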

To be on the safe side, it might be good to check the stack frame.
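A minimal sketch of such a check, assuming the basic 8-word frame (no FPU state) and that the hardfault handler has already picked the right stack pointer (MSP or PSP, depending on bit 2 of the return code in LR). The type and function names are hypothetical, and the code region limits are parameters the caller has to supply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Basic exception frame as pushed by the core, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12;
    uint32_t lr;   /* LR of the interrupted code, not the EXC_RETURN value */
    uint32_t pc;   /* address the exception would return to */
    uint32_t xpsr;
} exc_frame;

/* Copy the eight stacked words starting at the captured stack pointer. */
static exc_frame read_exc_frame(const uint32_t *sp)
{
    assert(sp != NULL);
    exc_frame f = { sp[0], sp[1], sp[2], sp[3], sp[4], sp[5], sp[6], sp[7] };
    return f;
}

/* Sanity check: the Thumb bit (xPSR bit 24) must be set and the stacked PC
 * should lie inside the code region given by the caller. */
static bool frame_looks_plausible(const exc_frame *f,
                                  uint32_t code_start, uint32_t code_size)
{
    assert(f != NULL);
    bool thumb_bit_set = (f->xpsr & (1u << 24)) != 0;
    bool pc_in_code = f->pc >= code_start && (f->pc - code_start) < code_size;
    return thumb_bit_set && pc_in_code;
}
```

For the setup described in this thread one would pass 0x90000000 as code_start; a size of 0x200000 would be an assumption based on the 16 Mbit flash part. A frame that fails the check suggests the stack itself was corrupted before the fault.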

turboscrew · ‎2019-06-10

Have you checked the contents of CFSR and HFSR?
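For reference, CFSR sits at 0xE000ED28 and HFSR at 0xE000ED2C (SCB->CFSR and SCB->HFSR in CMSIS). Below is a small decoder sketch that turns a CFSR value copied from the debugger into flag names; the table follows the ARMv7-M bit layout, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { uint32_t mask; const char *name; } fault_bit;

/* CFSR = MMFSR (bits 0-7) | BFSR (bits 8-15) | UFSR (bits 16-31). */
static const fault_bit cfsr_bits[] = {
    { 1u << 0,  "IACCVIOL" },    { 1u << 1,  "DACCVIOL" },
    { 1u << 3,  "MUNSTKERR" },   { 1u << 4,  "MSTKERR" },
    { 1u << 5,  "MLSPERR" },     { 1u << 7,  "MMARVALID" },
    { 1u << 8,  "IBUSERR" },     { 1u << 9,  "PRECISERR" },
    { 1u << 10, "IMPRECISERR" }, { 1u << 11, "UNSTKERR" },
    { 1u << 12, "STKERR" },      { 1u << 13, "LSPERR" },
    { 1u << 15, "BFARVALID" },   { 1u << 16, "UNDEFINSTR" },
    { 1u << 17, "INVSTATE" },    { 1u << 18, "INVPC" },
    { 1u << 19, "NOCP" },        { 1u << 24, "UNALIGNED" },
    { 1u << 25, "DIVBYZERO" },
};

/* Writes the names of all set flags, space separated, into out and
 * returns how many flags were set. */
static size_t describe_cfsr(uint32_t cfsr, char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);
    size_t count = 0;
    out[0] = '\0';
    for (size_t i = 0; i < sizeof cfsr_bits / sizeof cfsr_bits[0]; i++) {
        if (cfsr & cfsr_bits[i].mask) {
            size_t used = strlen(out);
            snprintf(out + used, out_size - used, "%s%s",
                     count ? " " : "", cfsr_bits[i].name);
            count++;
        }
    }
    return count;
}

/* HFSR bit 30 (FORCED): the hardfault is an escalation of another fault. */
static bool hardfault_is_escalated(uint32_t hfsr)
{
    return (hfsr & (1u << 30)) != 0;
}
```

If BFARVALID or MMARVALID is set, the faulting address can be read from BFAR (0xE000ED38) or MMFAR (0xE000ED34). IMPRECISERR means the stacked PC is not the instruction that caused the bus error, which would fit a crash location that looks innocent.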

waclawek.jan · ‎2019-06-10

I once chased a randomly occurring error which turned out to be hardware: spikes generated by a piezo element got into one of the STM32 pins through an unexpected parasitic path.

The automagically decoded "last C line" is completely worthless in this case. Go to the disassembly and look at the last executed instruction (i.e. the one before the one pointed to by the PC value stored on the stack), and sometimes also a couple of instructions before that. From that, from the content of the fault registers mentioned by turboscrew (because on the F7 a hardfault is rarely anything other than an escalation of some other fault), and from the content of the registers (some of them stored on the stack), it should be immediately clear exactly which instruction is the culprit and why. Then work back through the mixed source/disassembly to find the C source and understand the underlying root problem.

But in the case of wildly varying and completely unreproducible errors, this rarely leads to a solution. As Clive said, work back through the stack and try to find some recognizable patterns. An RTOS makes this harder, up to almost impossible; that's one of the joys of using multitasking, but that was your choice.

Another approach is divide et impera: start removing things from your program until it stops crashing, and/or gradually add them back until it crashes again, and then try to ascertain whether the removed/added component might be the culprit.

> stacks are never overflown.

This sounds like a very confident assertion. How do you know?
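One common way to back up such a claim is stack painting: fill the stack with a known pattern at startup and later count how many words at the far end are still untouched. FreeRTOS does something similar for task stacks (uxTaskGetStackHighWaterMark() and configCHECK_FOR_STACK_OVERFLOW), but the main stack used by interrupt handlers has to be checked separately. A sketch with hypothetical helper names, assuming a full-descending stack as on Cortex-M:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u /* arbitrary, easy to recognise pattern */

/* Fill the whole stack area with the pattern before it is put to use. */
static void paint_stack(uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/* The stack grows downwards, so the untouched words are at the low end.
 * Returns the number of words never written since painting. */
static size_t unused_stack_words(const uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    size_t n = 0;
    while (n < words && stack[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

A result of zero, or close to it, means the stack has very likely overflowed at some point. The measure is approximate: a function can reserve stack space without writing to all of it, so a small non-zero margin is not proof of safety.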

JW

Tesla DeLorean · ‎2019-06-10

I'm with @Piranha on the NUL termination: strchr() is potentially unbounded.

Instrument the code, walk the buffer, perhaps have a constrained length here, and test for it, dumping buffers which fail.
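A minimal sketch of a length-constrained search, assuming the caller knows the capacity of the receive buffer. The function name is hypothetical; it stands in for the strchr() call in ValidateNMEAChecksum().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Looks for '*' without ever reading past max_len bytes.
 * max_len must not exceed the real size of the underlying buffer.
 * Returns NULL if the buffer is not NUL-terminated within max_len
 * or if the terminated string contains no '*'. */
static const char *find_checksum_bounded(const char *nmea, size_t max_len)
{
    assert(nmea != NULL);
    const char *end = memchr(nmea, '\0', max_len);
    if (end == NULL) {
        return NULL; /* unterminated: a good place to dump the buffer */
    }
    return memchr(nmea, '*', (size_t)(end - nmea));
}
```

Logging or dumping the buffer in the unterminated case is the instrumentation part: if that branch is ever hit on the target, the parser has been handed a string that the unbounded version would have walked past.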

Try to establish a pattern/flow to failure, especially if it is at same place consistently.

Optimization changing behaviour often points to logic or initialization conditions in the code.


pprovencher · ‎2019-06-11

We are experiencing something very similar with an STM32F413, not using an RTOS but using UART, CAN and SDIO as an eMMC interface, with FatFs on top as high-level storage management. We are using the ST HAL, but the rest of the code is ported from a Freescale HCS12X project. Atollic 9.3, latest HAL libs.

It had been working pretty well since we started using the STM32. We were adding new libraries one at a time, making each work before adding the next one. At some point the eMMC was functional with FatFs, I added some lines to one of the functions, and our nightmare started! If I added more lines to debug the problem, the problem disappeared. If I change the return type of the function from bool to byte, it works!?!? At some point, changing the compiler optimization to none fixed the problem too. In fact, the weird thing here is that my function seems to return true all the time, but the calling function receives/reads/changes the result to false!!!

My colleague had something similar. Using the same code as mine, he was testing a new library, but when he added a new function call in the main loop, one CAN message stopped being processed while the others were still processed correctly! He changed some other lines and a hardfault occurred.

I know my post does not really help, but maybe it shows that this can happen not only with QSPI or an RTOS but with many different setups.

I'm not as advanced as you guys are, but my feeling is that it is related to a buffer overflow or a stack overflow, depending on the case. But how do we catch it when we do not even get a hardfault!?

PP