
Random hardfault bug with STM32F730

June 6, 2019

We are struggling with a mysterious bug in our STM32F730 project; even the bug itself is not easy to explain.

We use FreeRTOS, and our code runs XIP (execute in place) from an SST26VF016B QSPI flash.

The IDE is Atollic TrueSTUDIO, version 9.1.0.

The problem is that the application crashes into the hard fault handler, but not always.

Sometimes we can build a binary that runs perfectly for hours or "forever". But if we add a line of code at some arbitrary place, the binary crashes after a random time (seconds, minutes or hours), and nowhere near the line that was added.

When debugging the problem, one build always crashes at about the same place and another build at a totally different place. There seems to be nothing in common between these places.

With the debugger we can get the LR, SP and PC values at the point where the hard fault occurs, but that is not helpful: they point to code in which there is nothing wrong and which has already run thousands of times before suddenly crashing.

The debug trace usually shows the last "signal handler called" address as 0xFFFFFFF1 or 0xFFFFFFFD. The problem seems to be somehow asynchronous, perhaps related to interrupts.

We do have a simple test program built with the same drivers, and it has never crashed, although it uses the same interfaces: SPI, three UARTs, ADC, floating-point calculations.

We have tried software and hardware floating point, different QSPI speeds and different compiler options.

What we need is a trace of the code flow just before the hard fault occurs, but we can't put a breakpoint anywhere in the code.

Does anybody have any hints on how to track down the bug? Has anybody encountered similar problems?


Tesla DeLorean

Guru

You'd have to pull the processor registers off the stack and also look at the ones not stacked.

If the failing address is consistent, you need to look very carefully at the instructions immediately prior and with the context of the registers.

64-bit reads (LDRD) or multiple-register reads (LDM) will fault on unaligned addresses.

Have your code use a hard fault handler that outputs diagnostic information to the console; I've posted code examples to do this dozens of times.

Have your other code output telemetry and flow information, perhaps enabled by a memory variable so you can turn it on/off.

The hard fault will typically have a magic return address (i.e. LR = 0xFFFFFFFD), as this causes the processor to unwind the context it pushed on the stack for the event/fault. Review a manual on the core if this concept is not familiar to you.
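As a rough illustration of the diagnostic handler idea above, here is a minimal sketch in C of the part that formats the stacked registers. It is not the code referred to in this post. The function name `fault_frame_format` and the enum names are made up for this example, and it assumes the basic 8-word frame (no FPU state stacked). On the target, a small assembly stub in the hard fault handler would test bit 2 of LR, pick MSP or PSP accordingly, and pass that pointer in; the stub is left out here because it is toolchain specific.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Order in which a Cortex-M core stacks registers on exception entry
   (basic frame, without FPU state). */
enum { F_R0, F_R1, F_R2, F_R3, F_R12, F_LR, F_PC, F_XPSR, F_WORDS };

/* Hypothetical helper: formats the stacked frame at sp into buf.
   Returns the snprintf result (characters needed, excluding the NUL). */
int fault_frame_format(const uint32_t *sp, char *buf, size_t len)
{
    return snprintf(buf, len,
        "R0=%08lX R1=%08lX R2=%08lX R3=%08lX "
        "R12=%08lX LR=%08lX PC=%08lX xPSR=%08lX",
        (unsigned long)sp[F_R0],  (unsigned long)sp[F_R1],
        (unsigned long)sp[F_R2],  (unsigned long)sp[F_R3],
        (unsigned long)sp[F_R12], (unsigned long)sp[F_LR],
        (unsigned long)sp[F_PC],  (unsigned long)sp[F_XPSR]);
}
```

The resulting string can be pushed out over whatever console is available (a polled UART write is the usual choice inside a fault handler). The stacked PC and LR are the values to look up in the disassembly, not the LR seen inside the handler itself.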


turboscrew

Senior III

The return code in LR also tells you which stack is used for pulling the frame at return and which mode it then returns to (thread/handler).

To be on the safe side, it might be good to check the stack frame.
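To make the LR return code concrete, here is a small sketch in C that decodes an EXC_RETURN value according to the ARMv7-M bit layout (bit 2 selects the stack, bit 3 the mode, bit 4 the frame type). The struct and function names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of an ARMv7-M EXC_RETURN value. */
typedef struct {
    bool valid;        /* upper bits match the EXC_RETURN pattern        */
    bool thread_mode;  /* bit 3: 1 = return to thread, 0 = to a handler  */
    bool uses_psp;     /* bit 2: 1 = frame is on PSP, 0 = frame on MSP   */
    bool fp_frame;     /* bit 4: 0 = extended frame with FPU registers   */
} exc_return_info_t;

exc_return_info_t decode_exc_return(uint32_t lr)
{
    exc_return_info_t info;
    info.valid       = (lr & 0xFFFFFFE0u) == 0xFFFFFFE0u;
    info.thread_mode = (lr & (1u << 3)) != 0;
    info.uses_psp    = (lr & (1u << 2)) != 0;
    info.fp_frame    = (lr & (1u << 4)) == 0;
    return info;
}
```

Applied to the two values from the original question: 0xFFFFFFFD decodes to thread mode with the frame on PSP, i.e. the fault hit while a task was running, and 0xFFFFFFF1 decodes to handler mode on MSP, i.e. the fault hit while another exception handler was already executing.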

waclawek.jan

Super User

> We do have the simple test software build with the same drivers and it has never crashed

Was this running from the QSPI too?

> We use FreeRTOS

Have you analyzed the stacks for overflows?

JW

Jukka Lamminmäki (Author)

Associate III

The test software is running from the QSPI FLASH and stacks are never overflown.

Ozone

Principal

For one thing, using the FPU in interrupts has certain implications (lazy stacking?).

Have you tried a heat gun to check for a relation to temperature, or reducing the clock frequency?

If so, it might be the Flash settings.

Jukka Lamminmäki (Author)

Associate III

We have kept the interrupts simple, no FPU used there.

External conditions do not seem to be the issue: a bad build crashes at the same spot on different devices. One build can manage e.g. 15 minutes, at room temperature or in Finnish spring temperatures near zero degrees C. I must say, though, that today it is about +30 °C outside.

We have tried different QSPI prescalers with no effect.

Ozone

Principal

Reduced clock speed?

Reduced optimization settings?

Could a "stable build" be provoked into a hard fault at elevated temperature?

Does elevated temperature have an effect on the "MTB-hardfaults"?

Just some ideas, when a systematic approach seems to lead nowhere...

Piranha

Principal III

Is D-cache used? If yes, try not enabling it and see if that changes anything. Though that also requires not calling relevant clean/invalidate functions, if those are used in code.

Jukka Lamminmäki (Author)

Associate III

We have only enabled the I-cache. We never got the D-cache working properly.

Piranha

Principal III

Are the interrupt priorities correct with respect to the FreeRTOS requirements described here?

https://community.st.com/s/question/0D50X0000AurNZJSQ2/real-time-isr-no-interruptions-using-the-rtos-middlewares-how

Here it really has to be asked: which drivers are you using? HAL, LL or other?

Jukka Lamminmäki (Author)

Associate III

The interrupt priorities are the following:

#define configPRIO_BITS 4 /* 15 priority levels */
#define configLIBRARY_LOWEST_INTERRUPT_PRIORITY 15
#define configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY 5

/* The lowest priority. */
#define configKERNEL_INTERRUPT_PRIORITY ( configLIBRARY_LOWEST_INTERRUPT_PRIORITY << (8 - configPRIO_BITS) )

/* Priority 5, or 95 as only the top four bits are implemented. */
#define configMAX_SYSCALL_INTERRUPT_PRIORITY ( configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY << (8 - configPRIO_BITS) )

All the interrupt priorities we use are set relative to configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY:

configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY + n, where n is 1 ... 4

We use three UARTs (DMA based) and one SPI (interrupt based).

We are using the HAL libraries, revision 1.2.5, 02-February-2018.
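For readers checking the arithmetic behind these settings, here is a small sketch in C. The macro and function names are invented for the example and only mirror the values quoted above; nothing here is taken from FreeRTOS itself. It assumes 4 implemented priority bits and that all of them are used as preemption priority.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values mirrored from the configuration quoted above (names are made up). */
#define PRIO_BITS             4u
#define MAX_SYSCALL_LIB_PRIO  5u
#define LOWEST_LIB_PRIO       15u

/* Converts a "library" priority (0..15) to the value held in the 8-bit
   NVIC priority field, where only the top PRIO_BITS bits are implemented. */
uint8_t prio_to_register(uint32_t lib_prio)
{
    return (uint8_t)((lib_prio << (8u - PRIO_BITS)) & 0xFFu);
}

/* An ISR may call the ...FromISR() API only if its priority number is at
   or above the syscall ceiling (a larger number means lower urgency). */
bool isr_may_use_rtos_api(uint32_t lib_prio)
{
    return lib_prio >= MAX_SYSCALL_LIB_PRIO && lib_prio <= LOWEST_LIB_PRIO;
}
```

With these numbers, priority 5 becomes 0x50 (80) in the register; the 95 in the stock comment is 0x5F, the same setting with the unimplemented low bits counted as ones. Priorities 6 to 9 (ceiling + 1 ... 4) pass the check, while anything numerically below 5 must never call a FreeRTOS API function.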

Jukka Lamminmäki (Author)

Associate III

Here is one example we just caught.

The debugger's fault analyzer shows that LR, on entry to the hard fault handler, points to the second if statement in the following function:

static bool ValidateNMEAChecksum(const char* nmea)
{
    if (nmea[0] != '$')
    {
        return false;
    }

    const char* checksumPointer = strchr(nmea, '*');
    if (!checksumPointer)
    {
        return false;
    }

***

and PC points to the while statement in the following method:

const char* xxxx_Gnss::GetNextData(const char* data) const
{
    if (!data)
    {
        return nullptr;
    }

    uint32_t i = 0;
    while (data[i] != ',')
    {
        if (data[i] == 0)
        {
            return nullptr;
        }
        i++;
    }

***

This makes no sense, since neither of the functions is called from inside the other. Both are used from the same thread: first the ValidateNMEAChecksum() function is called, and after that the GetNextData() method.

Piranha

Principal III

Are the nmea and data variables guaranteed to always be zero-terminated?

F7 HAL has two newer versions. 1.2.6 has updates related to the F730, and 1.2.7 has a significant number of improvements, especially for UART.

By the way, as it seems that you care about quality... There is a practice of always comparing pointers explicitly against NULL with == / !=.

https://stackoverflow.com/a/1284067

Tesla DeLorean

Guru

I'm with @Piranha on the NUL termination; strchr() is potentially unbounded.

Instrument the code, walk the buffer, perhaps have a constrained length here, and test for it, dumping buffers which fail.

Try to establish a pattern/flow to failure, especially if it is at the same place consistently.

Optimization changing behaviour often points to logic or initialization conditions in the code.
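The constrained-length idea above could look something like the following sketch in C. The function name `find_checksum_bounded` is invented for this example and it is not a drop-in replacement for the code in the thread; it assumes the caller knows the size of the receive buffer the sentence lives in.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical bounded replacement for strchr(nmea, '*').
   Never reads beyond buf[len - 1]. Returns a pointer to the '*' that
   precedes the checksum, or NULL if the sentence does not start with '$',
   is not NUL-terminated inside the buffer, or has no '*'. */
const char *find_checksum_bounded(const char *buf, size_t len)
{
    if (buf == NULL || len == 0 || buf[0] != '$') {
        return NULL;
    }
    /* Locate the terminator first, so a missing NUL is detected explicitly. */
    const char *end = memchr(buf, '\0', len);
    if (end == NULL) {
        return NULL; /* unterminated: a good place to dump the buffer */
    }
    return memchr(buf, '*', (size_t)(end - buf));
}
```

A NULL return from the unterminated branch is the interesting case to log, since it means strchr() on the same buffer would have kept reading past its end.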


Jukka Lamminmäki (Author)

Associate III

The previous example was a one-time occurrence. The following two examples, from a different build (compiler optimization settings modified), seem to occur more often, though not in every run:

PC points to the line:

float hdop = atof(data);

Our code is linked starting from 0x9000 0000, and the SRAM used is the MCU's internal SRAM at 0x2000 0000.

The stack is not overrun.

This build runs normally for minutes and then suddenly crashes with the results shown above.

turboscrew

Senior III

Have you checked the contents of CFSR and HFSR?
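As a sketch of what checking those registers involves, the following C code names the set bits of a CFSR value and tests the FORCED bit of HFSR. The bit positions follow the ARMv7-M definition; the table and function names are invented for this example. On the target the inputs would be read from SCB->CFSR (0xE000ED28) and SCB->HFSR (0xE000ED2C), here they are passed in as plain values.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t mask; const char *name; } fault_bit_t;

/* CFSR = MemManage (bits 0-7) | BusFault (8-15) | UsageFault (16-31). */
static const fault_bit_t cfsr_bits[] = {
    { 1u << 0,  "IACCVIOL"   }, { 1u << 1,  "DACCVIOL"    },
    { 1u << 3,  "MUNSTKERR"  }, { 1u << 4,  "MSTKERR"     },
    { 1u << 5,  "MLSPERR"    }, { 1u << 7,  "MMARVALID"   },
    { 1u << 8,  "IBUSERR"    }, { 1u << 9,  "PRECISERR"   },
    { 1u << 10, "IMPRECISERR"}, { 1u << 11, "UNSTKERR"    },
    { 1u << 12, "STKERR"     }, { 1u << 13, "LSPERR"      },
    { 1u << 15, "BFARVALID"  }, { 1u << 16, "UNDEFINSTR"  },
    { 1u << 17, "INVSTATE"   }, { 1u << 18, "INVPC"       },
    { 1u << 19, "NOCP"       }, { 1u << 24, "UNALIGNED"   },
    { 1u << 25, "DIVBYZERO"  },
};

/* Writes the names of all set CFSR bits to out, space separated.
   Returns how many names were written. */
int cfsr_describe(uint32_t cfsr, char *out, size_t len)
{
    int count = 0;
    size_t used = 0;
    if (len == 0) {
        return 0;
    }
    out[0] = '\0';
    for (size_t i = 0; i < sizeof cfsr_bits / sizeof cfsr_bits[0]; i++) {
        if ((cfsr & cfsr_bits[i].mask) == 0) {
            continue;
        }
        int n = snprintf(out + used, len - used, "%s%s",
                         count ? " " : "", cfsr_bits[i].name);
        if (n < 0 || (size_t)n >= len - used) {
            break; /* output buffer full */
        }
        used += (size_t)n;
        count++;
    }
    return count;
}

/* HFSR bit 30 (FORCED): the hard fault is an escalated configurable fault. */
int hfsr_is_escalated(uint32_t hfsr)
{
    return (hfsr & (1u << 30)) != 0;
}
```

When BFARVALID or MMARVALID is set, the faulting address is available in BFAR (0xE000ED38) or MMFAR (0xE000ED34) respectively. IMPRECISERR is worth watching for in particular, because with an imprecise bus fault the stacked PC does not point at the instruction that caused it.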

waclawek.jan

Super User

I once chased a randomly occurring error which turned out to be hardware: spikes generated by a piezo element got into one of the STM32 pins through an unexpected parasitic path.

The automagically decomposed "last C line" is completely worthless in this case. Go to the disassembly and look at the last instruction executed (i.e. the one before the one pointed to by the PC value stored on the stack), and sometimes also at a couple of instructions before that. From that, from the content of the fault registers mentioned by turboscrew (because on the F7 a hard fault is rarely anything other than an escalation of some other fault), and from the content of the registers (some of them stored on the stack), it should be immediately clear exactly which instruction is the culprit and why. Then work back through the mixed source/disassembly to find the C source and understand the underlying root problem.

But in the case of wildly varying and completely unreproducible errors, this rarely leads to a solution. As Clive said, work back through the stack and try to find some recognizable patterns. An RTOS makes this harder, to almost impossible; that's one of the joys of using multitasking, but that was your choice.

Another approach is divide et impera: start removing things from your program until it stops crashing, and/or gradually add things until it crashes again, and then try to ascertain whether the removed/added component might be the culprit.

> stacks are never overflown.

This sounds like a very confident assertion. How do you know?
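One common way to back such an assertion with data is stack painting: fill the stack with a known pattern before use and later count how much of the pattern survives. The sketch below in C shows the idea on a plain array. The function names are invented for the example, and it assumes a descending stack, so the lowest addresses are the last to be touched.

```c
#include <stddef.h>
#include <stdint.h>

/* FreeRTOS fills task stacks with 0xA5 bytes; the same pattern is used here. */
#define STACK_FILL_WORD 0xA5A5A5A5u

/* Paints a stack region before it is used. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL_WORD;
    }
}

/* Counts untouched words from the low end of the region, which is where a
   descending stack runs out. A result of 0 means the stack has been
   completely used and has very likely overflowed. */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == STACK_FILL_WORD) {
        n++;
    }
    return n;
}
```

For task stacks, FreeRTOS provides uxTaskGetStackHighWaterMark() and the configCHECK_FOR_STACK_OVERFLOW hook for the same purpose. The main stack (MSP), which the interrupt handlers run on in the Cortex-M ports, is not covered by those checks, so painting it by hand in this manner is one way to watch it as well.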

JW
