FW crashes randomly after hours - how to debug?

tjaekel · ‎2024-05-10

This is not a question, it is more an "experience report" about finding "reliability issues" (flaws) in my FW.

Many people know: there is a HardFault_Handler, called when something goes wrong (e.g. accessing a non-existent memory address such as address 0 (NULL) because of an uninitialized pointer).

But the ARM Cortex MCUs also have other "traps" (INT vectors), e.g. MemManage_Handler, BusFault_Handler, ...

In my case:

after 1 hour or longer, the FW crashed (randomly, after a variable period of time)
it hit the "MemManage_Handler" (often not really implemented to act on it, people focus on "HardFault_Handler" only)

While debugging my crashed FW (I let it run with the debugger connected), I came to this observation:

there was an interrupt (for USB VCP) and afterwards
the code was executed from "somewhere else" until it hit the "MemManage_Handler"
I then checked what was going on: where the PC is, what the LR register holds, the SP etc.

I realized: the code tried to execute "code" in a text buffer intended for VCP UART transfer
(there is ASCII text in there, but the MCU had jumped into this buffer and tried to execute the ASCII text as instructions).

After checking some registers, esp. PC and LR, tracing back a bit, and displaying the memory content of the stack (SP register), I realized:

an INT happened (my USB INT for VCP UART)
after the INT handler was done, the code wanted to return from the INT:
the saved registers were popped back from the stack incorrectly: the PC got a value which was right in my ASCII text buffer for sending over VCP UART, so the text was taken as instructions (the PC was executing in an SRAM buffer)
a bit later it entered the "MemManage_Handler" (due to the "wrong code" executed, maybe an invalid instruction)
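
The check described above (is the stacked PC plausible?) can be sketched in a few lines. This is only a minimal sketch, assuming the basic Cortex-M exception frame without FPU context, i.e. the eight words R0-R3, R12, LR, PC, xPSR that the core pushes on exception entry. The names read_exc_frame and frame_looks_valid and the flash range CODE_START/CODE_END are made up for illustration; take the real range from your own linker script.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Basic (no FPU) exception frame pushed by a Cortex-M core on exception
 * entry, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} exc_frame_t;

/* ASSUMPTION: code is executed from flash in this address range.
 * Hypothetical values, check your linker script. */
#define CODE_START 0x08000000u
#define CODE_END   0x08100000u

/* Copy the eight stacked words found at 'sp' into a frame structure. */
static exc_frame_t read_exc_frame(const uint32_t *sp)
{
    assert(sp != NULL);
    exc_frame_t f = { sp[0], sp[1], sp[2], sp[3], sp[4], sp[5], sp[6], sp[7] };
    return f;
}

/* A plausible frame returns into the code region and has the Thumb bit
 * (bit 24 of xPSR) set. A PC that points into SRAM, e.g. into a text
 * buffer, fails this check. */
static bool frame_looks_valid(const exc_frame_t *f)
{
    assert(f != NULL);
    bool pc_in_code = (f->pc >= CODE_START) && (f->pc < CODE_END);
    bool thumb_set  = (f->xpsr & (1u << 24)) != 0u;
    return pc_in_code && thumb_set;
}
```

In a debugger you would do the same by hand: dump eight words at the stack pointer that was active when the fault hit and look at the seventh one (the stacked PC).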

This was already a clear indication, that something went wrong with the MCU stack, e.g. stack region too small.

But it became a bit difficult to investigate why stack should be too small:
Imagine: if you run any code (using also the stack), it can be interrupted at any time, so that the entry of the INT handler saves some registers on the same current stack in use and wants to restore at the end from there.

If your INT Handler is now executing code, which consumes even more stack, or does something wrong, the stack is "corrupted" and the return from ISR "runs into the trees".

In my case:

I use FreeRTOS
every thread has its own stack, and its own stack size
these stacks are allocated from a different memory pool (used by the RTOS) when a thread is initialized
but still the same fact: if a HW INT comes in - the current thread is interrupted and the current thread stack
is used for the ISR handler

It turns out that, in my case, one of my RTOS threads had a stack size assigned which was too small. I have no idea in which thread it happened, but it looks to me as if a thread that uses its stack heavily had already consumed a lot of it. And when the INT interrupted this thread, the remaining thread stack was too small.

It destroyed other data, e.g. the stack of another thread, or it reached the bottom of the stack and wrote to a data region underneath the stack region.

Conclusion:

When you see your FW crashing in a random way, failing after a long time (and hitting a Fault_Handler) - check the stack size. In my case: check all the RTOS thread stack sizes.
Bear in mind that every stack should have enough space left so that when a HW INT happens, there is still room for it.
But consider also what is done during an INT, in the ISR handler: if this ISR handler also uses stack (for local, temporary variables), the remaining stack must be even larger, large enough for all the local variables used in any subsequent function call (from ISR entry, the entire call tree).
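
One common way to check how much of a stack was really used is "stack painting": fill the stack region with a known pattern before it is used, and later count how many words still hold the pattern. The following is only a minimal sketch of that idea; the pattern value and the names stack_paint and stack_unused_words are made up for illustration. FreeRTOS offers this kind of check itself via uxTaskGetStackHighWaterMark() and the configCHECK_FOR_STACK_OVERFLOW hook, so with an RTOS you would normally use those instead of rolling your own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_WORD 0xA5A5A5A5u  /* arbitrary, recognisable pattern */

/* Fill a stack region with the pattern before the stack is used. */
static void stack_paint(uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL_WORD;
    }
}

/* Count the words at the low-address end that still hold the pattern.
 * Cortex-M stacks grow downwards, so stack[0] is the last word that can
 * be used before the stack overflows. A result of 0 means the stack was
 * used up completely (or has already overflowed). */
static size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    size_t n = 0;
    while (n < words && stack[n] == STACK_FILL_WORD) {
        n++;
    }
    return n;
}
```

Let the FW run for a long time with all features active, then read the result for every stack. A value close to zero points at the stack that needs to be increased.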

A good practice for me is this:

check every function, especially the functions called in an ISR handler,
if and how much of local variables they use/need, the size of local variables (esp. buffers and arrays)
worst case: you define a local buffer, e.g. like:

void ISR_Handler(void) {
    int MyLocalBuffer[1024];    /* local: 1024 * sizeof(int) bytes taken from the stack */
    ...
}

this consumes additional memory on the stack, on top of the register context saved for the ISR!
So I often change such a buffer into a static one (to avoid the large stack allocation):

void ISR_Handler(void) {
    static int MyBuffer[1024];  /* static: placed in a data section, not on the stack */
    ...
}

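But make sure that this "shared memory" (as static) can still work: a static buffer exists only once, so every invocation of the function (including a nested or re-entered one) uses the same memory.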

So, for many functions I check how many local variables are used and how large they are, esp. when it comes to buffers and arrays, and whether such a buffer is "local" (allocated on the stack).
Personally: I avoid having many and large local variables, everywhere (also in sub-functions called in an RTOS thread). The unknown call tree, and how many local variables (and how much stack) it needs, makes it "unpredictable" when threads and INTs interrupt each other.

And do NOT check just the stack size setting in the Linker Script, esp. when using an RTOS: how much stack is assigned to every thread? Consider what happens if a HW INT kicks in (asynchronously, at any time) and what the ISR handler does (and how much stack it needs by itself), then increase the thread stack size by this "worst condition" (the maximum stack needed for an INT service, interrupting code at any time).
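
The "worst condition" arithmetic can be written down as a small helper. This is a rough sketch under stated assumptions, not a definitive rule: the frame sizes are the hardware-stacked Cortex-M exception frame (8 words without FP context, 26 words with it, plus up to 4 bytes of alignment padding), and the function names and the margin parameter are made up for illustration. One caveat to check on your own port: as far as I know, in the Cortex-M FreeRTOS ports the threads run on the process stack (PSP) while handlers run on the main stack (MSP). Then the hardware-stacked frame lands on the thread stack, but the handler's own locals land on the main stack; in that case pass 0 for isr_bytes here and budget the ISR call tree against the main stack instead.

```c
#include <assert.h>
#include <stdint.h>

#define EXC_FRAME_BASIC_BYTES 32u   /* 8 words: R0-R3, R12, LR, PC, xPSR */
#define EXC_FRAME_FPU_BYTES   104u  /* 26 words when FP context is stacked */
#define EXC_ALIGN_PAD_BYTES   4u    /* possible padding to 8-byte alignment */

/* Worst-case bytes a thread stack needs:
 *   deepest thread call chain
 * + one hardware exception frame (incl. possible padding)
 * + stack used by the ISR call tree, if it runs on this same stack
 * + a safety margin chosen by you. */
static uint32_t thread_stack_budget(uint32_t thread_bytes, uint32_t isr_bytes,
                                    int uses_fpu, uint32_t margin_bytes)
{
    uint32_t frame = uses_fpu ? EXC_FRAME_FPU_BYTES : EXC_FRAME_BASIC_BYTES;
    return thread_bytes + frame + EXC_ALIGN_PAD_BYTES + isr_bytes + margin_bytes;
}

/* Convert bytes to 32-bit words, rounding up. Useful because some APIs
 * take the stack depth in words, not bytes. */
static uint32_t bytes_to_words(uint32_t bytes)
{
    assert(bytes <= UINT32_MAX - 3u);
    return (bytes + 3u) / 4u;
}
```

Check which unit your thread-creation call expects: xTaskCreate() takes the stack depth in words, while the CMSIS-RTOS v2 attribute stack_size is given in bytes. With nested interrupts, one frame (and one ISR call tree) has to be added per nesting level.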

Increasing the stack size for the main threads (running all the time) has solved my problem.
(Via the Linker Script I also added a "security margin" (unused space) between my MCU stack and the memory used by FreeRTOS (which is also used for thread stacks).)

So, a random crash like this is often caused by a "wrong" memory layout (and incorrect sizes of regions, esp. for stacks).
Bear in mind: with an RTOS used, you have more than just one MCU stack.

tjaekel · ‎2024-05-10

I think AI (and ChatGPT) can never help to solve such a problem.
You have to be smart to come up with a "working debug strategy" (with an understanding of what is going on in your own system and in the MCU).

Andrew Neil · ‎2024-05-14

@tjaekel wrote:
When you see your FW crashing in a random way, after a long time it fails (and hits a Fault_Handler) - check the stack size.

Indeed. And not just if you're using an RTOS - this can also happen "bare-metal".

Another common cause of such "strange" faults is buffer overrun - especially when the overrunning buffer is on the stack (a local - auto - within a function).

#BufferOverrun #Stack@verflow #StackCorruption #BizarreFaults

A complex system that works is invariably found to have evolved from a simple system that worked.
A complex system designed from scratch never works and cannot be patched up to make it work.

Uwe Bonnes · ‎2024-05-15

Tracing via the trace functionality can also be of help => Orbuculum or commercial vendor tools

unsigned_char_array · ‎2024-05-15

I have an idea to solve this problem. I've had this idea for a while, but since I'm not doing anything with FreeRTOS at the moment I haven't implemented it.

I've had stack overflows in many projects with FreeRTOS. It's the standard library that gave me the most problems: snprintf, cout and regex have caused stack overflows for me. For snprintf I would assume at least 2k of stack usage.

You write a pre-build and post-build script to gather stack information and change defines or const ints of required stack sizes. Then build again if stack size has changed:

GCC provides an option to export stack usage "-fstack-usage"
this doesn't work with recursion, but you simply shouldn't use recursion or limit recursion with a constant and use that constant to calculate max stack usage for that recursive function
it doesn't work for all functions, but you could require the user to supply an estimated worst case scenario of missing information
it doesn't work with function pointers, but you could supply an estimated worst case scenario for that or write a more complicated script that takes the maximum stack usage of a list of possible functions called by the function pointer
it doesn't work with dynamically spawned threads, so avoid those
you need to assume the stack usage of the worst case interrupt gets added to every thread. If you use nested interrupts this can be more complicated, but you can always add up all stack usages of all interrupts
all the stack usage is compared to the previously calculated stack usage, if it changed (perhaps add a margin for required decrease) you need to rebuild.
when you rebuild you either use defines or const ints for stack size. If a define changes it will force rebuild and if a const int changes it will force rebuild of only 1 file and it will only have to link again. I don't know which would be better.

This would be a lot of work. A simpler version would do the following:

Same as above except the last two steps are just comparing the defined stack size with the calculated stack size. You need to scan the code for certain defines or constants that define stack size per thread, but you can standardize the naming convention so that parsing is easier.
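
The comparison step of the simpler version could look roughly like this. It is only a sketch: it assumes the .su files written by GCC's -fstack-usage contain one tab-separated line per function (function identifier, size in bytes, qualifier such as static or dynamic), and all names (su_entry_t, su_parse_line, su_is_exact, su_fits) are made up for illustration. Building the call chain itself (which function calls which) is not covered here and would need extra information, e.g. from the compiler or a separate tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    char     name[256];      /* function identifier as written by GCC */
    unsigned bytes;          /* stack usage of this function alone */
    char     qualifier[32];  /* e.g. "static", "dynamic", "dynamic,bounded" */
} su_entry_t;

/* Parse one tab-separated .su line. Returns 1 on success, 0 otherwise. */
static int su_parse_line(const char *line, su_entry_t *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "%255[^\t]\t%u\t%31[^\n]",
                  out->name, &out->bytes, out->qualifier) == 3;
}

/* Only "static" entries have a frame size that is fixed at compile time;
 * for anything else the user has to supply a worst-case estimate. */
static int su_is_exact(const su_entry_t *e)
{
    assert(e != NULL);
    return strcmp(e->qualifier, "static") == 0;
}

/* Sum the usage along one call chain and compare it with the defined
 * stack size minus a reserve (e.g. for an interrupt frame). */
static int su_fits(const su_entry_t *chain, size_t n,
                   unsigned stack_bytes, unsigned reserve_bytes)
{
    assert(chain != NULL || n == 0);
    unsigned long total = reserve_bytes;
    for (size_t i = 0; i < n; i++) {
        total += chain[i].bytes;
    }
    return total <= stack_bytes;
}
```

A post-build script would run this over the worst-case call chain of every thread and fail the build when su_fits() returns 0.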

This is still a lot of work, but I think it is doable. In automated builds this can be part of automated tests. I think this would prevent a lot of bugs. It may not catch all stack overflows if not perfectly implemented, but it will find some, which is worth it in my opinion.
