2024-05-10 06:49 PM
This is not a question but more of an "experience report" about finding "reliability issues" (flaws) in my FW.
Many people know that there is a HardFault_Handler, called when something goes wrong (e.g. accessing a non-existent memory address, e.g. address 0 (NULL), e.g. because of an uninitialized pointer).
But ARM Cortex MCUs also have other "traps" (INT vectors), e.g. MemManage_Handler, BusFault_Handler, ...
In my case:
While debugging my FW, which had crashed (I let it run with the debugger connected), I made this observation:
The code tried to execute "code" in a text buffer intended for VCP UART transfer
(there was ASCII text in it, but the MCU had jumped into this buffer and tried to execute the ASCII characters as instructions).
After checking some registers, esp. PC and LR, tracing back a bit, and displaying the memory content of the stack (SP register), I realized:
This was already a clear indication that something went wrong with the MCU stack, e.g. the stack region was too small.
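The register check described above can also be done in code inside a fault handler. The following is only a minimal sketch, assuming the basic Cortex-M exception frame without FPU state (eight words: R0-R3, R12, LR, PC, xPSR). The names ExceptionFrame, decode_frame, addr_is_code and frame_looks_corrupted are hypothetical, and CODE_START/CODE_END are placeholder values that must be taken from your own linker script. On the target, the pointer would be the stack pointer (MSP or PSP) that was active when the fault hit; here it is just a plain array.

```c
#include <stdbool.h>
#include <stdint.h>

/* Basic Cortex-M exception frame (no FPU state), in the order the
 * hardware stacks it, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12;
    uint32_t lr;    /* return address of the interrupted function */
    uint32_t pc;    /* address that was executing when the fault hit */
    uint32_t xpsr;
} ExceptionFrame;

/* ASSUMPTION: placeholder code region, replace with the values
 * from your own linker script. */
#define CODE_START 0x08000000u
#define CODE_END   0x08200000u

/* Copy the eight stacked words into a structure with named fields. */
ExceptionFrame decode_frame(const uint32_t *stacked_sp)
{
    ExceptionFrame f = {
        stacked_sp[0], stacked_sp[1], stacked_sp[2], stacked_sp[3],
        stacked_sp[4], stacked_sp[5], stacked_sp[6], stacked_sp[7]
    };
    return f;
}

/* True if the address lies inside the assumed code region. */
bool addr_is_code(uint32_t addr)
{
    return addr >= CODE_START && addr < CODE_END;
}

/* A stacked PC outside the code region (e.g. inside a RAM text
 * buffer) hints at a corrupted return address on the stack. */
bool frame_looks_corrupted(const ExceptionFrame *f)
{
    return !addr_is_code(f->pc);
}
```

If your project intentionally runs code from RAM, the check has to accept that region as well.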
But it became a bit difficult to investigate why the stack should be too small:
Imagine: any code you run (which also uses the stack) can be interrupted at any time. On entry, the INT handler saves some registers on the stack currently in use and restores them from there at the end.
If your INT handler now executes code which consumes even more stack, or does something wrong, the stack is "corrupted" and the return from the ISR "runs into the trees".
In my case:
It turned out that one of my RTOS threads had been assigned a stack size that was too small. I have no idea in which thread it happened, but it looks to me as if a thread that uses its stack heavily had already consumed a lot of it, and when the INT interrupted the thread, the remaining thread stack was too small.
It destroyed other data, e.g. the stack of another thread, or it reached the bottom of the stack and wrote to a data region underneath the stack region.
Conclusion:
When you see your FW crashing in a random way, after a long time it fails (and hits a Fault_Handler) - check the stack size. In my case: check all the RTOS thread stack sizes.
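One common way to check how much of a stack has really been used is "stack painting": fill the stack with a known pattern before the thread starts and later count how many words still hold the pattern. The following is only a sketch on a plain array, assuming a stack that grows downward (as on Cortex-M), so the untouched words are at the low addresses. STACK_FILL, stack_paint and stack_unused_words are hypothetical names, not part of any library. FreeRTOS offers uxTaskGetStackHighWaterMark() for the same purpose, which is probably the better choice when FreeRTOS is in use.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u   /* hypothetical paint pattern */

/* Fill the whole stack area with the pattern before it is used. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Count the words, starting at the lowest address, that still hold
 * the pattern. With a downward-growing stack this is the smallest
 * amount of free space the stack ever had since painting. */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

A result of 0 means the stack was used down to its last word and has very likely overflowed.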
Bear in mind that every stack should be large enough that there is still room left when a HW INT happens.
But consider also what is done during an INT, in the ISR handler: if this ISR handler also uses the stack (for local, temporary variables), the remaining stack must be even larger: large enough for all the local variables used in any subsequent function call (from ISR entry, through the entire call tree).
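The "entire call tree" requirement can be made concrete: the worst-case stack need of a function is its own frame plus the deepest chain of its callees. The following is only a sketch, assuming a call graph without recursion and without calls through function pointers. FuncInfo, MAX_CALLEES and worst_case_stack are hypothetical names, and the frame sizes would have to come from somewhere else, e.g. from the per-function numbers GCC writes when built with -fstack-usage.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CALLEES 4

typedef struct {
    uint32_t frame_bytes;        /* stack used by the function itself */
    int callees[MAX_CALLEES];    /* table indices of callees, -1 = unused */
} FuncInfo;

/* Worst-case stack depth when entering function `idx`: its own frame
 * plus the deepest callee chain. Assumes the call graph has no
 * recursion, otherwise this would not terminate. */
uint32_t worst_case_stack(const FuncInfo *table, int idx)
{
    uint32_t deepest = 0;
    for (size_t i = 0; i < MAX_CALLEES; i++) {
        int callee = table[idx].callees[i];
        if (callee < 0) {
            continue;
        }
        uint32_t depth = worst_case_stack(table, callee);
        if (depth > deepest) {
            deepest = depth;
        }
    }
    return table[idx].frame_bytes + deepest;
}
```

For an ISR entry that uses 32 bytes and calls one function needing 200 bytes and another needing 64 bytes which itself calls a 48-byte function, the result is 32 + 200 = 232 bytes, because only the deepest branch counts.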
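A good practice for me is this: instead of a large local buffer, which is allocated on the stack:
void ISR_Handler(void) {
    int MyLocalBuffer[1024];    /* 1024 ints on the stack at every entry */
    ...
}
I use a static buffer, which does not consume stack:
void ISR_Handler(void) {
    static int MyBuffer[1024];  /* 1024 ints in static memory, not on the stack */
    ...
}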
But make sure that this "shared memory" (as static) can still work, e.g. when the function can be re-entered or is called from more than one context.
So, I check for many functions how many local variables are used and how large they are, esp. buffers and arrays that are "local" (allocated on the stack).
Personally, I avoid having many large local variables anywhere (also in sub-functions called in an RTOS thread). The unknown call tree, and how many local variables (and how much stack) it needs, makes it "unpredictable" when threads and INTs interrupt each other.
And do NOT check just the stack size settings in the Linker Script, esp. when using an RTOS: how much stack is assigned to every thread? Consider what happens if a HW INT kicks in (asynchronously, at any time) and what the ISR handler does (and how much stack it needs by itself), then increase the thread stack size by this "worst case" (the maximum stack needed for an INT service interrupting code at any time).
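The "worst case" budget can be written down as a small calculation. The following is only a sketch under assumptions: EXC_FRAME_BYTES is the basic Cortex-M exception frame of 8 words (more if FPU state is stacked), and required_stack_bytes and bytes_to_words are hypothetical helpers. Depending on the RTOS port, ISR code may run on the main stack instead of the thread stack; in that case isr_bytes is 0 for the thread and the ISR call tree has to be budgeted for the main stack. The word conversion is there because FreeRTOS task creation takes the stack depth in words, not bytes.

```c
#include <stdint.h>

/* Basic Cortex-M exception frame: 8 words, assuming no FPU state. */
#define EXC_FRAME_BYTES 32u

/* thread_bytes: worst case of the thread's own call tree
 * isr_bytes:    worst case ISR call tree that can land on this stack
 *               (0 if ISRs run on a separate stack)
 * margin_bytes: extra safety margin
 * The result is rounded up to a multiple of 8 bytes. */
uint32_t required_stack_bytes(uint32_t thread_bytes,
                              uint32_t isr_bytes,
                              uint32_t margin_bytes)
{
    uint32_t total = thread_bytes + EXC_FRAME_BYTES
                   + isr_bytes + margin_bytes;
    return (total + 7u) & ~7u;
}

/* Convert a byte count to 32-bit words, rounding up. */
uint32_t bytes_to_words(uint32_t bytes)
{
    return (bytes + 3u) / 4u;
}
```

With a thread needing 500 bytes, an ISR call tree of 232 bytes and a margin of 128 bytes, this gives 896 bytes, i.e. 224 words.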
Increasing the stack size for the main threads (running all the time) has solved my problem.
(Via the Linker Script I also added a "security margin" (unused space) between my MCU stack and the memory used by FreeRTOS, which is also used for thread stacks.)
So, in my experience, such a random crash is mainly caused by a "wrong" memory layout (and incorrect sizes of regions, esp. for stacks).
Bear in mind: when an RTOS is used, you have more than just one MCU stack.
2024-05-10 06:51 PM
I think AI (and ChatGPT) can never help to solve such a problem.
You have to be smart to come up with a "working debug strategy" (with an understanding of what is going on in your own system and in the MCU).
2024-05-14 10:18 AM
@tjaekel wrote: When you see your FW crashing in a random way, after a long time it fails (and hits a Fault_Handler) - check the stack size.
Indeed. And not just if you're using an RTOS - this can also happen "bare-metal".
Another common cause of such "strange" faults is buffer overrun - especially when the overrunning buffer is on the stack (a local - auto - within a function).
#BufferOverrun #StackOverflow #StackCorruption #BizarreFaults
2024-05-15 12:51 AM
Tracing via the MCU's trace functionality can also help, e.g. with Orbuculum or commercial vendor tools.
2024-05-15 01:37 AM - edited 2024-05-15 04:20 AM
I have an idea to solve this problem. I've had this idea for a while, but since I'm not doing anything with FreeRTOS at the moment, I haven't implemented it.
I've had stack overflows in many projects with FreeRTOS. It's the standard library that gave me the most problems: snprintf, cout and regex have caused stack overflows for me. For snprintf I would assume at least 2k of stack usage.
You write a pre-build and post-build script to gather stack information and change defines or const ints of required stack sizes. Then build again if stack size has changed:
This would be a lot of work. A simpler version would do the following:
Same as above except the last two steps are just comparing the defined stack size with the calculated stack size. You need to scan the code for certain defines or constants that define stack size per thread, but you can standardize the naming convention so that parsing is easier.
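The comparison step of this simpler version could look roughly like the following. It is only a sketch: StackCheck and count_undersized are hypothetical names, and in a real build the table would be generated by the scripts from the stack size defines and the calculated stack usage rather than written by hand.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;            /* thread name, for the report */
    uint32_t configured_bytes;   /* from the stack size define */
    uint32_t calculated_bytes;   /* from the stack usage analysis */
} StackCheck;

/* Count the threads whose configured stack is smaller than the
 * calculated need plus a margin. The index of the first offender
 * is stored in *first_bad, or -1 if all stacks are large enough. */
size_t count_undersized(const StackCheck *checks, size_t n,
                        uint32_t margin_bytes, int *first_bad)
{
    size_t bad = 0;
    *first_bad = -1;
    for (size_t i = 0; i < n; i++) {
        if (checks[i].configured_bytes <
            checks[i].calculated_bytes + margin_bytes) {
            if (bad == 0) {
                *first_bad = (int)i;
            }
            bad++;
        }
    }
    return bad;
}
```

An automated build could fail whenever the returned count is not 0, so an undersized stack is reported before the FW ever runs.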
This is still a lot of work, but I think it is doable. In automated builds this can be part of automated tests. I think this would prevent a lot of bugs. It may not catch all stack overflows if not perfectly implemented, but it will find some, which is worth it in my opinion.