2024-06-23 11:51 PM
Working on an STM32F203-based product and getting ARM exceptions of all kinds: Bus Fault, Usage Fault, Hard Fault, and MemManage Fault. The problem is that all I have is a year-old debug log. I can't reproduce any of the exceptions to inspect the stack values, so the old logs are the only thing I can debug from. I just want to know which areas of the code to look at for these kinds of exceptions.
We are using an external PEG library for the UI implementation. Here are a few of the logs I have:
ERR ARM BUS FAULT EXCEPTION [REBOOT] nBFSR=82, nBFAR=20FFA1FC, ReturnAddr =080394BE, type= Precise Data Access violation, thread = Eng Debug Log
ERR ARM HARD FAULT EXCEPTION [REBOOT] HFSR = 80000000 ReturnAddr = 0806717C type =Debug Event
ERR ARM USAGE FAULT EXCEPTION [REBOOT] UFSR=0004, ReturnAddr = 08039588 , type= EXC_RETURN thread= NvDataManager
I tried to trace the values of HFSR, UFSR, and BFAR via the map files but got nowhere. A live debug session isn't an option because the exceptions occur randomly; we haven't found any pattern, so we can't reproduce them and we are stuck.
Any help or suggestions on where to look would be highly appreciated.
2024-06-24 01:11 AM
You don't tell us much about the program that is triggering these faults: what language it was written in (we assume C or perhaps C++), or whether you are using an RTOS.
My first suspicion would be your software, but it is possible that it is a "hardware" issue - glitches or brown-out on power-rails, ESD events; clocking too fast for a given supply-voltage + FLASH-wait-state combination. Do revisit that and check the errata. And double-check that you have adequate power-supply-decoupling capacitors immediately adjacent to every Vdd, Vss pair. Violate the data-sheet's "operating conditions" and all bets are off...
I see that all the ReturnAddr values you show seem to be in FLASH (0x080yyyyy).
A major difficulty is that a software bug (if there is one) might corrupt a piece of memory that is used elsewhere (perhaps a return address on a stack, or an address stored in a pointer). The exception occurs only when that now-corrupt value is used, so the fault may give no immediate clue as to what's going on, until you notice that the corrupted value sits just past the end of a buffer that is overflowing.
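For reading the status values in old logs offline, a small decoder can help. This is only a sketch: the bit positions are the Cortex-M3 fault status bits as I understand them from the ARM documentation, and the names flag_name, decode_flags and FLAG_COUNT are made up for illustration. Do check the masks against the programming manual for your part.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type: one status bit and its mnemonic. */
typedef struct { uint32_t mask; const char *name; } flag_name;

/* BFSR as logged as a single byte (bits 8..15 of CFSR shifted down). */
static const flag_name BFSR_FLAGS[] = {
    { 1u << 0, "IBUSERR" },  { 1u << 1, "PRECISERR" }, { 1u << 2, "IMPRECISERR" },
    { 1u << 3, "UNSTKERR" }, { 1u << 4, "STKERR" },    { 1u << 7, "BFARVALID" },
};
/* UFSR as logged as a halfword (bits 16..31 of CFSR shifted down). */
static const flag_name UFSR_FLAGS[] = {
    { 1u << 0, "UNDEFINSTR" }, { 1u << 1, "INVSTATE" },  { 1u << 2, "INVPC" },
    { 1u << 3, "NOCP" },       { 1u << 8, "UNALIGNED" }, { 1u << 9, "DIVBYZERO" },
};
static const flag_name HFSR_FLAGS[] = {
    { 1u << 1, "VECTTBL" }, { 1u << 30, "FORCED" }, { 1u << 31, "DEBUGEVT" },
};

#define FLAG_COUNT(t) (sizeof(t) / sizeof((t)[0]))

/* Write the name of every set flag into out, separated by spaces. */
static void decode_flags(uint32_t value, const flag_name *table, size_t count,
                         char *out, size_t out_size)
{
    size_t used = 0;
    if (out_size == 0) return;
    out[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        if ((value & table[i].mask) == 0) continue;
        int n = snprintf(out + used, out_size - used, "%s%s",
                         used ? " " : "", table[i].name);
        if (n < 0 || (size_t)n >= out_size - used) break; /* output truncated */
        used += (size_t)n;
    }
}
```

Feeding it the logged values (BFSR 0x82, UFSR 0x0004, HFSR 0x80000000) should give PRECISERR plus BFARVALID, INVPC, and DEBUGEVT respectively, which agrees with the "type" text already in the logs. With BFARVALID set, the logged BFAR is the address that was actually accessed, so it is worth comparing 20FFA1FC against the SRAM range in your device's memory map.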
If software, and you're using C/C++, then it is very easy to slip up in memory handling.
Does your code use dynamic memory (malloc/calloc/free; new/delete)? If so, how do you handle the inevitable case when memory fragmentation causes an allocation to fail? You do check for out-of-memory on every allocation, don't you? How do you recover?
Of course your PEG library might do such allocations, hidden from you. Assuming there are no lurking bugs in there, you do check its return values every time as well? Is PEG thread-safe, or are you sure you only call it from one thread?
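To illustrate the kind of allocation check meant above, here is a minimal sketch in plain C. The names checked_alloc, copy_message and alloc_failures are hypothetical and not taken from your code or from PEG; the point is only that every allocation result is tested before use and that failures are counted so they can show up in a log.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counter so that allocation failures can be reported in a debug log. */
static unsigned long alloc_failures = 0;

/* Allocate a zeroed buffer. On failure, count it and return NULL
   so the caller has to deal with it explicitly. */
static void *checked_alloc(size_t count, size_t size)
{
    void *p = calloc(count, size);
    if (p == NULL) {
        alloc_failures++;
    }
    return p;
}

/* Copy a string into a fresh heap buffer.
   Returns 0 on success, -1 if no memory was available. */
static int copy_message(const char *src, char **out)
{
    size_t len = strlen(src) + 1;   /* include the terminating NUL */
    char *buf = checked_alloc(len, 1);
    if (buf == NULL) {
        *out = NULL;
        return -1;                  /* caller decides how to recover */
    }
    memcpy(buf, src, len);
    *out = buf;
    return 0;
}
```

The caller then has a single, obvious place to decide what "recover" means (drop the message, retry later, or reset cleanly) instead of writing through a null pointer.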
And what about passing a pointer to an automatic variable (allocated on the stack) that goes out-of-scope as soon as you return from the subroutine?
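A sketch of that mistake and one common way around it, assuming nothing about your code (format_id_bad and format_id are made-up names). The broken version is left in a comment so the block still compiles cleanly.

```c
#include <stddef.h>
#include <stdio.h>

/* BUG (do not do this): buf lives in format_id_bad's stack frame, so the
   returned pointer dangles as soon as the function returns. Whatever the
   next call or interrupt puts on the stack overwrites it.

   const char *format_id_bad(unsigned id)
   {
       char buf[16];
       snprintf(buf, sizeof buf, "ID-%u", id);
       return buf;
   }
*/

/* Safer: the caller owns the storage and passes its size.
   Returns 0 on success, -1 if the buffer is too small. */
static int format_id(unsigned id, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "ID-%u", id);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

The same hazard applies when the address of a local is handed to another thread, a queue, or a DMA transfer that outlives the function.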
Or accessing beyond either end of an array?
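For the array case, a minimal sketch of a guarded write, with the hypothetical names samples, SAMPLE_COUNT and store_sample. A signed index is used on purpose so that both ends of the array are checked.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_COUNT 8u

static uint16_t samples[SAMPLE_COUNT];

/* Store a value only if the index is inside the array.
   Returns false (and writes nothing) for an index off either end. */
static bool store_sample(int index, uint16_t value)
{
    if (index < 0 || (size_t)index >= SAMPLE_COUNT) {
        return false;
    }
    samples[index] = value;
    return true;
}
```

Logging the rejected index, rather than silently ignoring it, is what would have turned a random fault a year later into a line in the debug log.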
Are you sure you give enough space for each and every thread's stack? (Do you ever allocate variable-sized-arrays on a stack?)
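One way to answer the stack-size question with data is to "paint" each stack with a known pattern before the thread starts and later count how much of the pattern survives. The sketch below is plain C over an ordinary array; STACK_FILL, stack_paint and stack_unused_words are names I made up, and it assumes the stack grows downward (as on Cortex-M), so the lowest addresses are the last to be used. If you do use an RTOS, it may already provide an equivalent high-water-mark facility.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u  /* arbitrary pattern, unlikely to occur by chance */

/* Fill a stack region with the pattern before the thread starts running. */
static void stack_paint(uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = STACK_FILL;
    }
}

/* Count words, starting from the lowest address, that still hold the pattern.
   With a downward-growing stack this is the headroom that was never touched. */
static size_t stack_unused_words(const uint32_t *base, size_t words)
{
    size_t n = 0;
    while (n < words && base[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

A result of zero, or close to it, for any thread is a strong hint that its stack has overflowed into whatever sits below it in memory.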
Of course you need good version-control so you can be sure that the version of source-code you're looking at is the version that caused the exception!
But no, this kind of post-mortem is not easy, in my opinion.