Undefined Instruction Hard Fault

patrickwright · ‎2023-11-10

I am using an STM32L562VET6Q. My device contains both a (custom) bootloader and an application. The bootloader jumps into the application by reading the address of the Reset_Handler from the application's ISR vector table, then branching to this address. I had absolutely no issues with this until the other day. I performed a firmware upgrade on a group of devices, after which about 10% ceased responding. I placed a few of these on a debugger and, after branching to the application, they all throw an Undefined Instruction Hard Fault at the same instruction. This does not happen immediately at the start of the application but hundreds of instructions in. Keep in mind, this exact same firmware is running without issue on the other 90%. I have performed the following tests:

1. I downloaded (via the debugger) the contents of the flash memory and compared it to the image I uploaded. They match byte-for-byte (which is what I expected since the bootloader hash check passed).
2. I re-flashed a failed chip with the image using the debugger. The device still throws the same fault.
3. I looked at the PC pushed onto the stack by the fault handler to see at which address the fault was occurring. Looking at the list file, the instruction at this address is perfectly valid.
4. I downloaded the complete contents of a faulting device's flash, replaced the MCU on the board with a new MCU, then flashed this new chip with the image I cloned. The chip now boots without faulting.
5. I replaced the MCU on a working board with the one I removed in (4). Now this board faults. This leads me to the conclusion that this fault is related to the MCU itself and not the surrounding hardware.
6. I added a DSB and ISB instruction right before the branch to the application (thinking that the pipeline might hold a bad instruction after the branch). This made no difference.
7. Both the bootloader and application have the instruction cache enabled. I disabled the instruction cache in the bootloader (without changing the application image). The device now boots without issue.
8. I had the debugger break just before the address at which the fault occurs in the application and stepped through each instruction. When stepping, the fault does not occur and the application boots normally.
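
The stacked-PC check above can be sketched in code. This is a minimal, host-runnable sketch and is not taken from the firmware in question: `StackedFrame`, `FaultSummary` and `summarize_fault` are hypothetical names, and it assumes the basic 8-word exception frame (no floating-point context stacked). The bit positions are the architectural ones for SCB->CFSR and SCB->HFSR.

```cpp
#include <cstdint>

// Basic exception frame pushed by the core on exception entry
// (assumption: no floating-point context was stacked).
struct StackedFrame {
    std::uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

// Architectural fault-status bits.
constexpr std::uint32_t CFSR_UNDEFINSTR = 1u << 16;  // UFSR.UNDEFINSTR
constexpr std::uint32_t CFSR_INVSTATE   = 1u << 17;  // UFSR.INVSTATE
constexpr std::uint32_t HFSR_FORCED     = 1u << 30;  // configurable fault escalated to HardFault

// Hypothetical summary type for this example.
struct FaultSummary {
    std::uint32_t fault_pc;      // address to look up in the list file
    bool undefined_instruction;  // core decoded an undefined opcode
    bool invalid_state;          // e.g. execution attempted with the Thumb bit clear
    bool escalated;              // HardFault was forced by another fault
};

// Combine the stacked frame with raw CFSR/HFSR values read in the handler
// or with the debugger.
constexpr FaultSummary summarize_fault(const StackedFrame& frame,
                                       std::uint32_t cfsr,
                                       std::uint32_t hfsr) {
    return FaultSummary{frame.pc,
                        (cfsr & CFSR_UNDEFINSTR) != 0,
                        (cfsr & CFSR_INVSTATE) != 0,
                        (hfsr & HFSR_FORCED) != 0};
}
```

On target, the frame would be read from MSP or PSP (selected by EXC_RETURN in LR on handler entry), and `cfsr`/`hfsr` would be the values of SCB->CFSR and SCB->HFSR.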

At this point, I am forced to the conclusion that something is internally wrong with specific chips in combination with the specific sequence of instructions/addresses being executed. Based on my testing, it appears to be related to the instruction cache. I believe the instruction cache is somehow becoming "corrupted."

I have looked through the errata for this chip, but have not found any entries related to this issue. I would appreciate any further debugging steps I can follow to narrow down this issue.

STOne-32 · ‎2023-11-10

Dear @patrickwright ,

We have already escalated your case to our local FAE team so that they can work closely with you. In the meantime, here are a few suggestions/hypotheses:

1) It might be a timing issue, possibly linked to code alignment in memory. Since it works on 90% of devices, you can experiment with temperature: if you can raise or lower it in a chamber, see whether the good 90% stay good and whether any of the failing 10% recover.

2) If possible, dump the ICACHE registers when the fault is triggered and compare them with those of a good device.

3) Check whether low-power modes such as STOP or Sleep are used.

Have a good day,

STOne-32

Tesla DeLorean · ‎2023-11-10

Unpack the clocking information, PLL, buses, etc. Make sure the VCO isn't clocking too fast.

Check Flash wait states and VOS settings related to the MCU clocking rate.

Check VCAP voltages and capacitors, failure here can cause issues with fast execution, flash, and flash erase/writing.

Perhaps chat with your local sales or support engineer, or FAE. If this is a faulty part they might be able to RMA and do failure analysis. Pull Unique ID registers from part, and any trace codes on related packaging.


patrickwright · ‎2023-11-10

In response to your suggestions:

1) I should be able to test this in a temperature chamber, but it may take some time to get everything set up.

2) How do I generate a dump of the ICACHE? I thought this was internal to the core so I wasn't aware there was a way to "inspect" its contents.

3) I use low power modes, but, at the point in execution (during the initialization sequence) at which the fault occurs, low power mode has not been used yet.

Thank You!

patrickwright · ‎2023-11-10

I am using a clock speed of 110 MHz and voltage scale 0. According to Table 32 in the reference manual, I should have the latency set to 5. I have verified that the bootloader is setting the latency to 5 (I verified the code and read out the ACR register using the debugger) before switching the clock speed to 110 MHz.
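
The ACR readback described here can be expressed as a small check. This is only a sketch: `flash_latency_ws` and `latency_sufficient` are hypothetical names, the 4-bit LATENCY field at bits [3:0] of FLASH_ACR is an assumption to verify against the reference manual, and the required value of 5 is simply the figure quoted above.

```cpp
#include <cstdint>

// Assumption: LATENCY occupies bits [3:0] of FLASH_ACR on this device.
constexpr std::uint32_t FLASH_ACR_LATENCY_MASK = 0xFu;

// Extract the programmed number of flash wait states from a raw ACR value.
constexpr std::uint32_t flash_latency_ws(std::uint32_t acr) {
    return acr & FLASH_ACR_LATENCY_MASK;
}

// True if the programmed wait states meet or exceed the required count.
constexpr bool latency_sufficient(std::uint32_t acr, std::uint32_t required_ws) {
    return flash_latency_ws(acr) >= required_ws;
}
```

The same helper makes it easy to compare the readback from a failing part against a good one, or to confirm that a deliberately higher latency was actually applied.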

STOne-32 · ‎2023-11-10

Here is the ICACHE register map for the configuration and status bits.
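
To compare a failing part with a good one, the raw ICACHE_CR and ICACHE_SR values read with the debugger can be decoded as in the sketch below. `IcacheState`, `decode_icache` and `same_config` are hypothetical names, and the bit positions (EN, WAYSEL in CR; BUSYF, BSYENDF, ERRF in SR) are assumptions that should be checked against the register map in the reference manual.

```cpp
#include <cstdint>

// Assumed ICACHE_CR bit positions.
constexpr std::uint32_t ICACHE_CR_EN     = 1u << 0;  // cache enabled
constexpr std::uint32_t ICACHE_CR_WAYSEL = 1u << 2;  // 0: direct mapped, 1: n-way

// Assumed ICACHE_SR bit positions.
constexpr std::uint32_t ICACHE_SR_BUSYF   = 1u << 0;  // invalidation in progress
constexpr std::uint32_t ICACHE_SR_BSYENDF = 1u << 1;  // invalidation finished
constexpr std::uint32_t ICACHE_SR_ERRF    = 1u << 2;  // cache error flag

// Hypothetical decoded view of the two registers.
struct IcacheState {
    bool enabled;
    bool n_way;
    bool busy;
    bool busy_end;
    bool error;
};

// Decode raw register values captured at the moment of the fault.
constexpr IcacheState decode_icache(std::uint32_t cr, std::uint32_t sr) {
    return IcacheState{(cr & ICACHE_CR_EN) != 0,
                       (cr & ICACHE_CR_WAYSEL) != 0,
                       (sr & ICACHE_SR_BUSYF) != 0,
                       (sr & ICACHE_SR_BSYENDF) != 0,
                       (sr & ICACHE_SR_ERRF) != 0};
}

// True if two devices have the same enable and associativity settings.
constexpr bool same_config(const IcacheState& a, const IcacheState& b) {
    return a.enabled == b.enabled && a.n_way == b.n_way;
}
```

An ERRF or BUSYF difference between the two devices at the fault address would be the first thing to look for.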

David Littell · ‎2023-11-10

Random thoughts:

- Does the CubeProgrammer fault analysis give any additional information?

- Your bootloader invalidates the instruction cache before enabling it?

- Maybe set a higher latency than what Table 32 suggests?

- Date code/lot differences between the 90% and the failing 10%?

Pavel A. · ‎2023-11-10

@patrickwright Regarding ICACHE: how does the MCU "know" which code is the bootloader and which is the application? Does the application enable the ICACHE again when it is already enabled by the bootloader? Does the bootloader disable and invalidate the ICACHE before jumping to the application?

patrickwright · ‎2023-11-10

I invalidate the ICACHE (but still leave it enabled) before branching into the application. And yes, the application re-enables the ICACHE.

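// Complete outstanding memory accesses and flush the pipeline.
::__DSB();
::__ISB();
// Note: an APB1 enable function is called with an AHB1 peripheral mask here.
LL_APB1_GRP1_EnableClock(LL_AHB1_GRP1_PERIPH_ICACHE);
// Invalidate the ICACHE (it stays enabled), then branch to the application.
LL_ICACHE_Invalidate();
intImage->image.reset_handler();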

Even with invalidation, the fault still occurs.

However, since the application is just at a different set of addresses than the bootloader, I do not see why the MCU needs to "know" that it is in the bootloader or application. From the perspective of the MCU, it is just a branch to a different part of the code.
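
The branch itself depends only on the first two words of the application's vector table: the initial stack pointer and the Reset_Handler address. A sanity check on those two words might look like the sketch below. `VectorTableHead`, `MemoryRange` and `entry_looks_valid` are hypothetical names, and the address ranges passed in are placeholders for whatever the linker script actually defines.

```cpp
#include <cstdint>

// First two words of a Cortex-M vector table.
struct VectorTableHead {
    std::uint32_t initial_sp;     // word 0: initial main stack pointer
    std::uint32_t reset_handler;  // word 1: Reset_Handler address, bit 0 set for Thumb
};

// Hypothetical half-open address range [start, end).
struct MemoryRange {
    std::uint32_t start;
    std::uint32_t end;
};

// Handler addresses in the vector table must have bit 0 set (Thumb state).
constexpr bool is_thumb_address(std::uint32_t addr) {
    return (addr & 1u) != 0;
}

// Check that the stack pointer lies in RAM (the top of RAM is allowed),
// is 8-byte aligned, and that the reset handler is a Thumb address
// inside the application's flash range.
constexpr bool entry_looks_valid(const VectorTableHead& vt,
                                 const MemoryRange& app_flash,
                                 const MemoryRange& ram) {
    return vt.initial_sp > ram.start && vt.initial_sp <= ram.end &&
           (vt.initial_sp & 7u) == 0 &&
           is_thumb_address(vt.reset_handler) &&
           vt.reset_handler >= app_flash.start &&
           vt.reset_handler < app_flash.end;
}
```

A failure of this check would point at the image or the jump logic rather than at the cache, which helps separate the two hypotheses.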

Pavel A. · ‎2023-11-10

And does image.reset_handler() enable the ICACHE again? Could enabling the ICACHE a second time be causing the problem? Invalidate() does not really matter; the cache immediately fills again. Note that CMSIS functions for ICACHE are provided by ARM; the ST version in the Cube libraries is not the latest.