2023-11-10 07:05 AM
I am using an STM32L562VET6Q. My device contains both a (custom) bootloader and an application. The bootloader jumps into the application by reading the address of the Reset_Handler from the application's ISR vector table and then branching to this address. I have had absolutely no issues with this until the other day. I performed a firmware upgrade on a group of devices, after which about 10% ceased responding. I placed a few of these on a debugger and they all, after branching to the application, throw an Undefined Instruction Hard Fault at the same instruction. This does not happen immediately at the start of the application but hundreds of instructions in. Keep in mind, this exact same firmware is running without issue on the other 90%. I have performed the following tests:
At this point, I am forced to the conclusion that something is internally wrong with specific chips in combination with the specific sequence of instructions/addresses that are being executed. Based upon my testing, it appears to be related to the instruction cache. I believe the instruction cache is somehow becoming "corrupted."
I have looked through the errata for this chip, but have not found any entries related to this issue. I would appreciate any further debugging steps I can follow to narrow it down.
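For readers following along, the jump described above (read the Reset_Handler address from the application's vector table, then branch to it) can be sanity-checked before the branch is taken. Below is a minimal host-side sketch, not the poster's bootloader code. It relies only on the Cortex-M convention that word 0 of the vector table is the initial stack pointer and word 1 is the reset handler address with the Thumb bit set. The names vector_head_t and vector_head_parse, and the particular checks chosen, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* First two words of a Cortex-M vector table. */
typedef struct {
    uint32_t initial_sp;    /* word 0: initial main stack pointer           */
    uint32_t reset_handler; /* word 1: Reset_Handler address, bit 0 = Thumb */
} vector_head_t;

/* Decode a little-endian 32-bit word independent of host endianness. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical pre-jump check: returns true only if the first two vector
 * table words look plausible for an image placed in [app_start, app_end). */
static bool vector_head_parse(const uint8_t *image, size_t len,
                              uint32_t app_start, uint32_t app_end,
                              vector_head_t *out)
{
    assert(image != NULL && out != NULL);
    if (len < 8u) {
        return false;                      /* too short to hold two words */
    }
    out->initial_sp    = read_le32(image);
    out->reset_handler = read_le32(image + 4);

    if ((out->reset_handler & 1u) == 0u) {
        return false;                      /* Thumb bit must be set */
    }
    uint32_t entry = out->reset_handler & ~1u;
    if (entry < app_start || entry >= app_end) {
        return false;                      /* entry point outside the image */
    }
    if ((out->initial_sp & 7u) != 0u) {
        return false;                      /* stack pointer not 8-byte aligned */
    }
    return true;
}
```

A check like this only rules out a blank or misplaced image. It says nothing about the fault described in this thread, which occurs well after the branch.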
2023-11-10 08:58 AM
Dear @patrickwright ,
We have already escalated your case to our local FAE team so that they can work closely with you. In the meantime, here are a few suggestions and hypotheses:
1) It might be a timing issue, possibly linked to code alignment in memory. Since the firmware works on 90% of devices, you could vary the temperature: raise or lower it in a chamber and see whether the good 90% stay good and whether any of the failing 10% recover.
2) If possible, dump the ICACHE register map when the fault is triggered and compare it with a dump from a good device.
3) Check whether low-power modes such as Stop or Sleep are used.
Have a good day,
STOne-32
2023-11-10 09:04 AM
Unpack the clocking information, PLL, buses, etc. Make sure the VCO isn't clocking too fast.
Check Flash wait states and VOS settings related to the MCU clocking rate.
Check VCAP voltages and capacitors, failure here can cause issues with fast execution, flash, and flash erase/writing.
Perhaps chat with your local sales or support engineer, or FAE. If this is a faulty part they might be able to RMA and do failure analysis. Pull Unique ID registers from part, and any trace codes on related packaging.
2023-11-10 09:13 AM
In response to your suggestions:
1) I should be able to test this in a temperature chamber, but it may take some time to get everything set up.
2) How do I generate a dump of the ICACHE? I thought this was internal to the core so I wasn't aware there was a way to "inspect" its contents.
3) I use low power modes, but, at the point in execution (during the initialization sequence) at which the fault occurs, low power mode has not been used yet.
Thank You!
2023-11-10 09:22 AM
I am using a clock speed of 110 MHz and voltage scale 0. According to Table 32 in the reference manual, I should have the latency set to 5. I have verified that the bootloader sets the latency to 5 (I checked the code and read out the ACR register using the debugger) before switching the clock speed to 110 MHz.
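The 110 MHz / latency 5 pairing can be cross-checked with a small helper. This is a host-side sketch only. It assumes that at voltage range 0 each additional wait state covers a further 20 MHz of HCLK, up to a 110 MHz maximum. That assumption reproduces the value quoted above, but it should be verified against Table 32. The function name and constants are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION (verify against Table 32 of the reference manual): at voltage
 * range 0, one extra flash wait state is needed per 20 MHz of HCLK, and
 * HCLK may not exceed 110 MHz. */
#define SKETCH_HCLK_STEP_HZ  20000000u
#define SKETCH_HCLK_MAX_HZ  110000000u

/* Returns the minimum number of wait states for hclk_hz under the
 * assumption above, or -1 if hclk_hz is above the assumed maximum. */
static int flash_latency_range0(uint32_t hclk_hz)
{
    assert(hclk_hz > 0u);                  /* a 0 Hz clock is a caller bug */
    if (hclk_hz > SKETCH_HCLK_MAX_HZ) {
        return -1;
    }
    /* (0, 20 MHz] -> 0 WS, (20, 40 MHz] -> 1 WS, ..., (100, 110 MHz] -> 5 WS */
    return (int)((hclk_hz - 1u) / SKETCH_HCLK_STEP_HZ);
}
```

Comparing the result with the LATENCY field read back from the ACR register, on both a good and a failing device, is a quick way to confirm that the two are configured identically at the moment of the fault.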
2023-11-10 09:44 AM
Here is the ICACHE register map for the configuration and status bits.
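Once the ICACHE registers have been read out on both a failing and a good device (for example through the debugger's memory view), the comparison suggested earlier in the thread can be done mechanically. The sketch below is a host-side helper. The register names and offsets in the table are assumptions that should be checked against the reference manual, and reg_desc_t, reg_name and diff_dumps are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    const char *name;
    uint32_t    offset;   /* byte offset from the ICACHE base address */
} reg_desc_t;

/* ASSUMPTION: names and offsets as I recall them; verify against the
 * ICACHE register map in the reference manual before relying on them. */
static const reg_desc_t icache_regs[] = {
    {"CR",    0x00u}, {"SR",    0x04u}, {"IER",  0x08u}, {"FCR",  0x0Cu},
    {"HMONR", 0x10u}, {"MMONR", 0x14u}, {"CRR0", 0x20u}, {"CRR1", 0x24u},
    {"CRR2",  0x28u}, {"CRR3",  0x2Cu},
};

static const char *reg_name(uint32_t offset)
{
    for (size_t i = 0; i < sizeof icache_regs / sizeof icache_regs[0]; ++i) {
        if (icache_regs[i].offset == offset) {
            return icache_regs[i].name;
        }
    }
    return "(unnamed)";
}

/* Compares two word-wise dumps taken from the same base address.
 * Prints one line per differing word to out (pass NULL to stay silent)
 * and returns the number of words that differ. */
static size_t diff_dumps(const uint32_t *good, const uint32_t *bad,
                         size_t nwords, FILE *out)
{
    assert(good != NULL && bad != NULL);
    size_t ndiff = 0;
    for (size_t i = 0; i < nwords; ++i) {
        if (good[i] == bad[i]) {
            continue;
        }
        ++ndiff;
        if (out != NULL) {
            uint32_t offset = (uint32_t)(i * 4u);
            fprintf(out, "+0x%02lX %-9s good=0x%08lX bad=0x%08lX xor=0x%08lX\n",
                    (unsigned long)offset, reg_name(offset),
                    (unsigned long)good[i], (unsigned long)bad[i],
                    (unsigned long)(good[i] ^ bad[i]));
        }
    }
    return ndiff;
}
```

If the hit and miss monitors are enabled, their counters will normally differ between any two runs, so differences in the configuration and status words are the more interesting ones.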
2023-11-10 11:11 AM
Random thoughts:
- Does the CubeProgrammer fault analysis give any additional information?
- Your bootloader invalidates the instruction cache before enabling it?
- Maybe set a higher latency than what Table 32 suggests?
- Date code/lot differences between the 90% and the failing 10%?
2023-11-10 11:35 AM
@patrickwright Regarding ICACHE: how does the MCU "know" which code is the bootloader and which is the application? Does the application enable ICACHE again when it has already been enabled by the bootloader? Does the bootloader disable and invalidate ICACHE before jumping to the application?
2023-11-10 12:01 PM
I invalidate the ICACHE (but still leave it enabled) before branching into the application. And yes, the application re-enables the ICACHE.
::__DSB();
::__ISB();
LL_APB1_GRP1_EnableClock(LL_AHB1_GRP1_PERIPH_ICACHE);
LL_ICACHE_Invalidate();
intImage->image.reset_handler( );
Even with invalidation, the fault still occurs.
However, since the application is just at a different set of addresses than the bootloader, I do not see why the MCU needs to "know" that it is in the bootloader or application. From the perspective of the MCU, it is just a branch to a different part of the code.
2023-11-10 12:30 PM
And what about image.reset_handler(): does it enable ICACHE again? Could enabling ICACHE a second time be causing the problem? Invalidate() does not really matter, since the cache immediately fills again. Note that CMSIS functions for ICACHE are provided by ARM; the ST version in the Cube libraries is not the latest.