2024-07-11 07:31 AM
Hello,
We have a device that runs on an STM32G0. I have a bootloader at the beginning of the flash, then 2 pages for config, and the rest of the flash for the firmware. The bootloader basically checks whether the two config pages are OK and then starts the firmware. I now have 3 devices that suddenly stopped working, and all have the same issue:
The device starts in bootloader, prints some messages (UART) and then freezes and restarts when the watchdog times out. Based on the messages I see at the output, I know that the only thing the bootloader does before freezing is reading the two config flash pages (copying the content to a buffer with memcpy).
I see two possibilities:
1) the bootloader (code in flash) is somehow corrupt; however, I've write-protected the flash where the bootloader resides...
2) reading the internal flash (config pages) creates an interrupt (hard fault maybe?) and the bootloader is stuck in the busy loop of the interrupt handler (I have no output there unfortunately).
The devices have readout protection enabled (level 1), so I cannot reflash a modified bootloader to debug the problem. I did remove the readout protection on one device and reflashed it, and it worked without problems afterwards. I've now done a stress test on that device by saving the config pages many times (over 30'000 times so far, and it works without any problems)...
Any hints what I could do to find the problem?
2024-07-15 03:55 AM
Thanks a lot for your help!
Erasing a flash page and then reading it before writing does not cause the ECC double-bit error. I've tested it multiple times (I also erased the page and powered the device off; it still works at the next power-on).
A reset or power loss during the erase/write, however, can sometimes cause the ECC error. I've produced the error on a new device by resetting it during erase/write and then modified the bootloader by adding
if (READ_BIT(FLASH->ECCR, FLASH_ECCR_ECCD) != 0U) {
    SET_BIT(FLASH->ECCR, FLASH_ECCR_ECCD);
    return;
}
to the NMI_Handler to ignore the error. After that I was able to read and correct the config page and start up normally.
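A variant of this handler can also record that the error happened, so the code that read the page knows the copy cannot be trusted. The following is only a host-side sketch: `mock_eccr`, `MOCK_ECCR_ECCD`, `nmi_handler_body()` and `load_page_checked()` are hypothetical names, the register is a plain variable standing in for FLASH->ECCR, and the bit position is an assumption that should be checked against the reference manual.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Host-side stand-in for FLASH->ECCR. The bit position is an assumption
 * made for illustration; verify it against the reference manual. */
#define MOCK_ECCR_ECCD ((uint32_t)1U << 31)
static volatile uint32_t mock_eccr;

/* Set by the handler, read by the code that triggered the flash read. */
static volatile bool g_ecc_error_seen;

/* Body of what would be NMI_Handler() on the target. */
void nmi_handler_body(void)
{
    if ((mock_eccr & MOCK_ECCR_ECCD) != 0U) {
        /* On the target the flag is cleared by writing 1 to it;
         * the mock models only the effect, i.e. the bit goes to 0. */
        mock_eccr &= ~MOCK_ECCR_ECCD;
        g_ecc_error_seen = true;
        return;
    }
    /* Any other NMI source would still need its own handling here. */
}

/* Copies a page and reports whether an ECC error was flagged meanwhile.
 * Returns true when the copy completed without the flag being raised. */
bool load_page_checked(const void *src, uint8_t *dst, uint32_t size)
{
    g_ecc_error_seen = false;
    memcpy(dst, src, size);
    return !g_ecc_error_seen;
}
```

With this, the bootloader could treat a `false` return like a failed checksum and restore that page from the other copy.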
2024-07-11 09:40 AM
>>Any hints what I could do to find the problem?
Provide a back-door so you can inspect the memory.
Add better diagnostics to Error_Handler() and HardFault_Handler(), and any other while(1) loops of silent death you have, so you know how/where your implementation died.
Add more check-points and GPIO/LED output so you can determine how far into the code it gets before failure.
Have strapping options to get more telemetry and POST info.
Be careful modifying/erasing sectors in situations where imminent power failure might occur.
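One way to get diagnostics out of the silent while(1) loops is to store a small record before resetting and print it on the next boot. This is a minimal sketch under stated assumptions: all names here (`fault_record_t`, `checkpoint()`, `fault_record_store()`, `fault_record_take()`, `FAULT_MAGIC`) are hypothetical, and on the target the record would have to live in RAM that the startup code does not zero, which is linker-specific and not shown.

```c
#include <stdint.h>
#include <stdbool.h>

#define FAULT_MAGIC 0xFA17C0DEu /* arbitrary marker value, hypothetical */

typedef enum {
    FAULT_NONE = 0,
    FAULT_HARDFAULT,
    FAULT_NMI,
    FAULT_ERROR_HANDLER
} fault_cause_t;

typedef struct {
    uint32_t magic;      /* FAULT_MAGIC when the record is valid */
    uint32_t cause;      /* one of fault_cause_t */
    uint32_t checkpoint; /* last checkpoint reached before the fault */
} fault_record_t;

/* On the target this would need to sit in a non-initialised RAM section. */
static fault_record_t g_fault_record;
static uint32_t g_checkpoint;

/* Call at interesting points in the boot sequence. */
void checkpoint(uint32_t id)
{
    g_checkpoint = id;
}

/* Call from HardFault_Handler(), NMI_Handler(), Error_Handler(), ... */
void fault_record_store(fault_cause_t cause)
{
    g_fault_record.cause = (uint32_t)cause;
    g_fault_record.checkpoint = g_checkpoint;
    g_fault_record.magic = FAULT_MAGIC; /* written last: marks record valid */
}

/* Call early after reset; returns true once per stored record. */
bool fault_record_take(fault_record_t *out)
{
    if (g_fault_record.magic != FAULT_MAGIC) {
        return false;
    }
    *out = g_fault_record;
    g_fault_record.magic = 0U; /* consume the record */
    return true;
}
```

The bootloader could then print the cause and checkpoint over UART right after reset, before it touches the config pages again.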
2024-07-11 10:09 AM
I see other possibilities, too:
3. the bootloader wants to do something else *after* reading the config pages, but that something fails (we don't know what the bootloader is intended to do, surely not just checking some config)
4. the bootloader starts the application, but the application faults or falls into a loop early on, before taking any outward action
Now what to do.
I'd put the remaining two failed boards aside until you are confident about what actions can be taken, and experiment on the third (or any other) board by deliberately introducing errors of kind 1, 2, 3, 4 as described above, or any other, locking it to Level 1, and then trying to find out what can be observed on it (e.g. state of various pins, power consumption changes, ?). I believe even in RDP Level 1 you can still connect and read out the processor registers, and probably some other areas, too.
JW
2024-07-11 01:07 PM
Hello,
3) and 4) are not possible because there would be additional output visible, and there is none: after I read the config and do some CRC checksum checks, I report whether the config is valid, and I do not see that output. The checksum checks do not contain anything that could possibly fail, so it must be either 1) or 2).
Is 1) (corrupt bootloader) even possible when I have protected that area?
2024-07-11 01:16 PM
This is how I read the two flash pages
#define FLASH_BASE_ADDRESS 0x08000000
#define CONFIG_1_ADDRESS 0x00007800
#define CONFIG_2_ADDRESS 0x00008000
#define CONFIG_SIZE 2048
#define FLASH_PAGE_1 (FLASH_BASE_ADDRESS | CONFIG_1_ADDRESS)
#define FLASH_PAGE_2 (FLASH_BASE_ADDRESS | CONFIG_2_ADDRESS)
__ALIGN_BEGIN static uint8_t g_mem_buffer_1[CONFIG_SIZE] __ALIGN_END;
__ALIGN_BEGIN static uint8_t g_mem_buffer_2[CONFIG_SIZE] __ALIGN_END;
void loadPage(uint32_t address, uint8_t* data, uint32_t size)
{
    memcpy(data, (const void*)address, size);
}
...
loadPage(FLASH_PAGE_1, g_mem_buffer_1, sizeof(g_mem_buffer_1));
loadPage(FLASH_PAGE_2, g_mem_buffer_2, sizeof(g_mem_buffer_2));
...
2024-07-12 02:37 AM
What do you mean by
Be careful modifying/erasing sectors in situations where imminent power failure might occur.
?
I know that one page could be garbage when a power failure occurs during the erase/write; that's why I have two pages and restore the faulty one if necessary. Could a power failure during erase/write affect other pages as well?
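The two-page scheme described here can be sketched roughly as follows. This is not the actual bootloader code: the page layout (payload followed by a little-endian CRC-32 in the last 4 bytes), the choice of the common reflected CRC-32 polynomial, and the names `crc32_calc()`, `page_is_valid()` and `choose_page()` are all assumptions made for illustration.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), no lookup table. */
uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? ((crc >> 1) ^ 0xEDB88320u) : (crc >> 1);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Assumed layout: payload, then CRC-32 of the payload in the last 4 bytes. */
bool page_is_valid(const uint8_t *page, size_t size)
{
    if (size < 4u) {
        return false;
    }
    size_t n = size - 4u;
    uint32_t stored = (uint32_t)page[n]
                    | ((uint32_t)page[n + 1u] << 8)
                    | ((uint32_t)page[n + 2u] << 16)
                    | ((uint32_t)page[n + 3u] << 24);
    return crc32_calc(page, n) == stored;
}

typedef enum { USE_NONE, USE_PAGE_1, USE_PAGE_2 } page_choice_t;

/* Prefer page 1; fall back to page 2; report when neither copy is usable. */
page_choice_t choose_page(const uint8_t *p1, const uint8_t *p2, size_t size)
{
    if (page_is_valid(p1, size)) {
        return USE_PAGE_1;
    }
    if (page_is_valid(p2, size)) {
        return USE_PAGE_2;
    }
    return USE_NONE;
}
```

Note that a check like this works on the RAM copies, so it only helps once the read from flash itself has completed.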
2024-07-12 04:51 AM
As I've said above, try to attack the problem from observing the behaviour - connect with debugger and observe at least the processor registers (namely PC).
JW
2024-07-12 08:12 AM
Yesterday I flashed the normal bootloader and a special firmware that writes the config pages multiple times (as described above). I wrote the config pages about 50'000 times (with a reset after every 100 writes; the bootloader always started up normally and loaded the firmware) and shut down the device afterwards.
Today, the device had the exact same problem (freeze in bootloader), but this time readout protection was off, so I could read the whole flash content. I've programmed that exact flash content onto a new device and that one works without problems! I've also checked that all option bytes were the same...
I then attached a debugger, and the CPU is in NMI_Handler(), called during the memcpy in the code I've shown above. The CCR SFR has the flags UNALIGN_TRP and STKALIGN set.
How can the exact same code have this problem on one device and none on another one? I'm quite sure if I reflash the faulty one with the same code again it will work as well.
2024-07-14 02:47 PM
The FLASH in 'G0 is ECC protected. I guess you are trying to read erased-but-not-programmed double-words, or, maybe worse, you are overwriting some double-words twice.
JW
2024-07-14 03:49 PM
I always erase the page before I write it. Of course it is possible that the device loses power or resets (BOR) after the erase and before the write. So I'll have to check whether the page is just erased and not written before trying to read it? How?
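For the narrower question of telling an erased page from a written one, a blank check could look like the sketch below. It assumes that erased flash reads back as 0xFF in every byte, and `page_is_erased()` is a hypothetical helper name. The check itself has to read the page, so it cannot guard against a read that faults.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Returns true when every byte equals the assumed erased value 0xFF. */
bool page_is_erased(const uint8_t *page, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        if (page[i] != 0xFFu) {
            return false;
        }
    }
    return true;
}
```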