FLASH ECC Codes Cause Bus Fault on STM32H743

jaakjensen · 2022-08-26

Hello,

For a few months now I have been having issues with writing data to flash memory on the STM32H743ZIT6. Most of the time everything works and I can read the data back from flash successfully, but every now and again the flash memory gets corrupted somehow and ECC errors are raised (both single and double), which causes a bus fault on my device.

I am running the flash peripheral / AXI bus at 240 MHz with the flash wait states set to 4 WS (5 flash clock cycles). I have followed all the guidelines related to the HW design and have the correct capacitance on the VCAP pins. I write 256 bits of data to flash approximately every 10 seconds to save the state of my device (I increment the write address after each write so that I never write to the same position twice). When the sector is full, it is erased and I start writing again from the start of the sector. I also set the BOR bits to the highest voltage setting to try to prevent brown-out issues. I think these are most of the important settings you need to know.
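
For illustration, the append-then-wrap write pattern described above could be sketched as follows. This is not the poster's code: `state_log_t` and `state_log_next_offset` are hypothetical names, and the sketch only computes where the next 256-bit record would go and whether the sector must be erased first. It assumes a 128 KB sector and a 32-byte (256-bit) flash word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE_BYTES (128u * 1024u)  /* assumed sector size */
#define SLOT_SIZE_BYTES   32u             /* one 256-bit flash word */
#define SLOTS_PER_SECTOR  (SECTOR_SIZE_BYTES / SLOT_SIZE_BYTES)

/* Hypothetical bookkeeping for the state log: index of the next free slot. */
typedef struct {
    uint32_t next_slot;
} state_log_t;

/*
 * Returns the byte offset (from the sector base) for the next state record.
 * Sets *needs_erase to 1 when the sector is full, meaning the caller must
 * erase it successfully before programming at the returned offset.
 */
uint32_t state_log_next_offset(state_log_t *log, int *needs_erase)
{
    assert(log != NULL && needs_erase != NULL);

    *needs_erase = 0;
    if (log->next_slot >= SLOTS_PER_SECTOR) {
        *needs_erase = 1;     /* wrap: sector is full */
        log->next_slot = 0;
    }

    uint32_t offset = log->next_slot * SLOT_SIZE_BYTES;
    log->next_slot++;
    return offset;
}
```

With these sizes a sector holds 4096 records, so at one write every 10 seconds an erase would be needed roughly every 11 hours.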

Today, while looking at my register settings in debug mode, I noticed that WRHIGHFREQ was set to 3 (binary 11) by default... I can't find anywhere in the HAL or my code where this is done, so it must be set automatically. The manual only lists valid settings for 0, 1, and 2 (see below). Can anyone tell me how the STM32H743ZIT6's flash module behaves when WRHIGHFREQ is set to 3? Is it just invalid / undefined? Maybe this is my issue?

Does anyone have any ideas?

FBL · 2022-09-05

Hi @jaakjensen

Here are some more suggestions:

1- Log the start & end of each erase, the start & end of each program operation, and the start & end of each reset. Also record a timestamp with each event so you can follow the sequence.

2- I suggest lowering the BOR level to see what impact it has.

Hoping to resolve your issue

To give better visibility on the answered topics, please click on Accept as Solution on the reply which solved your issue or answered your question.

Best regards,
FBL

jaakjensen · 2022-09-05

Thanks so much for the response.

I will try these things and report back.

jaakjensen · 2022-09-13

1 - I am investigating this today and taking timing measurements. Can you clarify what you mean by "start and end of reset"?

In addition, I am a little confused about why timing measurements are required if I am using the HAL, which has built-in "FLASH_WaitForLastOperation()" calls before and after writing and erasing.

FBL · 2022-09-14

1- You can log a word via UART, for example, to indicate each startup. However, I think the start and end of reset can only be observed with an oscilloscope, for example, by following the rising and falling edges.

2- Timing will help you identify whether the code is getting stuck somewhere.

Best regards,
FBL

jaakjensen · 2022-09-14

Hi @F.Belaid I have made an interesting discovery regarding this issue. I discovered it while analyzing the timing of the flash erase and write requests.

I am noticing that sometimes (it seems) a "state" write fails and leaves an unwritten 256-bit block in memory. I don't know why it fails yet. The system assumes that this write was successful and starts writing the next block 10 seconds later.
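
One way to avoid assuming a write succeeded is to read the block back and compare it before advancing the write address. The sketch below is only an illustration under stated assumptions: `write_state_verified` and the `program_fn` callback are hypothetical, and the two `demo_program_*` functions are stand-ins for the real flash programming call so that the example runs on a RAM buffer. On the target, the readback would go through memory-mapped flash and, if the word is corrupted, may itself raise an ECC error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE_BYTES 32u  /* one 256-bit flash word */

/* Hypothetical programming callback: returns 0 when it reports success. */
typedef int (*program_fn)(uint8_t *dest, const uint8_t *src);

/*
 * Programs one slot and verifies it by readback.
 * Returns 0 on success, -1 if the programming call reported an error,
 * -2 if it reported success but the slot content does not match.
 */
int write_state_verified(program_fn program, uint8_t *dest, const uint8_t *src)
{
    assert(program != NULL && dest != NULL && src != NULL);

    if (program(dest, src) != 0) {
        return -1;
    }
    if (memcmp(dest, src, SLOT_SIZE_BYTES) != 0) {
        return -2;
    }
    return 0;
}

/* Demo stand-in: behaves like a working programming operation. */
int demo_program_ok(uint8_t *dest, const uint8_t *src)
{
    memcpy(dest, src, SLOT_SIZE_BYTES);
    return 0;
}

/* Demo stand-in: reports success but leaves the slot unwritten. */
int demo_program_silent_fail(uint8_t *dest, const uint8_t *src)
{
    (void)dest;
    (void)src;
    return 0;
}
```

The caller would only move on to the next slot when the return value is 0, and could retry or flag the slot otherwise.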

During the next bootup, the program scans for the last 256-bit block in memory whose first word is "STAT" in ASCII. It sees the blank block where the write failed and assumes that the block before it is the last successfully saved state. It then starts writing its state from there, so it eventually overwrites the two already-written 256-bit blocks that follow, which causes an ECC error. The error sometimes occurs immediately when those blocks are overwritten and sometimes during the next power cycle.
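
A boot-time scan that tolerates such a gap could look for the first slot after which everything is still erased, instead of stopping at the first blank block. The following is a minimal sketch of that idea, not the poster's implementation: `find_next_free_slot` and `find_last_valid_state` are hypothetical names, the functions operate on a plain byte buffer standing in for the sector, and erased flash is assumed to read as 0xFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE_BYTES 32u  /* one 256-bit flash word */

/* Returns 1 if every byte of the slot reads as erased (0xFF). */
static int slot_is_erased(const uint8_t *slot)
{
    for (size_t i = 0; i < SLOT_SIZE_BYTES; i++) {
        if (slot[i] != 0xFFu) {
            return 0;
        }
    }
    return 1;
}

/*
 * Scans backward from the end of the sector and returns the index of the
 * first slot from which ALL remaining slots are erased. A blank slot with
 * written slots after it is skipped, so it is never treated as the end of
 * the log. Returns slot_count when no free slot remains.
 */
size_t find_next_free_slot(const uint8_t *sector, size_t slot_count)
{
    assert(sector != NULL);

    size_t next = slot_count;
    while (next > 0 && slot_is_erased(sector + (next - 1) * SLOT_SIZE_BYTES)) {
        next--;
    }
    return next;
}

/*
 * Returns the index of the newest slot below `limit` that starts with the
 * ASCII marker "STAT", or -1 if there is none.
 */
long find_last_valid_state(const uint8_t *sector, size_t limit)
{
    assert(sector != NULL);

    for (size_t i = limit; i > 0; i--) {
        if (memcmp(sector + (i - 1) * SLOT_SIZE_BYTES, "STAT", 4) == 0) {
            return (long)(i - 1);
        }
    }
    return -1;
}
```

For a sector whose slots 0, 1, 3 and 4 are written and slot 2 is blank, this returns slot 5 as the next write position and slot 4 as the last saved state, whereas a scan that stops at the first blank block would resume at slot 2.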

Now I just have to figure out what is causing the block writing failure.

jaakjensen · 2022-09-14

Alright, so I have also narrowed down WHEN it occurs. It seems to be related to an erase error.

To give some more context about my program: there is a high priority interrupt that goes off every 1 msec, which is used to process a bunch of data supplied by the DMA. When the processing is done (typically after 0.6 msecs) we flush the cache and then release the CPU to handle lower priority tasks such as writing the state and updating the UI in the while() loop.

When things work as expected, the erase of the sector takes >4 seconds to complete. I just noticed that it blocks the high priority interrupt that handles processing of the data... this is strange and I don't know how it is able to do this. Sometimes, though, the erase is very brief (see below) and the high priority interrupt is not blocked. The P0 timing marker represents the "erase" while the P1 timing marker represents the "write". It is often in this situation that the failed write occurs.

jaakjensen · 2022-09-14

It seems that in the second scenario, HAL_FLASHEx_Erase() returns HAL_ERROR.

jaakjensen · 2022-09-14

The error reported in HAL_FLASHEx_Erase() is from this section. I modified the return status to confirm which section reported the error.

jaakjensen · 2022-09-14

It seems like this error may be an "RDPERR2"? The pFlash error code returned is 0x8080_0000, which is strange because I have not set up a PCROP-protected word or an RDP-protected area.
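
The reading of 0x8080_0000 as a bank 2 read-protection error can be checked with a small decoder. This is a sketch under two assumptions that should be verified against the reference manual and the HAL headers: that the HAL marks bank 2 error codes by setting bit 31, and that the FLASH_SR error flags sit at the bit positions listed in the table. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: bit 31 is set in error codes that originate from bank 2. */
#define FLASH_ERR_BANK2_MARKER 0x80000000u

typedef struct {
    uint32_t    mask;
    const char *name;
} flash_err_bit_t;

/* Assumption: FLASH_SR error flag positions on the STM32H743. */
static const flash_err_bit_t flash_err_bits[] = {
    { 1u << 17, "WRPERR"   },
    { 1u << 18, "PGSERR"   },
    { 1u << 19, "STRBERR"  },
    { 1u << 21, "INCERR"   },
    { 1u << 22, "OPERR"    },
    { 1u << 23, "RDPERR"   },
    { 1u << 24, "RDSERR"   },
    { 1u << 25, "SNECCERR" },
    { 1u << 26, "DBECCERR" },
    { 1u << 28, "CRCRDERR" },
};

/* Returns nonzero if the error code carries the bank 2 marker. */
int flash_error_is_bank2(uint32_t code)
{
    return (code & FLASH_ERR_BANK2_MARKER) != 0u;
}

/*
 * Stores the names of the error flags set in `code` into `out`
 * (at most `max` entries) and returns how many were stored.
 */
size_t flash_error_names(uint32_t code, const char **out, size_t max)
{
    assert(out != NULL);

    size_t n = 0;
    for (size_t i = 0; i < sizeof flash_err_bits / sizeof flash_err_bits[0]; i++) {
        if ((code & flash_err_bits[i].mask) != 0u && n < max) {
            out[n++] = flash_err_bits[i].name;
        }
    }
    return n;
}
```

Under these assumptions, 0x80800000 decodes to bank 2 with only RDPERR set, which agrees with the RDPERR2 interpretation.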

jaakjensen · 2022-09-14

I can confirm that both bank 1 and bank 2 have RDP set to 0xAA, i.e. Level 0 protection.