FLASH ECC Codes Cause Bus Fault on STM32H743

jaakjensen · ‎2022-08-26

Hello,

For a few months now I have been having issues with writing data to flash memory on the STM32H743ZIT6. Most of the time, everything works great and I am able to read the data from flash successfully, but every now and again, the flash memory gets corrupted somehow and ECC codes are thrown (both single and double), which causes my device to have a bus fault.

I am running the flash peripheral / AXI bus at 240 MHz and I have the flash wait states set to 4 WS (5 Flash clock cycles). I have followed all the guidelines related to the HW design and have the correct core capacitance on the VCAP pins. I write 256 bits of data to the flash memory approximately every 10 seconds to save the state of my device (of course I increment the write address after each write so that I'm not writing to the same position every time). When the sector is filled, it is erased and I start writing again from the start of the sector. I also set the BOR bits to the highest voltage setting to try to prevent brown-out issues. I think these are most of the important settings you need to know.
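
A minimal sketch of the append-and-wrap addressing described above may help readers follow the scheme. It assumes a single 128 KB sector used as a log of 256-bit (32-byte) records. The base address, the macro names, and the `log_advance` helper are all hypothetical and are not taken from the poster's firmware. The sketch only computes addresses; the actual erase and program calls are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: one 128 KB sector used as an append-only log of
 * 256-bit (32-byte) records. The base address is illustrative only. */
#define LOG_SECTOR_BASE  0x081C0000u
#define LOG_SECTOR_SIZE  (128u * 1024u)
#define LOG_SLOT_SIZE    32u              /* 256 bits */

typedef struct {
    uint32_t next_addr;   /* address to use for the following write */
    int      erase_first; /* nonzero: sector is full, erase before writing */
} log_step_t;

/* Given the address that was just written, work out where the next
 * record goes and whether the sector must be erased beforehand. */
static log_step_t log_advance(uint32_t written_addr)
{
    assert(written_addr >= LOG_SECTOR_BASE);
    assert(written_addr < LOG_SECTOR_BASE + LOG_SECTOR_SIZE);
    assert((written_addr % LOG_SLOT_SIZE) == 0u);

    log_step_t step;
    uint32_t candidate = written_addr + LOG_SLOT_SIZE;

    if (candidate >= LOG_SECTOR_BASE + LOG_SECTOR_SIZE) {
        /* Ran off the end of the sector: wrap and request an erase. */
        step.next_addr = LOG_SECTOR_BASE;
        step.erase_first = 1;
    } else {
        step.next_addr = candidate;
        step.erase_first = 0;
    }
    return step;
}
```

With these assumed constants, the record after the first slot lands 0x20 bytes further on, and the record after the last slot (0x081DFFE0) wraps back to the base with `erase_first` set.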

Today while looking at my register settings in debug mode I noticed that the WRHIGHFREQ setting was set to 3 (aka 11) by default... I can't find anywhere in the HAL / code where this is done so it must be set automatically. The manual only lists valid settings for 0, 1, and 2 (see below). Can anyone tell me what the behavior of the STM32H743ZIT6's flash module is when a setting of 3 is used for WRHIGHFREQ? Is it just invalid / undefined? Maybe this is my issue?

Does anyone have any ideas?

FBL · ‎2022-09-16

Hello @jaakjensen

Can you try

1- Check and clear the RDSERR and RDPERR error flags prior to the erase/write operations

2- Disable all interrupts before erase and program.

When debugging, the Cortex core may be accessing memory on its own. It could reach a reserved zone, which would raise an error that occurs only when accessing an RDP-protected area, so this may explain what you are seeing.

I have found some related posts that could help you

https://community.st.com/s/question/0D50X0000BaKiDBSQ0/spurious-rdperr-and-rdserr-when-all-protection-and-security-settings-are-off?t=1663333087127

To give better visibility on the answered topics, please click on Accept as Solution on the reply which solved your issue or answered your question.

Pavel A. · ‎2022-08-27

Check also how you erase. Is the "voltage range" parameter good?

jaakjensen · ‎2022-08-27

I think it is "good". I am using a voltage range of 4 AKA a programming parallelism of double-word. I assume this is OK. There is no documentation in the reference manual about which is "correct" - it is just a tradeoff between timing and power consumption.

jaakjensen · ‎2022-08-27

I should also note that I am using the HAL_FLASH_Program() function to carry out my programming.

jaakjensen · ‎2022-08-29

Does anyone else have feedback on this topic? Still seeking advice.

FBL · ‎2022-08-30

Hi @jaakjensen

Please check the FLASH_ACR register in the Reference Manual. Its reset value is 0x0000 0037, so the 3 is the WRHIGHFREQ field (bits 5:4) and the 7 is the LATENCY field.
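
The field split can be checked with a few lines of C. This is only a sketch: the macro and function names are local to the example, not the CMSIS definitions, and the layout (LATENCY in the low nibble, WRHIGHFREQ in bits 5:4) is assumed from the description in this reply.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field layout: LATENCY in bits 3:0, WRHIGHFREQ in bits 5:4.
 * These names are local to this sketch. */
#define ACR_LATENCY_MSK     0x0000000Fu
#define ACR_WRHIGHFREQ_POS  4u
#define ACR_WRHIGHFREQ_MSK  (0x3u << ACR_WRHIGHFREQ_POS)
#define ACR_RESET_VALUE     0x00000037u   /* reset value quoted above */

/* Compile-time sanity check that the two fields do not share bits. */
static_assert((ACR_LATENCY_MSK & ACR_WRHIGHFREQ_MSK) == 0u,
              "fields must not overlap");

/* Extract the LATENCY field from a raw FLASH_ACR value. */
static uint32_t acr_latency(uint32_t acr)
{
    return acr & ACR_LATENCY_MSK;
}

/* Extract the WRHIGHFREQ field from a raw FLASH_ACR value. */
static uint32_t acr_wrhighfreq(uint32_t acr)
{
    return (acr & ACR_WRHIGHFREQ_MSK) >> ACR_WRHIGHFREQ_POS;
}
```

Applied to `ACR_RESET_VALUE`, `acr_latency` yields 7 and `acr_wrhighfreq` yields 3, matching the values read in the debugger.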

Also note that the application software has to program them to the correct value depending on the embedded Flash memory interface frequency.

jaakjensen · ‎2022-08-30

Hi @F.Belaid

Thank you for the response. I forgot that all registers have a default reset value that is defined in the reference manual. Thank you for pointing that out. Still, the reset value, 3, is not a valid setting for WRHIGHFREQ for any clock frequency on STM32H743 according to Table 17 in RM0433, which I think is strange.

In addition, neither the HAL nor the STM32CubeMX platform handles setting this value, even though they handle the LATENCY - the auto-generated code for a 240 MHz AXI bus sets the LATENCY to the recommended value but does not set the WRHIGHFREQ.

Unless the user reads the reference manual and happens to see Table 17 and then goes digging in the FLASH LL drivers, they would probably not know how to set this setting.

Two questions:

1- Do you know who I could report this issue to about the HAL / STM32CubeMX platform not handling the WRHIGHFREQ setting?
2- Could WRHIGHFREQ being set to an incorrect value lead to ECC errors being tripped?

FBL · ‎2022-08-31

Hi again @jaakjensen

To continue investigating the issue you are facing, I have some more questions and proposals:

1- Is the ECC error related to the last written word, or does it occur at a random address?

2- How long has this fault been showing up? Have you been erasing/writing the flash for a long period of time? Remember to check the memory characteristics so that you do not exceed the Flash memory endurance and data retention limits (refer to table 151 in the product datasheet).

If you think that you exceeded maximum allowed values, it is recommended to use Backup SRAM for longer lifetime.

3- My understanding is that a value of 3 for WRHIGHFREQ shouldn't create an issue. The reset value covers all intended frequencies in the table, but with a larger latency. Can you change WRHIGHFREQ to the value 0x2, for example? If you confirm that updating this bitfield resolves the issue, we need to investigate this on our side and report it internally for the HAL and CubeMX implementation.

4- Is the issue seen with only one sector every time? If yes, does it also appear if you use another sector?

5- Can you try to erase the sector where you are having this error using CubeProgrammer?

6- Please make sure to properly follow the sector erase procedure in your application, as described in section "4.3.10 FLASH erase operations" of the Reference Manual.
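
For item 3, changing only the WRHIGHFREQ bitfield is a read-modify-write on FLASH_ACR. The sketch below shows just the bit manipulation on a plain integer so it can be checked off-target. The macro and function names are hypothetical, and bits 5:4 are assumed as the field position. On the device, the result would be written back to the real register through the vendor's device header and then read back to confirm it took effect.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed position of WRHIGHFREQ inside FLASH_ACR (bits 5:4).
 * Names are local to this sketch, not the CMSIS definitions. */
#define ACR_WRHIGHFREQ_POS  4u
#define ACR_WRHIGHFREQ_MSK  (0x3u << ACR_WRHIGHFREQ_POS)

/* Return 'acr' with WRHIGHFREQ replaced and every other bit preserved. */
static uint32_t acr_with_wrhighfreq(uint32_t acr, uint32_t wrhighfreq)
{
    assert(wrhighfreq <= 0x3u);   /* the field is two bits wide */
    return (acr & ~ACR_WRHIGHFREQ_MSK) | (wrhighfreq << ACR_WRHIGHFREQ_POS);
}
```

Starting from a register value of 0x34 (4 wait states with WRHIGHFREQ still at 3), requesting 0x2 gives 0x24, leaving the LATENCY field untouched.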

jaakjensen · ‎2022-08-31

1- It's not clear when it happens. When my product powers up for the first time, it erases the flash sector where I've decided to store data (bank 2, sector 6) and then it writes the first 256 bits of data to the start of the sector. Every ten seconds I increment the write address by 256 bits and then write 256 bits again. I repeat this process while the device is powered on to save various state settings. When the sector fills up with data and the write address reaches the address of sector 7, it erases sector 6 and resets the write address to the start of sector 6.

2- This has been showing up for 4 months. It just happened last week on a brand new unit that had under 10 hours of use. Using my current approach where I increment, then write, the flash memory should last: ( ( 128 kbytes per sector ) / (256 bits) ) * ( 10 seconds / write) * (10,000 flash cycle endurance) = 12 years to wear out. And that's only if the device is powered on 24 hours a day, which it isn't.

3- I have changed my WRHIGHFREQ setting to 0x2 and issued a firmware update to all our users. No issues yet but it has only been 5 days. This issue is unpredictable. It has happened to me personally 2 or 3 times over the last 4 months and only 4 times to the 40 units we have in the field.

4- It only happens to the sector where I am saving state data.

5- Yes, I can erase the sector using the CubeProgrammer or by implementing an erase feature in the BusFault Handler, which clears the error. This is not an acceptable option though - important user data is stored in this sector and it should not be getting corrupted after less than 10 hours of use.

6- I am using HAL_FLASHEx_Erase() to perform the erase procedure. Reviewing the code, this seems to follow all the steps in 4.3.10 for the flash sector erase process. I use FLASH_VOLTAGE_RANGE_4 when this step is executed.
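
The wear-out estimate in the reply above can be reproduced with integer arithmetic. This is a sketch using the figures quoted in the post (128 KB sector, 32-byte records, one write every 10 seconds, 10,000 erase cycles). The function names are hypothetical, and the model assumes one sector erase per full pass through the sector.

```c
#include <assert.h>
#include <stdint.h>

/* Number of fixed-size records that fit in one sector. */
static uint64_t slots_per_sector(uint64_t sector_bytes, uint64_t slot_bytes)
{
    assert(slot_bytes > 0u);
    return sector_bytes / slot_bytes;
}

/* Seconds of continuous operation before the sector reaches its rated
 * erase-cycle count, assuming one erase each time the sector fills. */
static uint64_t seconds_to_wear_out(uint64_t sector_bytes,
                                    uint64_t slot_bytes,
                                    uint64_t seconds_per_write,
                                    uint64_t erase_cycles)
{
    return slots_per_sector(sector_bytes, slot_bytes)
           * seconds_per_write
           * erase_cycles;
}
```

With the quoted figures this gives 4,096 records per sector and 409,600,000 seconds of continuous operation, which is just under 13 years and in line with the rough estimate in the post. Wear-out is therefore an unlikely explanation for failures seen after under 10 hours of use.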

jaakjensen · ‎2022-09-04

One of the units was updated to the latest firmware today (with the WRHIGHFREQ set to 0x2). It was working great for a number of hours, but then after a power cycle the “state” memory threw ECC codes and then the unit failed to boot. Erasing the sector allowed it to boot again - but this isn’t a viable solution.

I am still struggling to figure out why this keeps happening.