2024-09-24 10:34 AM
Overview
At my job, I was recently tasked with handling ECC errors in both RAM and flash memory. Additionally, I needed to test the error handling by injecting ECC errors. Figuring out how to do this was a nightmare. I looked over AN5342 countless times, but it was pretty vague. I finally figured out how to do it, and wanted to make this post to help anyone else in the same situation.
This post will cover how I set up callbacks for both RAM and flash ECC, as well as how to trigger ECC errors. I am sure there are better ways to do this, and am open to suggestions.
Flash ECC Error Handling
1. In the .IOC file, go to System Core -> NVIC and enable the flash global interrupt.
2. Navigate to Core -> Inc -> stm32h7xx_hal_conf.h and add:
#define USE_FLASH_ECC 1U
3. I found that when a double bit ECC error occurs in flash, the hard fault handler is called instead of the flash IRQ handler. To get around this, I added a check in the hard fault handler to see if a double bit error occurred in flash:
void HardFault_Handler(void)
{
    /* USER CODE BEGIN HardFault_IRQn 0 */
    // Check if a double-bit ECC error has occurred in either flash bank
    if ((FLASH->SR1 & FLASH_SR_DBECCERR) || (FLASH->SR2 & FLASH_SR_DBECCERR)) {
        FLASH_IRQHandler();
    }
    /* USER CODE END HardFault_IRQn 0 */
    while (1)
    {
        /* USER CODE BEGIN W1_HardFault_IRQn 0 */
        /* USER CODE END W1_HardFault_IRQn 0 */
    }
}
4. To enable flash ECC interrupts, I added the following function in main.c:
void init_flash_ecc(void) {
    HAL_FLASH_Unlock();
    // Address is a global (uint32_t) declared elsewhere in main.c and used by the test functions below
    Address = FLASH_USER_START_ADDR;
    HAL_NVIC_SetPriority(FLASH_IRQn, 0, 0);
    HAL_NVIC_EnableIRQ(FLASH_IRQn);
    HAL_FLASHEx_EnableEccCorrectionInterrupt();
    HAL_FLASHEx_EnableEccDetectionInterrupt();
    HAL_FLASH_Lock();
}
5. To define the ECC callbacks, I added the following functions in main.c:
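// Called on a single-bit (corrected) flash ECC error
void HAL_FLASHEx_EccCorrectionCallback(void) {
    HAL_GPIO_WritePin(LD2_GPIO_Port, LD2_Pin, GPIO_PIN_SET);
}
// Called on a double-bit (detected, uncorrectable) flash ECC error
void HAL_FLASHEx_EccDetectionCallback(void) {
    HAL_GPIO_WritePin(LD1_GPIO_Port, LD1_Pin, GPIO_PIN_SET);
}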
Flash ECC Testing
1. Call the init_flash_ecc function defined above.
2. Erase user flash
3. To cause a single bit error, I use the following function and write data:
uint64_t SingleErrorA[4] = { 0xCCCCCCCCCCCCCCCC,
0xCCCCCCCCCCCCCCCC,
0xCCCCCCCCCCCCCCCC,
0xCCCCCCCCCCCCCCC0
};
uint64_t SingleErrorB[4] = { 0xCCCCCCCCCCCCCCCC,
0xCCCCCCCCCCCCCCCC,
0xCCCCCCCCCCCCCCCC,
0xCCCCCCCCCCCCCCC1
};
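void cause_flash_single_error(void) {
    HAL_FLASH_Unlock();
    // Program the same flash word twice without an erase in between
    if (HAL_FLASH_Program(FLASH_TYPEPROGRAM_FLASHWORD, Address, ((uint32_t) SingleErrorB)) != HAL_OK) {
        Error_Handler();
    }
    if (HAL_FLASH_Program(FLASH_TYPEPROGRAM_FLASHWORD, Address, ((uint32_t) SingleErrorA)) != HAL_OK) {
        Error_Handler();
    }
    // Read the flash word back; the read is what triggers the ECC check.
    // volatile keeps the compiler from dropping the otherwise unused reads.
    volatile uint64_t readData[4];
    for (int i = 0; i < 4; i++) {
        readData[i] = *((volatile uint64_t*) (Address + i * 8)); // Read 64 bits at a time
    }
    (void) readData;
    HAL_FLASH_Lock();
}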
4. To cause a double bit error, I use the following function and write data:
uint64_t DoubleErrorA[4] = { 0xCCCCCCCCCCCCCCCC,
0xCCCCCCCCCCCCCCCC,
0xCCCCCCCCCCCCCCCC,
0xCCCCCCCCCCCCCCCB
};
uint64_t DoubleErrorB[4] = { 0xCCCCCCCCCCCCCCCC,
0xCCCCCCCCCCCCCCCC,
0xCCCCCCCCCCCCCCCC,
0xCCCCCCCCCCCCCCCC
};
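void cause_flash_double_error(void) {
    HAL_FLASH_Unlock();
    // Program the same flash word twice without an erase in between
    if (HAL_FLASH_Program(FLASH_TYPEPROGRAM_FLASHWORD, Address, ((uint32_t) DoubleErrorB)) != HAL_OK) {
        Error_Handler();
    }
    if (HAL_FLASH_Program(FLASH_TYPEPROGRAM_FLASHWORD, Address, ((uint32_t) DoubleErrorA)) != HAL_OK) {
        Error_Handler();
    }
    // Read the flash word back; the read is what triggers the ECC check.
    // volatile keeps the compiler from dropping the otherwise unused reads.
    volatile uint64_t readData[4];
    for (int i = 0; i < 4; i++) {
        readData[i] = *((volatile uint64_t*) (Address + i * 8)); // Read 64 bits at a time
    }
    (void) readData;
    HAL_FLASH_Lock();
}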
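To see what the two back-to-back writes leave in the data bits, here is a small host-side sketch (it runs on a PC, not on the target). It assumes that programming flash without an erase can only clear bits, so the stored data ends up as the bitwise AND of both writes. The helper names overlay_program, overlay_flash_word and bit_distance are made up for this illustration. It does not model the ECC bits themselves, so it cannot predict whether the hardware reports a single or a double error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in a 64-bit value. */
static unsigned popcount64(uint64_t v)
{
    unsigned n = 0;
    while (v != 0) {
        v &= v - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Model of programming `second` over `first` with no erase in between.
 * ASSUMPTION: programming can only clear bits (1 -> 0), so the stored
 * value is the bitwise AND of both writes. */
static uint64_t overlay_program(uint64_t first, uint64_t second)
{
    return first & second;
}

/* Apply the same model to a whole 256-bit flash word (4 x 64 bits). */
static void overlay_flash_word(const uint64_t first[4],
                               const uint64_t second[4],
                               uint64_t out[4])
{
    assert(first != NULL && second != NULL && out != NULL);
    for (size_t i = 0; i < 4; i++) {
        out[i] = overlay_program(first[i], second[i]);
    }
}

/* Number of bit positions in which a and b differ. */
static unsigned bit_distance(uint64_t a, uint64_t b)
{
    return popcount64(a ^ b);
}
```

With the patterns above, the single-error pair (…C1 then …C0) leaves …C0 in the data bits, which is one bit away from the first write. The double-error pair (…CC then …CB) leaves …C8, which is one bit away from the first write and two bits away from the second.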
RAM ECC Error Handling
1. In the .IOC file, go to System Core -> RAMECC. Check the boxes next to each region of RAM you want to monitor.
2. In main.c, navigate to the auto-generated function MX_RAMECC_Init. At the top of this function, I initialize all monitored regions of RAM by writing 0 to them. At the bottom, I added the following lines to enable the RAM ECC IRQ:
HAL_NVIC_SetPriority(ECC_IRQn, 0, 0);
HAL_NVIC_EnableIRQ(ECC_IRQn);
3. Enable notifications and start monitoring for each RAMECC handle. I did it using the following function in main.c:
void enable_ramecc_monitor_notifications(RAMECC_HandleTypeDef *hramecc) {
    if (HAL_RAMECC_EnableNotification(hramecc, (RAMECC_IT_MONITOR_SINGLEERR_R | RAMECC_IT_MONITOR_DOUBLEERR_R)) != HAL_OK) {
        Error_Handler();
    }
    if (HAL_RAMECC_StartMonitor(hramecc) != HAL_OK) {
        Error_Handler();
    }
}
4. Add the following callback in main.c:
void HAL_RAMECC_DetectErrorCallback(RAMECC_HandleTypeDef *hramecc) {
    // Failing address, kept so it can be inspected in the debugger
    uint32_t FAR;
    FAR = HAL_RAMECC_GetFailingAddress(hramecc);
    (void) FAR;
    if ((HAL_RAMECC_GetRAMECCError(hramecc) & HAL_RAMECC_SINGLEERROR_DETECTED) != 0U) {
        HAL_GPIO_WritePin(LD1_GPIO_Port, LD1_Pin, GPIO_PIN_SET);
    }
    if ((HAL_RAMECC_GetRAMECCError(hramecc) & HAL_RAMECC_DOUBLEERROR_DETECTED) != 0U) {
        HAL_GPIO_WritePin(LD2_GPIO_Port, LD2_Pin, GPIO_PIN_SET);
    }
    // Clear the stored error code so the next error is reported cleanly
    hramecc->RAMECCErrorCode = HAL_RAMECC_NO_ERROR;
    HAL_GPIO_WritePin(LD3_GPIO_Port, LD3_Pin, GPIO_PIN_SET);
}
5. Navigate to Core -> Src -> stm32h7xx_it.c. Add the function ECC_IRQHandler in the user code section, and in it check each enabled RAM ECC monitor for raised flags. This lets the IRQ handler pass the matching ECC handle to the callback. Here is mine:
void ECC_IRQHandler(void)
{
    if (__HAL_RAMECC_GET_FLAG(&hramecc1_m1, RAMECC_FLAGS_ALL)) {
        HAL_RAMECC_IRQHandler(&hramecc1_m1);
    }
    if (__HAL_RAMECC_GET_FLAG(&hramecc1_m2, RAMECC_FLAGS_ALL)) {
        HAL_RAMECC_IRQHandler(&hramecc1_m2);
    }
    if (__HAL_RAMECC_GET_FLAG(&hramecc1_m3, RAMECC_FLAGS_ALL)) {
        HAL_RAMECC_IRQHandler(&hramecc1_m3);
    }
    if (__HAL_RAMECC_GET_FLAG(&hramecc1_m4, RAMECC_FLAGS_ALL)) {
        HAL_RAMECC_IRQHandler(&hramecc1_m4);
    }
    if (__HAL_RAMECC_GET_FLAG(&hramecc1_m5, RAMECC_FLAGS_ALL)) {
        HAL_RAMECC_IRQHandler(&hramecc1_m5);
    }
    if (__HAL_RAMECC_GET_FLAG(&hramecc2_m1, RAMECC_FLAGS_ALL)) {
        HAL_RAMECC_IRQHandler(&hramecc2_m1);
    }
    if (__HAL_RAMECC_GET_FLAG(&hramecc2_m2, RAMECC_FLAGS_ALL)) {
        HAL_RAMECC_IRQHandler(&hramecc2_m2);
    }
    if (__HAL_RAMECC_GET_FLAG(&hramecc2_m3, RAMECC_FLAGS_ALL)) {
        HAL_RAMECC_IRQHandler(&hramecc2_m3);
    }
    if (__HAL_RAMECC_GET_FLAG(&hramecc2_m4, RAMECC_FLAGS_ALL)) {
        HAL_RAMECC_IRQHandler(&hramecc2_m4);
    }
    if (__HAL_RAMECC_GET_FLAG(&hramecc2_m5, RAMECC_FLAGS_ALL)) {
        HAL_RAMECC_IRQHandler(&hramecc2_m5);
    }
    if (__HAL_RAMECC_GET_FLAG(&hramecc3_m1, RAMECC_FLAGS_ALL)) {
        HAL_RAMECC_IRQHandler(&hramecc3_m1);
    }
    if (__HAL_RAMECC_GET_FLAG(&hramecc3_m2, RAMECC_FLAGS_ALL)) {
        HAL_RAMECC_IRQHandler(&hramecc3_m2);
    }
}
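The handler above repeats the same check-then-service pattern for every monitor. One possible way to shorten it is a table of handles walked in a loop. The sketch below only shows that pattern on a PC: monitor_t, monitor_has_flag and monitor_service are hypothetical stand-ins for RAMECC_HandleTypeDef, __HAL_RAMECC_GET_FLAG and HAL_RAMECC_IRQHandler, not the HAL itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for RAMECC_HandleTypeDef. */
typedef struct {
    uint32_t flags;    /* stand-in for the monitor's status flags */
    unsigned serviced; /* how many times this monitor was serviced */
} monitor_t;

/* Stand-in for __HAL_RAMECC_GET_FLAG(h, RAMECC_FLAGS_ALL). */
static int monitor_has_flag(const monitor_t *m)
{
    return m->flags != 0U;
}

/* Stand-in for HAL_RAMECC_IRQHandler(h): service and clear the flags. */
static void monitor_service(monitor_t *m)
{
    m->serviced++;
    m->flags = 0U;
}

/* Walk the table and service only the monitors that raised a flag.
 * Returns the number of monitors serviced. */
static size_t dispatch_ecc_irq(monitor_t *const monitors[], size_t count)
{
    size_t handled = 0;
    assert(monitors != NULL);
    for (size_t i = 0; i < count; i++) {
        if (monitor_has_flag(monitors[i])) {
            monitor_service(monitors[i]);
            handled++;
        }
    }
    return handled;
}
```

On the target, the table would hold the addresses of the generated handles (&hramecc1_m1 and so on), and the loop body would use the real HAL macro and handler. I have not measured whether this changes interrupt latency compared to the unrolled version.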
RAM ECC Testing
1. Step 2 of the RAM ECC error handling section initializes all monitored regions of RAM by writing 0 to them. To cause an ECC error in a specific region of RAM, skip this initialization for the region you want to test.
2. In the main function of main.c, I call a function to read from all regions of RAM. When this function tries to read from the section of RAM I did not initialize in step 1, the ECC callback is triggered.
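The two steps above boil down to a zero-fill helper (skipped for the region under test) and a read-everything helper. Here is a minimal sketch of both. The names ram_region_zero and ram_region_read are made up for this example, and it uses 32-bit accesses as an assumption; the ECC word width differs between RAM regions, so check the reference manual for the region you are testing. The region base addresses and sizes are left out on purpose, since they depend on your part and linker script.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write zero to every 32-bit word in [base, base + words).
 * volatile stops the compiler from removing or merging the writes. */
static void ram_region_zero(volatile uint32_t *base, size_t words)
{
    assert(base != NULL);
    for (size_t i = 0; i < words; i++) {
        base[i] = 0U;
    }
}

/* Read every 32-bit word in [base, base + words).
 * Returns the XOR of all words so the reads have a visible result. */
static uint32_t ram_region_read(const volatile uint32_t *base, size_t words)
{
    uint32_t acc = 0U;
    assert(base != NULL);
    for (size_t i = 0; i < words; i++) {
        acc ^= base[i];
    }
    return acc;
}
```

In the test setup described above, ram_region_zero would be called for every monitored region except the one under test, and ram_region_read would then be called for all of them from main.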
Conclusion
I did the best I could with the time I had to write this guide, so it is not perfect, but my hope is that it can help someone and make this process easier for people scouring the forum to try and find concrete examples of how to implement and test ECC.
-Jared
2024-10-08 05:40 AM
Hey Jared. I've been struggling to trigger a flash ECC error. I followed AN5342 to the best of my ability but couldn't figure it out.
Thanks a lot for your guide. Helped me out a lot.
And yes, a Double Error Detect always triggers the hard fault handler after the memory read, whether I use interrupts or just poll the error flag. A Single Error Correction works just fine.
What I noticed with the memory analyzer is that when you make the second flash write to trigger the DED, it seems to either corrupt the data or lose access to it after the failed ECC check. I am not sure which. In that case, performing a read of the memory triggers the hard fault handler. See attached.
Here is the memory after flash erase.
Memory after the first write
// I am writing 0xCCCCCCC3 at the end
Memory after the second write
// I write 0xCCCCCCC0 to trigger the DED
Anyway, I'm not sure if that is what should be expected or not.
2024-10-08 07:28 AM
Hello,
I'm glad the guide was able to help.
The behavior you are seeing with the memory analyzer aligns with what I was seeing.
Out of curiosity, what sort of application are you working on that requires handling ECC errors?
- Jared
2024-10-10 09:12 AM
It's not a requirement. I'm just curious whether ECC errors occur and, if so, how often in an industrial setting.
2024-10-16 05:44 AM
Hi @snmunters,
We are preparing a new revision of AN5342, focused precisely on triggering ECC errors in order to test the firmware safety features (STL/SIL). If writing the same flash DW twice doesn't work, then doing a reset during flash programming should work.
Regarding error probability in the real world: in flash it's a matter of aging, which is greatly accelerated by high temperature.
For SRAM, we had the case of a small gas meter manufacturer that produced ~500k meters while neglecting the safety standards and had ~100 returns in the first year alone, all of them due to SRAM ECC failures. It was probably even more in the second year, and if 100 were returned, we can only guess how many more failed but were simply thrown away. So it's a serious problem mainly in battery-powered applications, where the SRAM remains active for prolonged durations.
BR,
J
To give better visibility on the answered topics, please click on Accept as Solution on the reply which solved your issue or answered your question.
2024-10-16 05:45 AM
Thanks @20jmorrison ,
do you mind if we incorporate parts of your work here in a future update of AN5342?
BR,
J
2024-10-16 06:00 AM
First off, incredible username and profile picture @Bubbles , lol.
Absolutely, feel free to use any/all parts of it that you see fit.
-Jared
2024-10-16 06:40 AM
Hey Bubbles, Thanks for the clarification!
Perhaps the new version of AN5342 could add an explicit comment that cache ECC cannot be tested manually? A more concise step-by-step guide on how to test flash and RAM ECC would also be a good addition.
Anyway, just my 2 cents.
2024-10-25 01:35 AM
You could say I'm a huge fan, including the actor's musical career.