Flash corruption during manufacturing of a battery-operated device

Kpodu.1 · 2023-03-01

Hi All,

The MCU we are using is the STM32G0B1CE.

We are manufacturing a battery-operated device. Recently, some of the devices have started getting bricked. The electrical inspection looks good. When we extracted the complete binary from a bricked device, we found a double word of zeros at a random location in the code section. We have observed the same problem on 4 devices.

We have an NMI handler in place to handle flash corruption (ECC errors on a double word). In the handler, we write a double word of zeros to the corrupted flash address, which we get from the FLASH_ECCR register.

We should have added a condition to stop the NMI handler from writing to the code area. But my question is: how can the code area be getting corrupted in the first place?
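A minimal sketch of the kind of guard meant here, assuming the handler should only ever overwrite a dedicated data area and never code. The names `DATA_REGION_START`, `DATA_REGION_END` and `isInvalidatableAddress`, and the address values, are hypothetical and not taken from our firmware; the real bounds would come from the linker script.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bounds of the flash area that holds data, not code.
   Assumed values for illustration only; take the real ones from the linker script. */
#define DATA_REGION_START 0x08069000UL
#define DATA_REGION_END   0x08070000UL  /* exclusive */

/* Returns true only if the address is double-word aligned and the whole
   8-byte double word lies inside the data region. */
static bool isInvalidatableAddress(uint32_t address)
{
    if ((address & 0x7U) != 0U) {
        return false;   /* not double-word aligned */
    }
    if (address < DATA_REGION_START) {
        return false;   /* below the data region, for example in code */
    }
    return (address <= (DATA_REGION_END - 8UL));
}
```

In the handler, the zero write would then run only when `isInvalidatableAddress(badAddress)` is true; any other address would go straight to the reset path without touching flash.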

Although the device is battery operated and can be charged over USB, it is failing at a station where the battery was assembled long before and has a good voltage. No writes to the code area (no flashing, no device firmware update) are taking place when the issue is reported.

Any insights into why this is happening? We have almost run out of ideas.

Our NMI handler:

/* Forward declaration of the helper defined below */
static int eraseCorruptedFlashAddress(uint32_t address);

void NMI_Handler(void)
{
  /* USER CODE BEGIN NonMaskableInt_IRQn 0 */
  if (__HAL_FLASH_GET_FLAG(FLASH_FLAG_ECCD1) || __HAL_FLASH_GET_FLAG(FLASH_FLAG_ECCD2))
  {
    uint32_t badAddress = 0x08069000;

    /* Check if NMI is due to flash ECCD (error detection) */
    if (__HAL_FLASH_GET_FLAG(FLASH_FLAG_ECCD1))
    {
      /* Calculate the bad address; ADDR_ECC contains the double-word offset */
      if (READ_BIT(FLASH->OPTR, OB_USER_DUALBANK_SWAP_DISABLE) != 0)
      {
        badAddress = FLASH_BASE_1 + ((FLASH->ECCR & FLASH_ECCR_ADDR_ECC) * 8);
      }
      else
      {
        badAddress = FLASH_BASE_2 + ((FLASH->ECCR & FLASH_ECCR_ADDR_ECC) * 8);
      }

      /* Clearing the flag anyway. If deletion failed it will be set again */
      __HAL_FLASH_CLEAR_FLAG(FLASH_FLAG_ECCD1);
    }
    else
    {
      /* Calculate the bad address; ADDR_ECC contains the double-word offset */
      if (READ_BIT(FLASH->OPTR, OB_USER_DUALBANK_SWAP_DISABLE) != 0)
      {
        badAddress = FLASH_BASE_2 + ((FLASH->ECC2R & FLASH_ECC2R_ADDR_ECC) * 8);
      }
      else
      {
        badAddress = FLASH_BASE_1 + ((FLASH->ECC2R & FLASH_ECC2R_ADDR_ECC) * 8);
      }

      /* Clearing the flag anyway. If deletion failed it will be set again */
      __HAL_FLASH_CLEAR_FLAG(FLASH_FLAG_ECCD2);
    }

    /* Delete the corrupted flash address */
    if (eraseCorruptedFlashAddress((uint32_t)badAddress) == HAL_OK)
    {
      /* Resume execution if deletion succeeds */
      return;
    }
    /* If we do not succeed to delete the corrupted flash address */
    /* This might be because we try to write 0 at a line already considered at 0 which is a forbidden operation */
    /* This problem triggers PROGERR, PGAERR and PGSERR flags */
    else
    {
      /* We check if the flags concerned have been triggered */
      if ((__HAL_FLASH_GET_FLAG(FLASH_FLAG_PROGERR)) && (__HAL_FLASH_GET_FLAG(FLASH_FLAG_PGAERR))
          && (__HAL_FLASH_GET_FLAG(FLASH_FLAG_PGSERR)))
      {
        /* If yes, we clear them */
        __HAL_FLASH_CLEAR_FLAG(FLASH_FLAG_PROGERR);
        __HAL_FLASH_CLEAR_FLAG(FLASH_FLAG_PGAERR);
        __HAL_FLASH_CLEAR_FLAG(FLASH_FLAG_PGSERR);

        /* And we exit from NMI without doing anything */
        /* We do not invalidate that line because it is not programmable at 0 till the next page erase */
        /* The only consequence is that this line will trigger a new NMI later */
        return;
      }
    }
  }

  /* Go to infinite loop/reboot when NMI occurs in case:
     - ECCD is raised in eeprom emulation flash pages but corrupted flash address deletion fails (except PROGERR, PGAERR and PGSERR)
     - no ECCD is raised */

  /* reboot the MCU */
  HAL_NVIC_SystemReset();
  /* USER CODE END NonMaskableInt_IRQn 0 */

  /* USER CODE BEGIN NonMaskableInt_IRQn 1 */
  while (1)
  {
  }
  /* USER CODE END NonMaskableInt_IRQn 1 */
}

static int eraseCorruptedFlashAddress(uint32_t address)
{
  uint64_t data = 0U; /* All-zero double word written over the corrupted location */
  HAL_StatusTypeDef status;

  HAL_FLASH_Unlock();
  status = HAL_FLASH_Program(FLASH_TYPEPROGRAM_DOUBLEWORD, address, data);
  HAL_FLASH_Lock();

  return status;
}

Danish1 · 2023-03-01

Do you have protection against brown-out?

By that, I mean something that prevents execution (holds the STM32 in reset) during power-up and power-down. It could be an external power-supply-monitor chip, or the STM32's internal brown-out circuit enabled at a level appropriate to the assumptions you make about Vdd (processor speed, flash wait states, flash write size).

If not, it is possible that the STM32 will get a flash read error during a brown-out even though the memory location would read correctly with Vdd in the correct range.

As your "corrective" action under those circumstances is to erase that memory location, you might end up erasing what was a perfectly good location of memory.

You might also see such problems if you do not have adequate power-supply-decoupling; I reckon on one 0.1 uF ceramic capacitor physically and electrically close to each Vdd, Vss pair, as well as larger decoupling which need not be as close.

One additional thought: many STM32 parts will refuse to reprogram a double word of flash that has already been programmed, even if what you're writing is all zeroes, because the corresponding error-correction bits won't be all zeroes. Have you checked whether your STM32 is one of those? If so, that makes my theory about a temporary failure due to brown-out less likely.

Imen.D · 2023-03-02

Hello @Kpodu.1 ,

The issue does not seem to be related to the product, but rather to your use case.

It would help to understand your application in more detail, including how you reproduced this behavior.

Imen


Kpodu.1 · 2023-03-05

Thanks @Danish for your reply.

We do not use brown-out protection. We are trying to reproduce the issue by repeatedly shutting down and waking up the device, to see whether a brown-out read error triggers the NMI handler. We verified the power-supply decoupling with our EE and it looks good.
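For this kind of reproduction run, a small record of each NMI kept in RAM may help show whether the handler fired and at which address. The following is only a sketch: `NmiLog`, `nmiLogRecord` and `NMI_LOG_MAGIC` are hypothetical names, and keeping the record across a reset assumes it is placed in a RAM section that the startup code does not zero, which depends on the linker script. It would not survive a loss of power.

```c
#include <stdint.h>

#define NMI_LOG_MAGIC 0x4E4D494CUL  /* arbitrary marker, hypothetical */

typedef struct {
    uint32_t magic;        /* NMI_LOG_MAGIC once the record has been initialised */
    uint32_t count;        /* number of ECC NMIs seen since initialisation */
    uint32_t lastAddress;  /* most recent faulting flash address */
} NmiLog;

/* Intended to be called from the NMI handler before any flash write or reset. */
static void nmiLogRecord(NmiLog *log, uint32_t badAddress)
{
    if (log->magic != NMI_LOG_MAGIC) {
        /* First use, or RAM content was lost: start a fresh record. */
        log->magic = NMI_LOG_MAGIC;
        log->count = 0U;
    }
    log->count++;
    log->lastAddress = badAddress;
}
```

After a reset, the application could read the record and report `count` and `lastAddress` over whatever debug channel is available.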

Thanks @Imen DAHMEN for your reply.

One data point common to all these units is that they all failed at the same point. At the end of the last-but-one station, we put the device into what we call shipmode (explained below). All these units entered shipmode successfully. But at the next station, when USB is connected, which should wake up the MCU, no USB port appeared, indicating that the device was bricked.

The purpose of the shipmode command is to put the MCU in a very low power mode so that the battery does not discharge over a long shelf life. Entering shipmode involves a sudden disconnect of the battery/power supply to the MCU. No RDP or memory protections are enabled on this chip.

WojtekP1 · 2023-03-06

Congratulations :) You did the same as a coworker at my recent company - his remote needed to store a sequence number in flash, and he just used one page. When the page was full, he erased it and started from the beginning.

With a poor battery, the flash erase - which is a high-power operation - caused a voltage drop and a reset, leaving the page partially or completely erased and the sequence numbers lost. The remote stopped working until you repeated the bonding procedure with a new battery.

I'm not sure if this is the right track (I won't try to analyze HAL-based code), but just check it.

With a weak battery, your device is most likely to fail/get a BOR when doing a flash erase.

Write all your code to assume this will happen.
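One common way to write code under that assumption is to keep two copies of the record and always rewrite the older one, so that a reset in the middle of an erase or write still leaves one valid copy. A rough sketch follows, with the flash pages simulated by a plain array. All names (`Record`, `CHECK_SEED`, `newestSlot`, `storeValue`) are hypothetical, the XOR check word stands in for a real CRC, and sequence-number wraparound is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant so that erased (all 0xFF) or zeroed slots never validate. */
#define CHECK_SEED 0xA5A55A5AUL

typedef struct {
    uint32_t seq;    /* increases with every stored record */
    uint32_t value;  /* payload, for example the remote's sequence number */
    uint32_t check;  /* simple integrity word; a real design would likely use a CRC */
} Record;

static uint32_t recordCheck(const Record *r)
{
    return r->seq ^ r->value ^ (uint32_t)CHECK_SEED;
}

static bool recordValid(const Record *r)
{
    return r->check == recordCheck(r);
}

/* Returns the index (0 or 1) of the newest valid slot, or -1 if neither is valid. */
static int newestSlot(const Record slots[2])
{
    bool v0 = recordValid(&slots[0]);
    bool v1 = recordValid(&slots[1]);
    if (v0 && v1) {
        return (slots[1].seq > slots[0].seq) ? 1 : 0;
    }
    if (v0) { return 0; }
    if (v1) { return 1; }
    return -1;
}

/* Stores a new value in the slot that does NOT hold the newest record,
   so the previous record stays intact if power fails mid-write. */
static void storeValue(Record slots[2], uint32_t value)
{
    int newest = newestSlot(slots);
    int target = (newest == 0) ? 1 : 0;
    Record r;
    r.seq = (newest < 0) ? 1U : slots[newest].seq + 1U;
    r.value = value;
    r.check = recordCheck(&r);
    slots[target] = r;  /* on real hardware: erase the target page, then program it */
}
```

At boot, the firmware would call `newestSlot()` and use that record; if the last write was interrupted, the half-written slot fails the check and the previous value is used instead.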