Flash corruption during manufacturing of a battery operated device
Hi All,
The MCU we are using is the STM32G0B1CE.
We are manufacturing a battery-operated device. Recently, some of the devices have started getting bricked. Electrical inspection looks good. When we extracted the complete binary from a bricked device, we found a double word of zeros at a random location in the code section. We have observed the same problem on 4 devices.
We have an NMI handler in place to handle flash corruption (ECC errors detected on a double word, flagged as ECCD). In the handler, we write a double word of zeros to the corrupted flash address (which we get from the FLASH_ECCR or FLASH_ECC2R register).
We should have added a condition to stop the NMI handler from writing to the code area. But my question is: how can the code area get corrupted in the first place?
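For illustration, here is a minimal sketch of the kind of address check meant above, so that only a dedicated data area can ever be overwritten from the NMI handler. The names EE_EMU_START, EE_EMU_END and isInvalidationAllowed are hypothetical, and the address range is a placeholder for illustration only; the real bounds would have to come from the linker script or the EEPROM emulation configuration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bounds of the data (EEPROM emulation) area.
   Placeholder values for illustration only. */
#define EE_EMU_START 0x08069000UL /* first byte of the data area */
#define EE_EMU_END   0x0806B000UL /* first byte after the data area */

/* Returns true only if the 8-byte double word at 'address' is aligned
   and lies entirely inside the data area. Anything else, including
   the code area, is rejected. */
bool isInvalidationAllowed(uint32_t address)
{
  if ((address & 0x7UL) != 0UL)
  {
    return false; /* not double-word aligned */
  }
  return (address >= EE_EMU_START) && (address <= (EE_EMU_END - 8UL));
}
```

With a check like this, the handler would call the zero-write only when isInvalidationAllowed(badAddress) is true and would fall through to the reset path otherwise, leaving the code area untouched.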
Although the device is battery operated and can be charged over USB, it is failing at a station where the battery was assembled long before and has a good voltage. Also, nothing is writing to the code area when the issue is reported (no flashing and no device firmware update is in progress).
Any insights on why this happens? We have almost run out of ideas.
Our NMI handler:
void NMI_Handler(void)
{
  /* USER CODE BEGIN NonMaskableInt_IRQn 0 */
  if(__HAL_FLASH_GET_FLAG(FLASH_FLAG_ECCD1) || __HAL_FLASH_GET_FLAG(FLASH_FLAG_ECCD2))
  {
    uint32_t badAddress = 0x08069000;
    /* Check if NMI is due to flash ECCD (error detection) */
    if(__HAL_FLASH_GET_FLAG(FLASH_FLAG_ECCD1))
    {
      /* calculate the bad address, ADDR_ECC contains the value of double word offset */
      if (READ_BIT(FLASH->OPTR, OB_USER_DUALBANK_SWAP_DISABLE) != 0)
      {
        badAddress = FLASH_BASE_1 + ((FLASH->ECCR & FLASH_ECCR_ADDR_ECC) * 8);
      }
      else
      {
        badAddress = FLASH_BASE_2 + ((FLASH->ECCR & FLASH_ECCR_ADDR_ECC) * 8);
      }
      /* Clearing the flag anyway. If deletion failed it will be set again*/
      __HAL_FLASH_CLEAR_FLAG(FLASH_FLAG_ECCD1);
    }
    else
    {
      /* calculate the bad address, ADDR_ECC contains the value of double word offset */
      if (READ_BIT(FLASH->OPTR, OB_USER_DUALBANK_SWAP_DISABLE) != 0)
      {
        badAddress = FLASH_BASE_2 + ((FLASH->ECC2R & FLASH_ECC2R_ADDR_ECC) * 8);
      }
      else
      {
        badAddress = FLASH_BASE_1 + ((FLASH->ECC2R & FLASH_ECC2R_ADDR_ECC) * 8);
      }
      /* Clearing the flag anyway. If deletion failed it will be set again*/
      __HAL_FLASH_CLEAR_FLAG(FLASH_FLAG_ECCD2);
    }
    /* Delete the corrupted flash address */
    if (eraseCorruptedFlashAddress((uint32_t)badAddress) == HAL_OK)
    {
      /* Resume execution if deletion succeeds */
      return;
    }
    /* If we do not succeed to delete the corrupted flash address */
    /* This might be because we try to write 0 at a line already considered at 0 which is a forbidden operation */
    /* This problem triggers PROGERR, PGAERR and PGSERR flags */
    else
    {
      /* We check if the flags concerned have been triggered */
      if((__HAL_FLASH_GET_FLAG(FLASH_FLAG_PROGERR)) && (__HAL_FLASH_GET_FLAG(FLASH_FLAG_PGAERR))
         && (__HAL_FLASH_GET_FLAG(FLASH_FLAG_PGSERR)))
      {
        /* If yes, we clear them */
        __HAL_FLASH_CLEAR_FLAG(FLASH_FLAG_PROGERR);
        __HAL_FLASH_CLEAR_FLAG(FLASH_FLAG_PGAERR);
        __HAL_FLASH_CLEAR_FLAG(FLASH_FLAG_PGSERR);
        /* And we exit from NMI without doing anything */
        /* We do not invalidate that line because it is not programmable at 0 till the next page erase */
        /* The only consequence is that this line will trigger a new NMI later */
        return;
      }
    }
  }
  /* Go to infinite loop/reboot when NMI occurs in case:
     - ECCD is raised in eeprom emulation flash pages but corrupted flash address deletion fails (except PROGERR, PGAERR and PGSERR)
     - no ECCD is raised */
  /* reboot the MCU */
  HAL_NVIC_SystemReset();
  /* USER CODE END NonMaskableInt_IRQn 0 */
  /* USER CODE BEGIN NonMaskableInt_IRQn 1 */
  while (1)
  {
  }
  /* USER CODE END NonMaskableInt_IRQn 1 */
}
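/* Despite its name, this does not erase anything: it programs the
   double word at 'address' with zeros to invalidate it. */
static int eraseCorruptedFlashAddress(uint32_t address)
{
  uint64_t data = 0U; /* All zeros marks the double word as invalid; erased flash reads as 0xFF */
  HAL_StatusTypeDef status;
  HAL_FLASH_Unlock();
  status = HAL_FLASH_Program(FLASH_TYPEPROGRAM_DOUBLEWORD, address, data);
  HAL_FLASH_Lock();
  return status;
}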
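
The address arithmetic used in the handler can also be checked off-target, for example to compare the location of the zeroed double word in a dumped binary against the offset the ECC register would have reported. This is a minimal sketch, not part of the firmware: eccOffsetToAddress and its addrMask parameter are hypothetical, and the mask passed in should be the FLASH_ECCR_ADDR_ECC (or FLASH_ECC2R_ADDR_ECC) value from the device header.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the handler's calculation:
   the masked register field is a double-word offset, so it is
   multiplied by 8 to get a byte offset from the bank base. */
uint32_t eccOffsetToAddress(uint32_t bankBase, uint32_t eccReg, uint32_t addrMask)
{
  return bankBase + ((eccReg & addrMask) * 8UL);
}
```

For instance, with a bank base of 0x08000000, an assumed mask of 0x3FFF and a register value whose address field is 0x246, the function returns 0x08001230, regardless of any status bits set outside the mask.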