Bus stall on STM32F767 during mass flash erase in dual bank mode

Mike Thompson · ‎2023-08-19

At my company we are utilizing several STM32F767 and STM32L4P5 devices within a complex embedded product. Our firmware environment is fairly complex, with an RTOS, a lot of peripherals activated, and IRQ handling going on during operation.

We wish to take advantage of the read-while-write (RWW) functionality of the dual-bank flash architecture in these devices for firmware upgrades. Although the details of bank swapping differ between the F767 and L4P5 devices, our general strategy of running in one bank without bus stalls while programming the other should work according to the documentation.

This strategy seems to work fine for the L4P5 devices, but I'm running into problems with the F767 devices. On the F767, with the firmware running entirely in bank 1 and performing a mass erase on bank 2, I'm getting bus stalls that block operation of the firmware at random times during the mass erase. This seems counter to the documentation in the F767 reference manual and AN4826. As far as I can tell, the only thing that might cause this is an attempt by other parts of the firmware to read from bank 2 flash or write to the FLASH_CR register during the mass erase. I'm 99.9% sure such errant access is not occurring.

Is the read-while-write functionality known to have such a problem in STM32F767 devices while operating in dual-bank mode? Or are there other operations my firmware might be doing that cause bus stalls during mass erases?

The workarounds to this issue aren't particularly pretty, so it's important for us to understand why the F767 device isn't working for us in the intended manner.

Thanks,

Mike

Mike Thompson · ‎2023-08-24

I believe I found a solution to my problem. The clue came from the following posting six years ago in these forums.

https://community.st.com/t5/stm32-mcu-products/stm32f7-dual-bank-flash-erase-stall/m-p/334948

The manipulation of the SCB->ITCMCR register as described in that posting does indeed make the stalls go away. What bothers me is that I don't fully understand why.

My working code that erases the non-executing bank now looks as follows:

// Mass erase the bank that is not currently running code.
// Returns 0 on success, -1 on failure.
int fw_flash_erase(void)
{
  HAL_StatusTypeDef status = HAL_OK;

  // Determine the non-executing flash bank to be erased.
  uint32_t bank_to_erase = LL_SYSCFG_GetFlashBankMode() == LL_SYSCFG_BANKMODE_BANK1 ?
                           FLASH_BANK_2 : FLASH_BANK_1;

  // We need to disable the instruction tightly coupled memory controller while
  // erasing the non-executing flash bank. The specific need for this is unclear,
  // but for now it seems to work. This will need to be further investigated and tested.
  uint32_t scb_itcmcr = SCB->ITCMCR;
  SCB->ITCMCR &= ~SCB_ITCMCR_EN_Msk;
  // Make sure the register write has completed before refetching instructions.
  __DSB();
  __ISB();

  // Unlock to enable the flash control register access.
  if ((status = HAL_FLASH_Unlock()) == HAL_OK)
  {
    uint32_t sector_erase_error = 0U;

    // Erase the non-executing flash bank.
    FLASH_EraseInitTypeDef erase_init = { 0 };
    erase_init.TypeErase = FLASH_TYPEERASE_MASSERASE;
    erase_init.Banks = bank_to_erase;
    erase_init.VoltageRange = FLASH_VOLTAGE_RANGE_3;
    // Keep the HAL status as-is; it is converted to 0 / -1 only on return.
    status = HAL_FLASHEx_Erase(&erase_init, &sector_erase_error);
  }

  // Lock to disable the flash control register access.
  HAL_FLASH_Lock();

  // Restore the previous tightly coupled memory controller setting.
  SCB->ITCMCR = scb_itcmcr;
  __DSB();
  __ISB();

  // Was there a mass erase error?
  if (status != HAL_OK)
  {
    // Log the error.
    SYSLOG_PRINTF(SYSLOG_CRITICAL, "FLASH: Error erasing flash bank %lu",
                  (unsigned long) bank_to_erase);
  }

  return status == HAL_OK ? 0 : -1;
}

I'll be testing this over the coming weeks to make sure it does indeed solve the issue. I would be interested if someone were able to explain why disabling the instruction tightly coupled memory controller seems to be the solution. I believe the issue is related to mispredicted branch processing causing read access to the non-executing bank, as described here:

https://developer.arm.com/documentation/ka001175/latest

I tried disabling the instruction cache, but it had no effect. Disabling the ITCM via the SCB->ITCMCR register, however, did the trick.
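The save/disable/restore handling of SCB->ITCMCR described above can be pulled out into two small helpers. This is only a sketch under stated assumptions: the helper names `itcm_disable` and `itcm_restore` and the `ITCM_SYNC` hook are hypothetical and not part of CMSIS or the HAL. The register is passed in as a pointer so the logic can be checked on a host machine; on the target one would pass `&SCB->ITCMCR` and define `ITCM_SYNC()` as `__DSB(); __ISB();`.

```c
#include <stdint.h>

/* ITCMCR.EN is bit 0 (SCB_ITCMCR_EN_Msk in CMSIS). */
#define ITCMCR_EN_MASK 0x1u

/* Hypothetical hook: on the target, define this as __DSB(); __ISB();
   before including this code. On a host build it is a no-op. */
#ifndef ITCM_SYNC
#define ITCM_SYNC() ((void)0)
#endif

/* Clear the EN bit and return the previous register value so the caller
   can restore it later. All other bits are left untouched. */
static inline uint32_t itcm_disable(volatile uint32_t *itcmcr)
{
    uint32_t saved = *itcmcr;
    *itcmcr = saved & ~ITCMCR_EN_MASK;
    ITCM_SYNC();
    return saved;
}

/* Write back the value previously returned by itcm_disable(). */
static inline void itcm_restore(volatile uint32_t *itcmcr, uint32_t saved)
{
    *itcmcr = saved;
    ITCM_SYNC();
}
```

On the target the erase would then be bracketed as `uint32_t saved = itcm_disable(&SCB->ITCMCR); ... itcm_restore(&SCB->ITCMCR, saved);`. Restoring the saved value rather than unconditionally setting EN keeps the register as it was if the ITCM was already disabled. Presumably this is only safe if nothing needs to run from ITCM while the bit is cleared.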

Mike

Pavel A. · ‎2023-08-20

While you're waiting for more useful answers... this is a long shot, but have you considered the effect of caches and speculative access to the erased bank? Try to disable all access to that bank via the MPU and flush the caches.

STM32L4 is a CM4, so it is free of these issues.
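For the MPU suggestion above, the region attribute word for a "no access, no execute" region can be sketched as plain bit arithmetic. This is a minimal sketch, not a tested configuration: the function names are hypothetical, the field positions are the architectural ARMv7-M MPU_RASR layout, and the bank 2 base addresses mentioned in the comment are assumptions for a 2 MB part in dual-bank mode that should be checked against RM0410.

```c
#include <stdint.h>

/* ARMv7-M MPU_RASR field positions. */
#define MPU_RASR_ENABLE    (1u << 0)   /* region enable           */
#define MPU_RASR_SIZE_POS  1u          /* SIZE field, bits [5:1]  */
#define MPU_RASR_AP_POS    24u         /* AP field, bits [26:24]  */
#define MPU_RASR_XN        (1u << 28)  /* execute never           */
#define MPU_AP_NO_ACCESS   0u          /* no access at any level  */

/* Encode the SIZE field: the region size is 2^(SIZE+1) bytes, minimum 32.
   Returns 0 (a reserved encoding) if size_bytes is not a power of two
   or is smaller than 32 bytes. */
uint32_t mpu_size_field(uint32_t size_bytes)
{
    if (size_bytes < 32u || (size_bytes & (size_bytes - 1u)) != 0u) {
        return 0u;
    }
    uint32_t exponent = 0u;
    while ((size_bytes >> exponent) != 1u) {
        ++exponent;
    }
    return exponent - 1u;
}

/* Build a RASR value for an enabled region with no access and XN set.
   Assumed use: a 1 MB region at 0x08100000 (AXIM) and, if needed, a
   second one at 0x00300000 (ITCM alias) for bank 2. Returns 0, i.e. a
   disabled region, if the size cannot be encoded. */
uint32_t mpu_rasr_no_access(uint32_t size_bytes)
{
    uint32_t size = mpu_size_field(size_bytes);
    if (size == 0u) {
        return 0u;
    }
    return MPU_RASR_XN
         | (MPU_AP_NO_ACCESS << MPU_RASR_AP_POS)
         | (size << MPU_RASR_SIZE_POS)
         | MPU_RASR_ENABLE;
}
```

With the STM32 HAL, the same settings would normally go through `HAL_MPU_ConfigRegion()` rather than a hand-built register value; the arithmetic here just shows what ends up in the register. A no-access region turns any real access into a fault, so a stray read from the bank being erased would at least become visible.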

Mike Thompson · ‎2023-08-21

Thank you for the advice. I'll take a look in that direction. I'm slowly eliminating all of the other processing going on while the mass erase is in progress, but so far that hasn't yielded anything. I hadn't considered that caching might play a role, as the rest of the code has no reason to access the address space of that bank.

I haven't had a reason to mess with the MPU of the M7, but I'll start looking at that. I was poring over the STM32F7 reference manual for clues and hadn't yet considered the M7 core features being the root cause.

Mike Thompson · ‎2023-08-21

Looking into this deeper, it appears to be related to the use of the Keil RTX RTOS (via the CMSIS RTOS2 APIs) in the firmware. With dual banks configured, I can mass erase the non-executing bank without stalls all the way up to the call to osKernelStart(). However, once the kernel is started and the first thing done in the initial thread is to mass erase the non-executing bank, the erase will stall the MCU. I'm trying to understand what the RTOS kernel might be doing that affects whether the MCU stalls or not.

I did look into potential cache issues and configuring the MPU to protect the bank being erased, but it didn't seem to make a difference. My understanding of this area is a bit limited so I'll continue to pursue this path as well.

FBL · ‎2023-08-23

Hello @Mike Thompson ,

Thank you for your feedback, this is quite interesting.

Could you confirm that it is related to CMSIS RTOS? Have you tried it without using an RTOS?

To give better visibility on the answered topics, please click on Accept as Solution on the reply which solved your issue or answered your question.

Mike Thompson · ‎2023-08-24

I was able to confirm it was related to the CMSIS RTOS, at least in the sense that I could program the non-executing bank without stalls after the RTOS was initialized. However, as soon as I called osKernelStart() to start the initial thread (which also included starting the timer and idle threads), I would get stalls when programming the non-executing bank. I believe the issue isn't so much with the RTOS itself as with the processor now being much busier handling the timer interrupts and context switching.
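One way to put numbers on stalls like these, rather than judging by behavior alone, is to record a free-running counter from a periodic interrupt and look at the largest gap between consecutive samples. The following is a minimal sketch assuming the timestamps come from a 32-bit wrapping counter such as DWT->CYCCNT; the function name `max_tick_gap` and the idea of logging into an array are hypothetical and not taken from this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the largest difference between consecutive timestamps.
   The subtraction is done modulo 2^32, so a single counter wrap
   between two samples is handled correctly. Returns 0 when fewer
   than two samples are available. */
uint32_t max_tick_gap(const uint32_t *stamps, size_t count)
{
    uint32_t worst = 0u;
    for (size_t i = 1; i < count; ++i) {
        uint32_t gap = stamps[i] - stamps[i - 1];
        if (gap > worst) {
            worst = gap;
        }
    }
    return worst;
}
```

If the interrupt normally fires every N cycles, a worst-case gap far above N during the erase would point to the core being held off the bus, while a gap close to N would suggest the interrupt itself kept running on schedule.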

ValFerrs86 · ‎2024-05-31

I've configured the board in dual-bank mode, and my application is running on FreeRTOS.

I use the interrupt mode for upgrading the other flash bank (RWW).

For example, once I start the erase of one sector (of the bank not used by the application), the CPU is stalled until the EOC interrupt is triggered. After that, the CPU works just fine, but that is not the intended behavior. If I perform a mass erase, the CPU is stalled until the EOC interrupt is triggered (a longer time, since the entire bank is erased).

"While executing a program code from bank 1, it is possible to perform an erase operation on bank 2 (and vice versa)." from RM0410.

I also tried to set MPU attributes for bank 2 to block read access to it (which, by the way, the program never performs), but the CPU is always stalled until EOC is received. Erase and programming complete correctly as intended.

I also found no information on this for FreeRTOS.

Does somebody have an idea why, when using FreeRTOS, the erase and programming phases of the flash in dual-bank mode seem to stall the CPU despite the read-while-write feature?

Thank you.

FBL · ‎2024-06-05

Thank you, @Mike Thompson , for your feedback.

The processor may prefetch instructions from memory regions, including flash memory, to improve performance. Because of this speculation feature, which is specific to the CM7, the processor may attempt to access bank 2 while it is still being erased. Check PM0253, section 4.9.1, "Instruction and data tightly-coupled memory control registers".

Check also AN4826 (st.com).