
STM32H755 stalling CPU when erasing bank 1 flash from code running in bank 2 (but not the other way around)

mark03
Associate III

I am implementing a firmware-update mechanism on STM32H755 (Nucleo board, MCU rev U) in which the two flash banks are independent.  New firmware is flashed by erasing the "other" bank (which is always mapped at 0x0810 0000 and accessed with the ____2 register set, e.g. FLASH_CR2), then writing the candidate firmware image into that bank, verifying a SHA hash after programming, and finally toggling the SWAP_BANK bit in option bytes, performing an option-byte update, and software reset to boot into the new image.  This works, and I can perform one update after another, ping-ponging back and forth between the banks.  Note that the running application is always mapped at 0x0800 0000, is [much] less than 1 MByte, and never touches the other bank at 0x0810 0000 except to erase and program it when an update is requested.

The issue I am having is that when the code runs in Bank 2 and an update commences, the bank erase of Bank 1 is stalling the CPU for several seconds.  The same thing does *not* happen when the code is running in Bank 1 and performs a bank erase on Bank 2.  Also, the actual programming never stalls the CPU---only the initial bank erase, and only when it is Bank 1 being erased.

AFAICT, there is no difference at all between these two scenarios---everything should be identical between them, down to the contents of every single register and memory location, *except* for the SWAP_BANK bit in the option bytes.  So why would one behave differently?  (Note, for this test, I am updating with an identical firmware image, so there is no difference pre- and post-update.)

I was able to find corroborating evidence of this unexpected behavior in another forum post, here:

https://community.st.com/t5/stm32-mcus-products/bus-stall-when-writing-erasing-flash-on-stm32h7/td-p/360606 

(Note that the OP of that thread is not the one describing it; it comes later in the thread, from user @MDrew.1, who apparently never got to the bottom of it.)

I know better than to claim a silicon bug :)  although I cannot think of another explanation.  Can you?  Best-case scenario, someone has figured out an undocumented way around this problem.

Why does this matter?  Well, I'd like to use a watchdog timer, and 4+ seconds is a pretty long watchdog expiration.  Also, other RTOS tasks are supposed to be running in the background during the update process, and they do, just fine---when the code is running in Bank 1.

1 ACCEPTED SOLUTION

mark03
Associate III

I'm happy to report that I found the root cause of the bus stall.  And it's not what I hypothesized above.

@mƎALLEm was on the right track, except the issue was not speculative access at all.  (It would be surprising for the CPU to speculatively access system memory, out of the blue, unless something else was reading nearby addresses.)  No, the issue was intentional access...

My ADC driver was reading the VREFINT calibration data at 0x1ff1e860, which I forgot is technically part of Bank 1 flash.  It seems odd that reads from a region you can't erase or program are nevertheless held off by an erase operation in that bank.  Presumably there is a hardware reason for this?  I fell into the trap of thinking of it as a separate ROM area, disconnected from flash.

Setting up the MPU to block system-memory accesses quickly produced a MemManage fault, which pinpointed the problem code.  In my project this is easy to resolve by reading the cal data once during initialization and storing it, instead of accessing it repeatedly.
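A minimal sketch of that fix: copy the calibration word into RAM once at startup, before any update can begin, so no later code path touches system memory while a Bank 1 erase is in progress. Only the address 0x1FF1E860 comes from this thread; the helper names are hypothetical, and the 16-bit width of the calibration value is an assumption to check against the datasheet.

```c
#include <stdint.h>

/* Address from this thread; 16-bit width is an assumption (check the datasheet). */
#define VREFINT_CAL_ADDR_H755 ((const volatile uint16_t *)0x1FF1E860UL)

/* RAM copy of the calibration word (hypothetical helper state). */
static uint16_t vrefint_cal_cached;

/* Call once during ADC init, before any flash erase can be started. */
static void vrefint_cal_cache_init(const volatile uint16_t *cal_addr)
{
    vrefint_cal_cached = *cal_addr;  /* the only read of system memory */
}

/* Safe at any time, including during a bank erase: reads RAM only. */
static uint16_t vrefint_cal_get(void)
{
    return vrefint_cal_cached;
}
```

On target, the init call would be `vrefint_cal_cache_init(VREFINT_CAL_ADDR_H755);`, and the ADC driver would then use `vrefint_cal_get()` everywhere else. Passing the address as a parameter also lets the logic be exercised on a host with a fake value.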

11 REPLIES
mƎALLEm
ST Employee

Hello,

You accepted a solution in this thread: STM32H755 failing to erase bank 1 while running from bank 2 (opposite works fine).  Do you mean your issue is not solved?  If so, it would be better to continue the discussion there.

To give better visibility on the answered topics, please click on "Accept as Solution" on the reply which solved your issue or answered your question.
mark03
Associate III

I don't think code will explain it better than I already have above, but in case someone asks, here is the code that erases the inactive flash bank.  The CPU stall happens at the first write to FLASH_CR2 (setting the BER and START bits), and lasts about four seconds.  Afterwards, the code resumes and everything works as expected.  (The actual flash writes do not stall the CPU.)  And again, the stall doesn't happen at all if we are running in Bank 1 and staging a new image in Bank 2, only when running in Bank 2 and staging the new image in Bank 1.

// Bank-erase the inactive flash (always BANK2, i.e. the bank mapped at 0x08100000)
if (FLASH->CR2 & FLASH_CR_LOCK)
{
  FLASH->KEYR2 = 0x45670123;
  FLASH->KEYR2 = 0xcdef89ab;
}

FLASH->CR2 |= (FLASH_CR_BER | FLASH_CR_START);  // start erase of inactive bank
while (FLASH->SR2 & FLASH_SR_QW)
{ vTaskDelay(pdMS_TO_TICKS(100)); }  // yield to other tasks while we wait

FLASH->CR2 |= FLASH_CR_PG;  // enable programming on this bank

// programming flash at 0x0810 0000 happens here (not shown)

// New image hash checked out; swap banks, update option bytes, and reset
if (FLASH->OPTCR & FLASH_OPTCR_OPTLOCK)
{
  FLASH->OPTKEYR = 0x08192a3b;
  FLASH->OPTKEYR = 0x4c5d6e7f;
}

FLASH->OPTSR_PRG ^= FLASH_OPTSR_SWAP_BANK_OPT;  // toggle the SWAP_BANK_OPT bit
FLASH->OPTCR |= FLASH_OPTCR_OPTSTART;           // start option-byte write
while (FLASH->OPTSR_CUR & FLASH_OPTSR_OPT_BUSY) // wait for it to complete
{ vTaskDelay(pdMS_TO_TICKS(100)); }

SCB->AIRCR = (0x5fa << SCB_AIRCR_VECTKEY_Pos) | SCB_AIRCR_SYSRESETREQ_Msk;

The other thread was about a separate misunderstanding I had about the ____1 registers versus the ____2 registers in the FLASH block; I didn't realize that swapping the banks also swaps the register sets, so that the ____2 registers *always* relate to whichever bank is mapped at 0x0810 0000.  The issue I'm seeing now is different, hence the new thread.  I've posted my updated code in case that clarifies the question.
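To make the address-to-bank relationship concrete, here is a small host-side model of which physical bank a read lands in, given SWAP_BANK. It is a sketch based only on what this thread reports: the low window (0x0800 0000) always holds the running image, the high window (0x0810 0000) is always the "other" bank driven by the ____2 registers, and (per the accepted solution) the system-memory calibration data belongs to physical Bank 1 regardless of SWAP_BANK. The 128 KB system-memory size and all function names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLASH_WIN_LOW_BASE   0x08000000u  /* running image, whichever bank is swapped in */
#define FLASH_WIN_HIGH_BASE  0x08100000u  /* "other" bank, FLASH_xxx2 registers */
#define FLASH_WIN_SIZE       0x00100000u  /* 1 MB per bank */
#define SYSMEM_BANK1_BASE    0x1FF00000u  /* assumption: system memory of physical bank 1 */
#define SYSMEM_BANK1_SIZE    0x00020000u  /* assumption: 128 KB */

/* Physical bank (1 or 2) that a read of addr hits, or 0 if outside these windows. */
static int physical_bank(uint32_t addr, bool swap_bank)
{
    if (addr >= FLASH_WIN_LOW_BASE && addr < FLASH_WIN_LOW_BASE + FLASH_WIN_SIZE)
        return swap_bank ? 2 : 1;
    if (addr >= FLASH_WIN_HIGH_BASE && addr < FLASH_WIN_HIGH_BASE + FLASH_WIN_SIZE)
        return swap_bank ? 1 : 2;
    if (addr >= SYSMEM_BANK1_BASE && addr < SYSMEM_BANK1_BASE + SYSMEM_BANK1_SIZE)
        return 1;  /* inferred from the thread: not affected by SWAP_BANK */
    return 0;
}

/* True if a read of addr would be held off while physical bank erase_bank is erased. */
static bool read_stalls_during_erase(uint32_t addr, bool swap_bank, int erase_bank)
{
    return physical_bank(addr, swap_bank) == erase_bank;
}
```

In this model, erasing physical Bank 1 with SWAP_BANK=1 does not conflict with code fetched from 0x0800 0000, but does conflict with a read of 0x1FF1E860, which matches the behavior eventually found in this thread.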

waclawek.jan
Super User

VTOR is set how?

JW

VTOR does not need to be changed because the interrupt vectors are in the same place as usual, 0x0800 0000, regardless of which bank is mapped there.  An updated firmware image contains everything that the original image contains, including the interrupt vectors and the code which performs firmware updates.  (In this scheme, there is no separate bootloader.)

mark03
Associate III

@mƎALLEm and other ST employees monitoring the forum, this seems like it wouldn't be too difficult to reproduce.  Could one of you try that?  If it reproduces, I think it should be raised as a possible silicon erratum.  I'm happy to experiment on my end first if you have any [well reasoned and supported] suggestions for things to try.

After reset, VTOR points to 0, not to 0x0800 0000.

I'm not sure what mechanism makes it point elsewhere, so it may be worth checking whether it is actually interrupt code execution that stalls the processor during the FLASH bank erase.

JW

Ah, right, well it should be irrelevant either way.  There's no reason in this scheme for VTOR to be changed, as everything is running at the exact same memory addresses as it does in the most boring "hello world" demo application.

I'm still waiting for an ST rep to comment on whether this might be a silicon bug...  And while we wait, I am going to try out a hypothesis:

HYPOTHESIS

As explained in the Reference Manual and posts by knowledgeable folks on this forum, the FLASH module AXI interface incorporates logic to prevent bus accesses to a flash bank during a write or erase operation.  This is what "stalls the CPU"---it's not the core per se, but AXI bus accesses to flash which are held off until the write/erase operation completes.  I hypothesize that the designers made a mistake here and did not take SWAP_BANK into account.  Specifically, they assumed that accesses to 0x0800 0000 through 0x080F FFFF are in bank 1, and accesses to 0x0810 0000 through 0x081F FFFF are in bank 2, always.

Is this true?  I don't know, but I believe it would explain all of the observed phenomena.  When SWAP_BANK=0, the banks are mapped into the address space exactly as described above, and the AXI bus lock-out works.  This explains why I can bank-erase bank 2 without causing the code running in bank 1 to stall.  On the other hand, when SWAP_BANK=1, it also explains why I *cannot* bank-erase bank 1 without causing a stall for the code running in bank 2:  The lock-out circuit "knows" that I am erasing bank 1, and it "sees" accesses to 0x0800 0000 through 0x080F FFFF, which it "thinks" are in bank 1 (but are really in bank 2 due to SWAP_BANK=1).  Thus, it erroneously stalls the bus.

This hypothesis even explains why the actual flash programming (write operations) following the bank erase never causes a stall, regardless of the SWAP_BANK status.  Those writes are always performed to the range 0x0810 0000 to 0x081F FFFF, which "looks like" bank 2 to the lock-out logic, regardless of whether it really is bank 1 or bank 2.  Since the lock-out logic believes the executing code to be in bank 1 (because the memory addresses are 0x0800 0000 to 0x080F FFFF), it sees no conflict and does not stall the bus.

Crucially, my hypothesis makes a prediction.  If, when SWAP_BANK=1, the lock-out circuit is causing an AXI bus stall during an erase of bank 1, with code executing in bank 2, then it follows that I should be able to perform an erase of bank 2 without causing a stall.  Of course, that would nuke the running code.  So a better test would be to perform a single sector erase, beyond the end of the program image (currently it fits in just one sector).  Before doing that, we'll need to verify that the exact same single-sector erase in bank 1 still causes an erroneous bus stall.
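Choosing the sector for that single-sector test is simple arithmetic; the helper below computes the first sector past the image. It is a sketch assuming the STM32H755 layout of 8 sectors of 128 KB per 1 MB bank, with the image linked to start at sector 0 (please confirm against the reference manual). The function names are hypothetical. The resulting index would be written to the SNB field of the FLASH_CR register together with SER (FLASH_CR_SNB and FLASH_CR_SER in the CMSIS device header), rather than setting BER.

```c
#include <stdint.h>

/* Assumptions: 1 MB per bank, 8 sectors of 128 KB, image starts at sector 0. */
#define FLASH_SECTOR_SIZE      (128u * 1024u)
#define FLASH_SECTORS_PER_BANK 8u
#define FLASH_OTHER_BANK_BASE  0x08100000u  /* bank erased through the FLASH_xxx2 registers */

/* Index of the first sector not used by an image of image_size bytes.
   Returns FLASH_SECTORS_PER_BANK when the image leaves no sector free. */
static uint32_t first_free_sector(uint32_t image_size)
{
    uint32_t used = (image_size + FLASH_SECTOR_SIZE - 1u) / FLASH_SECTOR_SIZE;  /* round up */
    return (used < FLASH_SECTORS_PER_BANK) ? used : FLASH_SECTORS_PER_BANK;
}

/* Start address of a given sector within the high (0x0810 0000) window. */
static uint32_t other_bank_sector_base(uint32_t sector)
{
    return FLASH_OTHER_BANK_BASE + sector * FLASH_SECTOR_SIZE;
}
```

For an image that fits in one sector, this gives sector 1, that is, 0x0812 0000 in the high window. The same sector index can then be erased in each bank to compare the two cases.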

I'll post a follow-up with testing results later today or tomorrow.  In the meantime, it would be great if an ST expert would weigh in on this.

Hello,

The phenomenon could be due to a CM7 speculative access to the System memory region.
So try disabling access to the System memory region with this MPU configuration during the program/erase process:

static void MPU_Config(void)
{
  MPU_Region_InitTypeDef MPU_InitStruct;

  /* Disable the MPU */
  HAL_MPU_Disable();

  /* Configure the MPU as Strongly ordered for not defined regions */
  MPU_InitStruct.Enable = MPU_REGION_ENABLE;
  MPU_InitStruct.BaseAddress = 0x00;
  MPU_InitStruct.Size = MPU_REGION_SIZE_4GB;              
  MPU_InitStruct.AccessPermission = MPU_REGION_NO_ACCESS; 
  MPU_InitStruct.IsBufferable = MPU_ACCESS_NOT_BUFFERABLE;
  MPU_InitStruct.IsCacheable = MPU_ACCESS_NOT_CACHEABLE;
  MPU_InitStruct.IsShareable = MPU_ACCESS_SHAREABLE;
  MPU_InitStruct.Number = MPU_REGION_NUMBER0;
  MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL0;
  MPU_InitStruct.SubRegionDisable = 0x87; // Bit n = 1 excludes 512 MB subregion n; 0x87 leaves only 0x60000000-0xDFFFFFFF in this region
  MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;
  HAL_MPU_ConfigRegion(&MPU_InitStruct);
	
	
  MPU_InitStruct.Enable = MPU_REGION_ENABLE;
  MPU_InitStruct.BaseAddress = 0x1FF00000;
  MPU_InitStruct.Size = MPU_REGION_SIZE_512KB;
  MPU_InitStruct.AccessPermission = MPU_REGION_NO_ACCESS;
  MPU_InitStruct.IsBufferable = MPU_ACCESS_NOT_BUFFERABLE;
  MPU_InitStruct.IsCacheable = MPU_ACCESS_NOT_CACHEABLE;
  MPU_InitStruct.IsShareable = MPU_ACCESS_SHAREABLE;
  MPU_InitStruct.Number = MPU_REGION_NUMBER1;
  MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL0;
  MPU_InitStruct.SubRegionDisable = 0x0;
  MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;
  HAL_MPU_ConfigRegion(&MPU_InitStruct);

  /* Enable the MPU */
  HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
} 
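The SubRegionDisable masks above are easy to misread, so here is a small host-side check of which addresses each region actually covers. It is a sketch assuming the ARMv7-M MPU rules: a region is 2^(SIZE+1) bytes (so MPU_REGION_SIZE_4GB is 2^32 and MPU_REGION_SIZE_512KB is 2^19), it is split into eight equal subregions, bit n of the mask set means subregion n is excluded, and regions smaller than 256 bytes have no subregions. The function name `mpu_region_covers` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr lies in an enabled part of an MPU region of 2^size_log2 bytes at base.
   srd mirrors SubRegionDisable: bit n set means subregion n is NOT part of the region. */
static bool mpu_region_covers(uint32_t base, unsigned size_log2, uint8_t srd, uint32_t addr)
{
    uint64_t size = (uint64_t)1 << size_log2;  /* 64-bit so a 4 GB region fits */
    if (addr < base || (uint64_t)(addr - base) >= size)
        return false;
    if (size_log2 >= 8u) {  /* subregions exist only for regions of 256 bytes or more */
        unsigned sub = (unsigned)((uint64_t)(addr - base) / (size / 8u));
        if (srd & (1u << sub))
            return false;
    }
    return true;
}
```

With the values used above, region 0 (base 0, 2^32, mask 0x87) covers only 0x60000000 to 0xDFFFFFFF, and region 1 (base 0x1FF00000, 2^19, mask 0) covers 0x1FF00000 to 0x1FF7FFFF. That range includes the VREFINT calibration address 0x1FF1E860 mentioned in the accepted solution, which is consistent with the MemManage fault reported there.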

If that doesn't work, try disabling all instruction access to the code region except for BANK2/ITCM (if you are using it) using the MPU:

[Screenshot mALLEm_0-1759397318596.png: suggested MPU configuration, not reproduced here]

@mark03 wrote:

it would be great if an ST expert would weigh in on this.


If none of the above solves the issue: regarding your proposal to have an ST employee reproduce it, you would need to provide a complete but minimal project that reproduces the behavior, preferably on one of the ST boards.