STM32H7 execute code in parallel FLASH

regjoe · ‎2024-11-07

Hello,

I want to execute code located in external 16-bit NOR FLASH, e.g. on an STM32H7x3 evaluation board.

I can run the NOR FLASH demo (erase, write, read flash) successfully but have no clue how to create and flash code.

After two days of googling I found plenty of examples for xSPI, but none seems to suit this type of flash.

Can somebody point me in the right direction, e.g. how to configure the linker script and flash the code using the debugger or Cube Programmer? Any example code or ANs?

Thank you

regjoe · ‎2024-12-02

I'm running the benchmarks on this H753-Eval2 HW. I'll check the FMC NOR FLASH initialization and get back. Thank you.

Tesla DeLorean · ‎2024-12-02

For speed, check the caching and MPU_Config()

Generally, external buses and memory are markedly slower than internal memory. The external bus is going to run at 100 MHz or less, while the MCU core runs at 400-480 MHz.

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

Tesla DeLorean · ‎2024-12-02

NOR Flash is exceedingly slow to erase/write, and doesn't allow for doing so concurrently.

The QSPI could be used as a 4KB sector Mass Storage, and with FATFS

Preferred storage would be large NAND Flash in the form of a MicroSD Card or eMMC chip, where a lot of the complexity of wear management and erased pools is handled by the hardware.


regjoe · ‎2024-12-02

@Tesla DeLorean wrote:
For speed, check the caching and MPU_Config()
Generally external buses and memory are markedly slower than internal memory.

I set the MPU region to non-cacheable and the performance dropped significantly.

Is there any reason why code execution in parallel 16bit/70ns NOR should be slower than in 8bit/50MHz DTR QSPI?

mƎALLEm · ‎2024-12-03

Forget about the FFT for the moment: it's a bit tricky, with many files (and many dependencies) to relocate into the correct region.

Try running a simple algorithm that you understand and know how to map.

Also check the FMC timings: are they optimal for the memory?


regjoe · ‎2024-12-03

Hello,

I'm trying to understand the FMC NOR initialization, e.g. as done in the FMC_NOR demo provided by CubeMX for the STM32H743_EVAL2 board.

I debugged the code on the STM32H753_EVAL2 board.

The MCU clock in the demo is set to 400MHz, HCLK to 200MHz.

The NOR timing parameter configuration is

  NOR_Timing.AddressSetupTime      = 9;
  NOR_Timing.AddressHoldTime       = 1;
  NOR_Timing.DataSetupTime         = 5;
  NOR_Timing.BusTurnAroundDuration = 4;
  NOR_Timing.CLKDivision           = 4;
  NOR_Timing.DataLatency           = 2;
  NOR_Timing.AccessMode            = FMC_ACCESS_MODE_B;

For mode 2B the read timing for the H7 FMC is specified in the reference manual RM0433 as

[Figure: mode 2B read access timing diagram from RM0433]

As far as I understand this timing, data is read by the H7 MCU at the /NE 0->1 edge.

According to the board schematic, the EVAL2 RevE is equipped with a NOR FLASH MT28EW128ABA1LPC-0SIT. The datasheet says this device has an access time of 70ns, so data is valid at most 70ns after the /NE 1->0 edge.

According to the H753ii datasheet DS12117 Rev 9 table 163, the H7 has a setup time of min. 11ns.

So in my understanding, the entire memory transaction time for a 70ns flash should be at least 70ns+11ns=81ns.

In the demo the FMC is configured to ADDSET+DATAST = 9 + 5 = 14, which calculates to 14*5ns = 70ns.

Using a scope, I measure for the memory read transaction (/NE 1->0, 0->1) ~75ns and for /NE low to data valid ca. 60-65ns.

Is my understanding and calculation regarding the FMC NOR flash read timing and parameters correct?

If so, the demo setup only works for a 60ns flash. Am I right?
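The comparison above can be reproduced with a small host-side C sketch. It uses the simplified model from this post (read window = (ADDSET + DATAST) x tHCLK, required window = tACC + tSU). The helper names are hypothetical, and the real FMC may add extra HCLK cycles per access, so treat the result as an estimate rather than a statement about the hardware.

```c
/* Round a time in ns up to whole HCLK cycles (integer ceiling division). */
static inline unsigned ns_to_cycles(unsigned t_ns, unsigned t_hclk_ns)
{
    return (t_ns + t_hclk_ns - 1u) / t_hclk_ns;
}

/* Simplified read window as assumed in this post:
 * (ADDSET + DATAST) HCLK cycles, expressed in ns. */
static inline unsigned read_window_ns(unsigned addset, unsigned datast,
                                      unsigned t_hclk_ns)
{
    return (addset + datast) * t_hclk_ns;
}
```

With the demo values (ADDSET = 9, DATAST = 5, tHCLK = 5ns), `read_window_ns(9, 5, 5)` gives 70ns, while `ns_to_cycles(70 + 11, 5)` gives 17 cycles (85ns) for the 81ns requirement under this model.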

BTW:

According to the document how-to-configure-the-fmc-peripheral-to-interface-an-stm32-mcu, the timing parameters for the MT28EW flash should be calculated as

tACC = 60ns

tHCLK = 1/200MHz = 5ns

ADDSET = tACC / tHCLK = 60ns / 5ns = 12 HCLK cycles,

DATAST = tWP / tHCLK = 35ns / 5ns = 7 HCLK cycles

ADDSET + DATAST = 12+7 = 19 -> 95ns

Who is right?

Thank you

regjoe · ‎2024-12-03

@regjoe wrote:
Is there any reason why code execution in parallel 16bit/70ns NOR should be slower than in 8bit/50MHz DTR QSPI?

In most parallel NOR flash datasheets I found that a "Page Read" feature is supported.

This seems to be unavailable in the H7 NOR controller but available in the NAND controller. IMHO a page read could improve a cache fill operation. Also, the xSPI controller and flash seem to be optimized for sequential read operations.

Could this be the reason why the FFT demo running in parallel NOR is 2x slower compared to DQSPI NOR?

Or is the QSPI/OSPI peripheral somewhat optimized for code execution but the FMC NOR is not intended to be used for it?

I wonder why I cannot find any STM32 application that is running code from parallel NOR.

Thanks

regjoe · ‎2024-12-04

Regarding the scope screenshot (showing a fetch of code located in the parallel NOR flash): the measured cycle time for a sequence of random read accesses is ca. 75ns, assuming each pulse of /OE is one 16bit data transfer.

This means the data transfer rate is ca. 2 bytes / 75ns = 26.7 MB/s, which is about 1/4 of the theoretical max. 100MB/s of DQSPI/50MHz/DTR.

I guess this is the main reason why the FFT code execution from parallel flash is 2x slower than from DQSPI.
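As a rough cross-check of these numbers, here is a minimal C sketch of the two rate estimates. The function names are hypothetical, and the serial figure is a peak rate that ignores command, address and dummy cycles. It also assumes "DQSPI" means two quad flashes in parallel, i.e. 8 data lines.

```c
/* Sustained rate in bytes/s for back-to-back accesses of bytes_per_access
 * bytes, each taking cycle_ns nanoseconds. */
static inline double access_rate_bytes_per_s(unsigned bytes_per_access,
                                             double cycle_ns)
{
    return (double)bytes_per_access / (cycle_ns * 1e-9);
}

/* Peak rate of a serial flash bus in bytes/s:
 * data lines * clock edges used per clock * clock frequency / 8 bits. */
static inline double serial_peak_bytes_per_s(unsigned data_lines,
                                             unsigned edges_per_clock,
                                             double clk_hz)
{
    return (double)data_lines * (double)edges_per_clock * clk_hz / 8.0;
}
```

`access_rate_bytes_per_s(2, 75.0)` comes out at roughly 26.7e6 bytes/s for the parallel NOR, and `serial_peak_bytes_per_s(8, 2, 50e6)` at 100e6 bytes/s for the dual QSPI in DTR mode, so the ratio is about 1/4 as stated above.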

regjoe · ‎2024-12-06

Hello,

To speed up sequential read access, most parallel NOR flash devices implement a so-called Asynchronous Page Read mode.

The first read from an address is considerably slow, e.g. 70ns. If the following address is in the same page, subsequent reads are faster, e.g. 15-25ns (see https://community.infineon.com/t5/Knowledge-Base-Articles/Initial-Access-Time-and-Page-Access-Time-in-NOR-Flash/ta-p/254636#).
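To illustrate what page mode could mean for a cache line fill, here is a small C sketch using the access times quoted above. The function names are hypothetical. It assumes a 32-byte Cortex-M7 cache line read as 16 accesses over the 16bit bus, and that the whole line lies within one flash page (page size depends on the device). It is a purely hypothetical comparison and does not say anything about what the FMC actually supports.

```c
/* Time to read n_words consecutive words when every access costs the
 * full initial access time (no page mode). */
static inline unsigned random_read_ns(unsigned n_words, unsigned t_acc_ns)
{
    return n_words * t_acc_ns;
}

/* Time for the same read with asynchronous page mode: one initial access,
 * then the shorter page access time for each remaining word.
 * Assumes all words lie in the same flash page. */
static inline unsigned page_read_ns(unsigned n_words, unsigned t_acc_ns,
                                    unsigned t_page_ns)
{
    if (n_words == 0u) {
        return 0u;
    }
    return t_acc_ns + (n_words - 1u) * t_page_ns;
}
```

For 16 words at 70ns initial access, `random_read_ns(16, 70)` gives 1120ns, while `page_read_ns(16, 70, 25)` gives 445ns and `page_read_ns(16, 70, 15)` gives 295ns.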

Some flash devices implement a so-called Synchronous Read Mode. This would require additional signals, and it seems that these devices are only available from Infineon and are quite expensive.

Unfortunately it seems that the H7 has not implemented the Asynchronous Page Read mode. At least I cannot find an appropriate timing in the data sheets, and if it does exist, I don't know how to configure the FMC to support this feature.

Any idea? Is Page Read Mode not available in H7 devices?

regjoe · ‎2024-12-15

Ok, I got the ACK from ST that the burst feature is not supported in asynchronous mode. This explains why code execution from parallel asynchronous NOR is slower than from xSPI NOR and is not recommended.

I think I'll keep on using the 2MB flash µC's from ST due to faster code execution from internal flash and probably use a parallel NOR for scattered const data. As already mentioned, the dual QSPI is already used in non-memory-mapped mode for writing log data.