STM32H7 XiP poor performance with QSPI NOR Flash

Klang.Martin · ‎2020-04-15

I've been developing a solution around the STM32H750 that requires external flash to run relatively large firmware images, up to around 2Mb.

The NOR Flash part used is the Cypress S25FL064L, in QSPI configuration. It has quad read rates of up to 54 MB/s.

A bootloader configures the QSPI interface, and then jumps to the main firmware on the memory mapped external flash.

Using the DWT to measure the number of cycles initially provided some promising numbers. In a 1-2-2 configuration (Single Instruction, Dual IO read) without SIOO (sending instruction on every read), our computations ran at around 180% cycles compared to running from internal flash. Not so bad, considering this is a slow configuration option with lots of overhead.

  /* QSPI clock = 480MHz / (1+9) = 48MHz */
  QSPIHandle.Init.ClockPrescaler     = 9;
  QSPIHandle.Init.FifoThreshold      = 4;
  QSPIHandle.Init.SampleShifting     = QSPI_SAMPLE_SHIFTING_NONE;
  QSPIHandle.Init.FlashSize          = 22; // 2^(22+1) = 8M / 64Mbit
  QSPIHandle.Init.ChipSelectHighTime = QSPI_CS_HIGH_TIME_1_CYCLE;
  QSPIHandle.Init.ClockMode          = QSPI_CLOCK_MODE_0;

However the elapsed time turned out to be 40x greater than internal flash, taking 84 seconds to run a neural network classification instead of just over 2 seconds. Presumably that's down to lots of wait states incurred.

With a full 4-4-4 QSPI configuration, setting the 'Alternate byte' to read continuously and with SIOO enabled, we expected a many-fold improvement. What we found was that the elapsed time roughly halved, while the cycle count went up.

This is what the read command looks like (sent after other commands have configured the NOR flash):

    // Quad I/O 4-4-4, mode cycles 2
    sCommand.Instruction       = QSPI_FLASH_CMD_QIOR;
    sCommand.InstructionMode   = QSPI_INSTRUCTION_4_LINES;
    sCommand.AddressMode       = QSPI_ADDRESS_4_LINES;
    sCommand.DataMode          = QSPI_DATA_4_LINES;
    sCommand.AddressSize       = QSPI_ADDRESS_24_BITS;
    sCommand.AlternateByteMode = QSPI_ALTERNATE_BYTES_4_LINES;
    sCommand.AlternateBytes    = 0xA0; // Continuous read feature is enabled if the mode bits value is Axh.
    sCommand.DummyCycles       = 8;
    sCommand.SIOOMode          = QSPI_SIOO_INST_ONLY_FIRST_CMD;
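
As a rough way to see how much of each access is setup overhead, the phases of the read command above can be tallied in QSPI clock cycles. This is only a sketch: the struct and function names are made up for illustration, SDR transfers are assumed, and the 32-byte burst used in the note below is an assumption based on the Cortex-M7 cache line size, not something measured here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one memory-mapped read (not a HAL type). */
typedef struct {
    unsigned instr_lines;  /* 0 = instruction phase skipped (SIOO after first command) */
    unsigned addr_lines;   /* lines used for the address phase */
    unsigned addr_bits;    /* e.g. 24 */
    unsigned alt_lines;    /* 0 = no alternate/mode byte phase */
    unsigned alt_bits;     /* e.g. 8 for one mode byte */
    unsigned dummy_cycles; /* dummy clocks before data */
    unsigned data_lines;   /* lines used for the data phase */
} qspi_read_fmt_t;

/* Clocks needed to shift `bits` over `lines` wires in SDR; 0 if the phase is absent. */
unsigned qspi_phase_clocks(unsigned bits, unsigned lines)
{
    return lines ? (bits + lines - 1u) / lines : 0u;
}

/* Total QSPI clocks for one read of `data_bytes` bytes, including setup phases. */
unsigned qspi_read_clocks(const qspi_read_fmt_t *f, unsigned data_bytes)
{
    assert(f != NULL);
    return qspi_phase_clocks(8u, f->instr_lines)
         + qspi_phase_clocks(f->addr_bits, f->addr_lines)
         + qspi_phase_clocks(f->alt_bits, f->alt_lines)
         + f->dummy_cycles
         + qspi_phase_clocks(8u * data_bytes, f->data_lines);
}
```

With the 4-4-4 settings shown (24-bit address, one mode byte, 8 dummy cycles, instruction skipped after the first command), a 32-byte read works out to 6 + 2 + 8 + 64 = 80 clocks, of which 16 are setup.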

A computation cycle now takes 20x longer, while the cycle count is 3.5x greater.

We are in this instance running an STM32H753 at 480 MHz.

It's difficult to make much sense of this. There are things we can do to improve performance, such as putting (part of) the code into RAM or enabling DDR mode, but the numbers are so off the mark that simply doubling or even quadrupling performance is not going to fix the problem.

Our results are at odds with e.g. AN4760, which indicates (Performance Analysis, p.78) XiP with code and data in flash to run at 1.52x the speed (and with code in RAM at 1.12x). If we could get close to this I'd be happy, but at the moment we're at 20x.

Another question: when running in XiP mode, memory mapped to 0x90000000 (bank 2), it seems we are not able to put breakpoints in while debugging with OpenOCD and gdb. Is there a particular reason why?

Any help or insight offered would be very much appreciated!

Jack Peacock_2 · ‎2020-04-15

Doing a rough estimate on memory bandwidth, 48MHz x 4 bit yields a burst rate of 24MB/sec for a 256 byte page, ignoring address setup overhead from the QSPI. Internal flash is 64 (not sure about H7, 128?) bit wide, compared to 4 bit at a time for QSPI. All things equal you'll never do better than 16x slower.
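
The estimate above can be written out as a small helper. This is a sketch with a made-up function name, and it ignores the command, address and dummy phases just as the estimate does.

```c
#include <assert.h>
#include <stdint.h>

/* Peak burst rate in bytes per second for a serial flash bus.
 * clock_hz : bus clock
 * io_lines : 1, 2 or 4 data lines
 * ddr      : non-zero if data is transferred on both clock edges */
uint64_t qspi_burst_bytes_per_sec(uint64_t clock_hz, unsigned io_lines, int ddr)
{
    assert(io_lines == 1u || io_lines == 2u || io_lines == 4u);
    uint64_t bits_per_clock = (uint64_t)io_lines * (ddr ? 2u : 1u);
    return clock_hz * bits_per_clock / 8u;
}
```

For example, `qspi_burst_bytes_per_sec(48000000, 4, 0)` gives 24000000, i.e. the 24 MB/sec figure for 48 MHz quad SDR.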

Your 20x likely reflects the I-cache hits improving the 16x limitation. Any jump outside the 256 byte page, and no cache hit, you incur a new address setup. Ideally for XiP you want sequential code, short jumps, no subroutine calls, to take advantage of the page lookahead and minimize new page setups.

So for QSPI the ideal instruction rate is around 6M x 32 or 12M x 16 bit instructions/second. Cache will speed that up to what you're seeing, but there's no way around the bandwidth limitation. You would be better off loading external DRAM from the QSPI and running from someplace with a much higher bandwidth.

You might also look at some newer serial NOR parts used to load FPGAs. These parts can run up to 108 MHz and DDR (double data rate, both clock edges) for a substantial improvement on the QSPI bandwidth. DDR gives you 8 bits per clock/access, 108MB/sec burst rate maximum.

XiP is great for low power systems where the QSPI latency isn't so bad at low clock speeds. For an 'H7 you're wasting most of the CPU cycles waiting on external memory.

Jack Peacock

Tesla DeLorean · ‎2020-04-15

Isn't QUADSPI clocking at 240 MHz (max 250)? The part is supposedly rated to 108 MHz SDR, 54 MHz DDR (clocking on both edges).

Prescaler:

1: 120 MHz
2: 80 MHz
3: 60 MHz (54 @ 216 MHz)
4: 48 MHz
9: 24 MHz
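
The figures in the list follow from dividing the QUADSPI kernel clock by (prescaler + 1). A minimal sketch, assuming the 240 MHz kernel clock suggested above; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* QSPI bus clock for a given kernel clock and ClockPrescaler value.
 * The prescaler field is 8 bits wide and the divider is prescaler + 1. */
uint32_t qspi_clock_hz(uint32_t kernel_hz, uint32_t prescaler)
{
    assert(prescaler <= 255u);
    return kernel_hz / (prescaler + 1u);
}
```

So `qspi_clock_hz(240000000, 9)` is 24000000 (24 MHz), and a prescaler of 4 would be needed for 48 MHz.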

Plus this should be cacheable

Klang.Martin · ‎2020-04-17

That makes a lot of sense Jack. From your calculations, my conclusion would be that XiP on NOR Flash using QSPI only makes sense for low clock speeds.

But it doesn't seem to match the results in e.g. this document (AN4891: STM32H74x and STM32H75x system architecture and performance).

They compare several different configurations, including QSPI NOR Flash, and the spread of performance results is within 1.8x, running at 400 and 480 MHz.

Is it because they are only counting processor operations (cycles), not wait states? Tbh I'm not even sure what the difference is, but I found in my testing that the elapsed time would increase even when the cycle count didn't.
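
One possibility worth ruling out, assuming the cycle count is read from the 32-bit DWT cycle counter (CYCCNT): that register wraps after 2^32 cycles, which is roughly 9 seconds at 480 MHz, so a single start/stop difference over a longer run would under-report. The sketch below uses hypothetical names and plain C so it can be tried off-target; it accumulates into 64 bits and relies on being sampled more often than the wrap period.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Seconds until a 32-bit cycle counter wraps at the given core clock. */
double cyccnt_wrap_seconds(double core_hz)
{
    assert(core_hz > 0.0);
    return 4294967296.0 / core_hz;
}

/* 64-bit accumulator fed with successive 32-bit counter samples. */
typedef struct {
    uint32_t last;   /* previous raw counter sample */
    uint64_t total;  /* cycles accumulated since start */
} cycle_acc_t;

void cycle_acc_start(cycle_acc_t *acc, uint32_t now)
{
    assert(acc != NULL);
    acc->last = now;
    acc->total = 0;
}

/* Must be called at least once per wrap period to stay correct. */
void cycle_acc_update(cycle_acc_t *acc, uint32_t now)
{
    assert(acc != NULL);
    acc->total += (uint32_t)(now - acc->last); /* modular difference survives one wrap */
    acc->last = now;
}
```

On target, `now` would be a read of `DWT->CYCCNT`, sampled for instance from a periodic timer interrupt.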

Klang.Martin · ‎2020-04-17

You are of course right Clive, as always, thanks for pointing this out. I had some problems with higher clock speeds, probably due to the GPIOs being configured as `GPIO_SPEED_FREQ_LOW` (CubeMX defaults to this).

DDR at 54 MHz would be ideal, while 48 MHz (prescaler 4 instead of 9) with DDR could give me up to 4x speed improvement.

But for XiP I would still be wasting 4 clock cycles out of 5 compared to using internal flash.

waclawek.jan · ‎2020-04-17

> requires external flash to run relatively large firmware images, up to around 2Mb.

2 Megabits?

Doesn't the 'H743/753 come with 2MBytes of internal FLASH?

JW

Klang.Martin · ‎2020-04-20

That's 2 Megabytes, possibly more. And yes, there are parts with 2Mbytes FLASH, but they are too expensive for my application.

My understanding is that Flash is expensive to produce with the same small scale process as modern microcontrollers (40nm for H7?), and that is why ST and NXP are producing newer parts with little or no internal flash. Instead the current move is towards fast QSPI external flash, which is cheap and readily available.

But in use cases like ours, with a large size program that can't be pre-loaded into internal RAM, XiP has to be fast enough to be on a par with internal flash. This doesn't seem to be the case, in spite of what you might think reading ST's application notes on the subject. Or are we just doing it wrong?

waclawek.jan · ‎2020-04-20

> My understanding is that Flash is expensive to produce with the same small scale process as modern microcontrollers (40nm for H7?)

Yes, "pure" FLASH usually uses processes optimized for thinner lines (utilizing physical tricks to achieve them) - and maybe other specialized tricks I don't know of, differentiating them from general-purpose logic in terms of the details of the technology used.

> that is why ST and NXP are producing newer parts with little or no internal flash.

Little FLASH is a diametrically different case from no FLASH, technology-wise.

The little-FLASH ST parts are in fact the same large-FLASH dies, except most of the FLASH is not tested, as testing represents a non-negligible portion of manufacturing cost. This does not mean ST won't create a truly small-FLASH version of the die in the future, if demand justifies the $Ms for a fresh maskset.

> Instead the current move is towards fast QSPI external flash, which is cheap and readily available.

... and creating bottlenecks.

Aren't the claimed speeds achieved for synthetic benchmarks such as CoreMark? They notoriously have severely limited validity for real-world applications. The touted performance boost relies heavily on the caches, i.e. works best for code that spends most of its time executing short repeating patterns (relatively tight loops). Cache won't help if you non-repetitively execute more code than fits into the instruction cache, and non-repetitively use more constant data than fits into the data cache.

JW