STM32H750VB Quad-SPI questions

Yevpator · 2019-03-22

Hello,

Because of its attractive price and some specific features, the H7 Value Line looks like a good fit for a new design. However, 128 kB of flash is not enough for me, so I need an external flash, which means QSPI. I have a few basic questions about that.

1. I am almost sure that Quad-SPI works with the Value Line H7 as well, but since I saw in some doc that an H7 Discovery works with Dual SPI only, I have some hesitation. Can anyone here confirm that?

2. If I execute from Quad-SPI (QSPI memory-mapped mode), how much will performance drop compared with normal execution from internal flash at 400 MHz?

3. Does Quad-SPI make debugging more complicated? What issues did you face, especially with Atollic?

4. Does ST-Link support H7 programming with Quad-SPI enabled?

5. What other issues with H7 + Quad-SPI would you recommend I take into account?

Thanks!

Andreas Bolsch · 2019-03-22

From Table 7 you will deduce that all 2x4 data lines, both chip selects and the clock are available even in LQFP100, so it's possible to use two SPI flash devices in parallel in QPI mode. For DTR chips this will give 1 byte per clock edge, so this could be pretty fast.

However, using all 11 pins for flash might severely impact the availability of other peripherals, so you have to check that there is no conflict for your application.

The performance penalty is impossible to predict. For pure sequential access, you can calculate the throughput as above. But if you need big data areas with random access: each single byte read requires one byte instruction, 3 or 4 address bytes, some dummy cycles, and the actual read, so ...
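The cost of a single random read versus a sequential burst can be estimated with a rough sketch like the one below. It assumes SDR transfers with every phase (instruction, address, data) on the same number of lines, as in QPI mode on one flash chip, and it ignores chip-select high time and any prefetching. The function names are made up for illustration, and the dummy cycle count has to come from the flash datasheet.

```c
#include <stdint.h>

/* Clock cycles needed to shift `bits` bits over `lines` data lines,
 * one bit per line per clock (SDR), rounded up. */
static uint32_t cycles_for_bits(uint32_t bits, uint32_t lines)
{
    return (bits + lines - 1u) / lines;
}

/* Clock cycles for one read transaction:
 * 1 instruction byte + address bytes + dummy cycles + data bytes.
 * Hypothetical helper, not part of any ST library. */
uint32_t qspi_read_cycles(uint32_t addr_bytes, uint32_t dummy_cycles,
                          uint32_t data_bytes, uint32_t lines)
{
    return cycles_for_bits(8u, lines)
         + cycles_for_bits(8u * addr_bytes, lines)
         + dummy_cycles
         + cycles_for_bits(8u * data_bytes, lines);
}
```

With 3 address bytes, 4 lines and an assumed 8 dummy cycles, `qspi_read_cycles(3, 8, 1, 4)` gives 18 clocks for one isolated byte, while `qspi_read_cycles(3, 8, 32, 4)` gives 80 clocks for a 32-byte burst, i.e. 2.5 clocks per byte. That gap is why random access hurts so much more than sequential access.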

When the flash is memory mapped, it looks more or less like internal memory, so debugging is not affected except for soft breakpoints, because for those the actual instruction is temporarily replaced by a breakpoint instruction. I don't know whether any debugger is aware of external flash and could handle this. But single-stepping and hardware breakpoints are no problem.

For ST-Link and STM32CubeProgrammer (and probably anything else): as the pin mapping is variable and the various flash chips differ in some details and support different modes (one-line, two-line, ...), you can't expect those to work "out-of-the-box" with the external flash. You will have to provide your own board/chip-specific external loader; there are already a lot of postings here regarding this. Or you could use OpenOCD with either this

http://openocd.zylin.com/#/c/4321/

patch or that one

http://openocd.zylin.com/#/c/4760/

Regarding program/read speed: this depends heavily on the JTAG/SWD clock. But as long as your code doesn't exceed 1 or 2 MBytes in size, that's no real issue. Mass erase gets a bit slow, though ...

For the H7, L4+ and MP1 there is a hardware issue (probably exactly the same for all three families). Avoid accessing the last few bytes in memory-mapped mode, or set the FSIZE field to a higher value than the actual capacity. The last byte is incorrectly read as 0x00, and overly persistent accesses to the last few bytes via the debugger cause the debug interface to go berserk. For me (Nucleo with ST-Link) the only cure was power-cycling the board.
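As a sketch of the FSIZE workaround: the QUADSPI device configuration register encodes the flash capacity as 2^(FSIZE+1) bytes, so bumping the field by one doubles the declared size and moves the problematic "last bytes" beyond the real chip. The helper names below are hypothetical and only do the arithmetic; writing the value into the register (or into the HAL init structure) is left out.

```c
#include <stdint.h>

/* Smallest FSIZE value such that 2^(FSIZE+1) >= capacity_bytes.
 * FSIZE is a 5-bit field, so the result is capped at 31. */
uint32_t qspi_fsize_for(uint32_t capacity_bytes)
{
    uint32_t fsize = 0u;
    while (fsize < 31u && (1ull << (fsize + 1u)) < capacity_bytes) {
        fsize++;
    }
    return fsize;
}

/* Same value plus one, i.e. the declared capacity is doubled so that
 * the real end of the flash is no longer the end of the mapped range. */
uint32_t qspi_fsize_with_margin(uint32_t capacity_bytes)
{
    uint32_t fsize = qspi_fsize_for(capacity_bytes);
    return (fsize < 31u) ? fsize + 1u : fsize;
}
```

For a 16 MByte chip, `qspi_fsize_for(16u * 1024u * 1024u)` returns 23 and `qspi_fsize_with_margin(...)` returns 24. Accesses beyond the real capacity will then typically wrap around or return garbage, depending on the chip, so the application still must not rely on that upper range.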

MikeDB · 2019-03-22

Whilst the H750 doesn't have much (official) Flash, it does have a lot of RAM so an alternative may be to use a slow external SPI flash and actually run the program from RAM.

Probably the best thing is to develop on an H743 board first and migrate to the H750 once you have proven it is viable for the job.

Andreas Bolsch · 2019-03-23

Ah, forgot to mention AN5188, this contains some figures about internal vs. external code execution. But as the exact setup is not specified and there are too many variables, they're not that helpful. So anything between almost no impact and significant impact is possible.

Yevpator · 2019-03-23

Thank you so much for the prompt, detailed and valuable answer. I didn't get everything you wrote, but I will try once more to understand it by myself before asking again.

Best regards.

Yevpator · 2019-03-23

Thank you very much for your valuable answer!

Yevpator · 2019-03-23

I did consider your suggestion, but was hesitating about the relative addressing issue. If the code is to run in RAM, all the function calls must be relative to the RAM start address.

If I want to use both Flash (there is 128 kB) and RAM, part of the code should have absolute addressing and part relative. Is that possible at all? It seems easier to run all the code from RAM and waste the 128 kB of Flash. What do you think?

Also I am very concerned that the ST-Link GDB debugger and Atollic will go crazy with this approach (-:

It seems like I am going to lose all the HW breakpoints, as the 6 comparators are flash based, or am I mistaken?

MikeDB · 2019-03-23

I don't see why a program in RAM would be any different from one in Flash. Both have fixed addresses in reality, so relative and absolute addressing should both be possible.

I don't use the same tools, so I have no idea how they will react, but since the ST bootloaders offer run from RAM it would seem odd if the tools didn't support it.

I think if I was doing the same, I would put all kernel functions (a sort of mini OS, even) in the Flash so that the system boots quickly, and then have it load other functions from slow external memory to RAM and run them as needed.

Yevpator · 2019-03-31

The mentioned AN5188 contains a table I don't understand at all. According to it, the CoreMark score when running from internal Flash is the same as when running from Quad-SPI, which does not make sense. Even with 2 QPI chips, which as you explained deliver 1 byte per clock edge, the instructions are 4 bytes long, so fetching should be 4 times slower. How is it possible that the CoreMark is the same?

Andreas Bolsch · 2019-03-31

This depends heavily on the benchmark and on the precise setup. If the benchmark spends most of the time in tight loops, cache is enabled, and the QSPI address range is marked as cacheable, it's quite possible that there is no significant penalty. Additionally the QSPI has some sort of prefetch mechanism (the details lie in the dark, however), so even short forward branches might have little effect, but that's pure speculation.
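To put rough numbers on the cache argument, here is a simple weighted-average model of fetch cost. It is only a back-of-the-envelope sketch: the function name is made up, and the hit rate and miss penalty are assumptions you would have to measure on the real application, not figures from AN5188.

```c
#include <stdint.h>

/* Average cost of one fetch, in thousandths of a CPU cycle.
 * hit_permille: cache hit rate in parts per thousand (0..1000).
 * hit_cycles / miss_cycles: assumed cost of a hit and of a miss. */
uint32_t avg_fetch_millicycles(uint32_t hit_permille,
                               uint32_t hit_cycles, uint32_t miss_cycles)
{
    if (hit_permille > 1000u) {
        hit_permille = 1000u;   /* clamp invalid input */
    }
    return hit_permille * hit_cycles
         + (1000u - hit_permille) * miss_cycles;
}
```

With an assumed miss cost of 80 cycles, a tight loop that hits the cache 99.9% of the time averages `avg_fetch_millicycles(999, 1, 80)` = 1079, about 1.08 cycles per fetch, which is nearly indistinguishable from internal flash. At a 90% hit rate the same model gives 8900, almost 9 cycles per fetch.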

That's why I said that the impact is (almost) impossible to predict. I'm afraid you have to test with your actual application. If performance becomes an issue, the first thing to try would be placing only pure code in the QSPI flash and keeping constant data, tables etc. that are accessed often and at random in RAM. The last resort would be to move all data and code to RAM when booting.
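The "move everything to RAM when booting" option boils down to a word-wise copy loop like the one startup code normally uses for initialized data. The sketch below is generic C with a hypothetical function name. On a real target the three pointers would come from linker script symbols (not shown here, and specific to your project), the code would have to be linked for its RAM run address, and the copy routine itself must execute from internal flash.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy 32-bit words from [src, src_end) to dst and return the number
 * of words copied. On target, src/src_end would point into the
 * memory-mapped QSPI range and dst into RAM. */
size_t copy_words(uint32_t *dst, const uint32_t *src,
                  const uint32_t *src_end)
{
    size_t copied = 0u;
    while (src < src_end) {
        *dst++ = *src++;
        copied++;
    }
    return copied;
}
```

If the caches are enabled, code copied this way would normally also need cache maintenance (cleaning the data cache and invalidating the instruction cache) before jumping to it.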