STM32H735 HyperBus speed byte vs word (32bit) vs QSPI

LCE · ‎2023-02-13

Hello,

I got my hands on an H735 Discovery board, which has an external "HyperRAM" (Infineon S70KL1281DABHI023) connected to OCTOSPI2.

I need to use it in memory-mapped mode, which is quite easy to set up via the HAL (I have since switched to register-level setup to learn how it works and to get better control).

First of all, it's working, clock is 100 MHz. Speed checked with the ARM's cycle counter.

Speed: when reading / writing big buffers (>1kB)

32 bit word access, read / write ~ 50 MB/s (400 Mbit/s)
byte access, read / write ~ 8 MB/s (66 Mbit/s)

Okay, byte access with a simple for loop takes a few more cycles per byte, but the interesting thing is that with the same clock and a QUAD SPI RAM and byte access I got speeds up to almost 25 MB/s (200 Mbit/s).

Question:

Is that normal behavior of the peripheral, caused by different byte / word / octal / quad handling inside it?

Or is there some setting I can tune for octal mode?

The one big difference I see in the settings between Hyper / quad is the "refresh rate" which the Hyper RAM needs (a high NCS every now and then to trigger its internal refresh). But even playing around with that didn't change much in byte mode.
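As a rough sanity check for that refresh setting, the number of bus clocks that fit into the memory's maximum chip-select low time can be computed as below. This is only a sketch: the function name `max_cs_low_cycles` is made up for illustration, and the 4 µs value used in the usage note is a placeholder assumption, the real limit has to come from the HyperRAM datasheet.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bus clock cycles that fit into the maximum time the memory
 * allows chip select to stay low (t_csm_ns, to be taken from the datasheet). */
static uint32_t max_cs_low_cycles(uint32_t bus_hz, uint32_t t_csm_ns)
{
    assert(bus_hz > 0u);
    return (uint32_t)(((uint64_t)bus_hz * t_csm_ns) / 1000000000u);
}
```

With a 100 MHz bus clock and an assumed limit of 4000 ns, `max_cs_low_cycles(100000000u, 4000u)` gives 400 clocks, so the value programmed into the peripheral would have to stay below that, minus command and latency overhead.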

LCE · ‎2023-02-14

Any ideas?

LCE · ‎2023-02-16

So, I've been playing with that HyperRAM, and there are some amazing results and insights.

Still running the "HyperRAM" (Infineon S70KL1281DABHI023) at 100 MHz.

I wrote a super simple test function that writes and reads the complete RAM through 32-bit pointers: one straight for loop writes the index to each word, then another for loop reads back and compares.

I use the cycle counter to measure the timing, and because I could not believe the (good) results, I also checked with the (1 ms) SysTick.
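The cycle-counter measurement relies on unsigned wraparound arithmetic, which can be sketched on its own. The helper names `cycles_elapsed` and `cycles_to_us` are hypothetical, and the 400 MHz core clock in the usage note is an assumption, not something configured here.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two reads of a free-running 32-bit counter such as
 * DWT->CYCCNT. Unsigned subtraction is modulo 2^32, so a single wrap between
 * the two reads still gives the right difference. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t now)
{
    return now - start;
}

/* Convert a cycle count to microseconds for a given core clock in Hz. */
static double cycles_to_us(uint32_t cycles, uint32_t core_hz)
{
    assert(core_hz > 0u);
    return (double)cycles * 1e6 / (double)core_hz;
}
```

At an assumed 400 MHz the counter wraps after about 10.7 s, so anything longer than that cannot be measured with a single difference, which is one more reason to cross-check against SysTick.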

uint32_t *pu32MemAddr = (uint32_t *)OCTOSPI2_BASE;
 
/* write complete HyperRAM */
	u32TickStart = HAL_GetTick();
	u32CycStart = DWT->CYCCNT;
 
	for( i = 0; i < u32MaxLen; i++ )
	{
		pu32MemAddr[i] = i;
	}
	u32Cycles = DWT->CYCCNT - u32CycStart;
	u32Ticks = HAL_GetTick() - u32TickStart;
 
/* snip, some UART output... */
 
/* read complete HyperRAM and compare */
	pu32MemAddr = (uint32_t *)OCTOSPI2_BASE;
	u32TickStart = HAL_GetTick();
	u32CycStart = DWT->CYCCNT;
 
	for( i = 0; i < u32MaxLen; i++ )
	{
		u32Data = pu32MemAddr[i];
		if( u32Data != i ) u32Errors++;
	}
	u32Cycles = DWT->CYCCNT - u32CycStart;
	u32Ticks = HAL_GetTick() - u32TickStart;
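For reference, the same fill-and-verify idea as a pair of small functions. This is a sketch, not the code that produced the results: the function names are made up, and the `volatile` qualifier is an addition so that the compiler cannot merge or drop the accesses. On the target, `base` would point at the memory-mapped region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the word index to every 32-bit word of the region. */
static void mem_fill_index(volatile uint32_t *base, uint32_t n_words)
{
    assert(base != NULL);
    for (uint32_t i = 0; i < n_words; i++) {
        base[i] = i;
    }
}

/* Read the region back and count the words that differ from their index. */
static uint32_t mem_verify_index(const volatile uint32_t *base, uint32_t n_words)
{
    assert(base != NULL);
    uint32_t errors = 0;
    for (uint32_t i = 0; i < n_words; i++) {
        if (base[i] != i) {
            errors++;
        }
    }
    return errors;
}
```

Keeping fill and verify separate makes it easy to time each loop on its own and to check the error counter by corrupting a single word between the two calls.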

Here's the result (UART output):

****** OCTOSPI2 HyperRAM test ******
writing 0x00400000 = 4194304 32bit words ...
16777216 bytes written
35176768 CPU cycles -> 87941.9 us
-> 1526.21 Mbit/s
88 ms (ticks)
-> 1525.20 Mbit/s
 
reading and comparing 0x00400000 = 4194304 32bit words ...
16777216 bytes read
43689082 CPU cycles -> 109222.7 us
-> 1228.84 Mbit/s
110 ms (ticks)
-> 1220.16 Mbit/s
 
NULL errors
OctoSpi2HyTest() success, no errors

So... the read loop takes on average only about 2.6 CPU cycles per byte (roughly 10.4 per 32-bit iteration)?

Is that even possible?

And only about 2.1 cycles per byte (roughly 8.4 per iteration) for the write loop, reaching almost the theoretical maximum of 1600 Mbit/s (100 MHz, 8-bit, DDR)? Holy moly...
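The arithmetic behind these figures can be checked with a few lines. This is a sketch with hypothetical helper names; the 400 MHz core clock is inferred from the log above (35176768 cycles reported as 87941.9 us), not stated explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Average CPU cycles spent per loop iteration (one 32-bit word each). */
static double cycles_per_word(uint32_t cycles, uint32_t n_words)
{
    assert(n_words > 0u);
    return (double)cycles / (double)n_words;
}

/* Throughput in Mbit/s for n_bytes moved in the given number of cycles. */
static double mbit_per_s(uint32_t n_bytes, uint32_t cycles, uint32_t core_hz)
{
    assert(cycles > 0u && core_hz > 0u);
    double seconds = (double)cycles / (double)core_hz;
    return (double)n_bytes * 8.0 / seconds / 1e6;
}

/* Peak bus rate in Mbit/s: DDR transfers bus_bits on both clock edges. */
static double ddr_peak_mbit_per_s(uint32_t bus_hz, uint32_t bus_bits)
{
    return (double)bus_hz * 2.0 * (double)bus_bits / 1e6;
}
```

With the logged numbers, the write loop comes out at about 8.4 cycles per 32-bit word (2.1 per byte) and the read loop at about 10.4 per word (2.6 per byte). At 400 MHz, 8.4 cycles are 21 ns per word, which is very close to the 20 ns a 100 MHz 8-bit DDR bus needs for four bytes.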

What am I doing wrong?

I checked the HyperRAM contents, and everywhere I looked, I found the correct content.

LCE · ‎2023-02-16

So, next test was my actual application:

SAI -> DMA -> HyperRAM -> DMA -> Ethernet

And here it "failed":

25.6 Mbit/s (4 audio channels, 32-bit, sampling rate 200 kHz) -> (almost) no problems

but...

51.2 Mbit/s (8 channels) -> dropouts in audio samples, until the complete buffer management crashes!

So, thinking about that, this is no big surprise:

The SAI DMA constantly pushes data into the HyperRAM (every 5 µs at 200 kHz), so the ETH DMA doesn't really get enough time to fetch complete packets.

And the constant switching between reading and writing at completely different addresses takes even more time.
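The raw numbers for this setup can be reproduced with the small sketch below (helper names are hypothetical). It only computes the payload rate and the frame interval; it does not model the bus.

```c
#include <assert.h>
#include <stdint.h>

/* Raw audio payload rate in bit/s: channels * bits per sample * sample rate. */
static uint32_t audio_bit_rate(uint32_t channels, uint32_t bits, uint32_t fs_hz)
{
    assert(channels > 0u && bits > 0u && fs_hz > 0u);
    return channels * bits * fs_hz;
}

/* Time between two sample frames in nanoseconds. */
static uint32_t frame_period_ns(uint32_t fs_hz)
{
    assert(fs_hz > 0u);
    return 1000000000u / fs_hz;
}
```

Even with every sample written once and read once, 8 channels need only about 102 Mbit/s in total, far below the rates measured for long linear accesses. That suggests the access pattern (many short transfers at alternating addresses), rather than raw bandwidth, is what hurts here.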

Next step:

Let the SAI DMA write into internal SRAM, then copy each complete packet into HyperRAM, either via DMA (MEM2MEM) or, as it seems now, with a simple for loop.
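The "simple for loop" variant of that plan could look roughly like this. It is a sketch under assumptions: `copy_words`, `PKT_BYTES` and `PKT_WORDS` are illustrative names, the 1460-byte packet size (a typical TCP payload) is assumed, and the destination is marked `volatile` so that every store really goes out to the memory-mapped target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PKT_BYTES 1460u                 /* assumed packet payload size */
#define PKT_WORDS (PKT_BYTES / 4u)      /* 365 32-bit words, no remainder */

/* Copy n_words 32-bit words with a plain loop, e.g. from a staging buffer in
 * internal SRAM to the external RAM. */
static void copy_words(volatile uint32_t *dst, const uint32_t *src, size_t n_words)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n_words; i++) {
        dst[i] = src[i];
    }
}
```

Copying one whole packet per call keeps the accesses to the external RAM long and linear, instead of interleaving many small writes with the reads of the Ethernet DMA.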

LCE · ‎2023-02-16

Right now the results for 1460 bytes (TCP's payload maximum) are:

DMA MEM2MEM (best settings: burst = 16 beats, size = byte)

read from HyperRAM to AXI RAM: 12666 CPU cycles -> 31.7 us -> 368 Mbit/s
write HyperRAM from AXI RAM: 9965 CPU cycles -> 24.9 us -> 468 Mbit/s

for loop with 32-bit pointer:

read from HyperRAM to AXI RAM: 12251 CPU cycles -> 31.6 us -> 392 Mbit/s
write HyperRAM from AXI RAM: 3059 CPU cycles -> 7.6 us -> 1573 Mbit/s

Again, the simple for loop for writing is insanely fast - unless there's something I'm doing wrong...

I would never have expected this simple copy loop to take less time than the DMA.

Tesla DeLorean · ‎2023-02-16

I'm not super invested in this, but you probably want to look at the MPU and caching settings, and at how the core writes back or folds accesses.

These types of memory have a lot of latency when used for random access rather than page-level bursts.

DMA doesn't have write buffering, or caching, just the minimum number of transistors to accomplish the task.


LCE · ‎2023-02-16

Thanks for your input!

Actually, because I didn't want to worry about cache management for a start, all caches are OFF.

What I don't get is why the for loop is that fast: how can the write loop take only about 2 CPU cycles per byte?

Is that some "internal ARM core specialty"?

LCE · ‎2023-02-16

And here comes the problem, which I forgot to mention yesterday because it didn't happen then, although I had seen it before:

The HyperRAM speed depends heavily on the compilation result. I changed the code a little bit, in some place completely unrelated to HyperRAM, and after recompiling I get only:

writing: 490 Mbit/s
reading: 456 Mbit/s
compared to > 1000 Mbit/s yesterday.

What's going on there?

Okay, I'm checking the list files to see if I can get it back to the higher speed.