STM32H735 HyperBus speed byte vs word (32bit) vs QSPI
‎2023-02-13 2:27 AM
Hello,
I got my hands on a H735 Discovery Board which has an external "HyperRAM" (Infineon S70KL1281DABHI023) connected to OCTOSPI2.
I need to use it in memory-mapped mode, which is quite easy to set up via the HAL (I have just switched to register-level setup to learn how it works and to get better control).
First of all, it's working, clock is 100 MHz. Speed checked with the ARM's cycle counter.
Speed: when reading / writing big buffers (>1kB)
- 32 bit word access, read / write ~ 50 MB/s (400 Mbit/s)
- byte access, read / write ~ 8 MB/s (66 Mbit/s)
Okay, byte access with a simple for loop takes a few more cycles per byte, but the interesting thing is that with the same clock and a QUAD SPI RAM and byte access I got speeds up to almost 25 MB/s (200 Mbit/s).
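The original loops are not shown, so the following is only a sketch of the two access patterns being compared. The function names are hypothetical; `volatile` is used so the compiler performs exactly one load and one store per element, as written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-wise copy: one load and one store per byte. */
static void copy_bytes(volatile uint8_t *dst, const volatile uint8_t *src,
                       size_t len_bytes)
{
    for (size_t i = 0; i < len_bytes; i++) {
        dst[i] = src[i];
    }
}

/* Word-wise copy: one load and one store per 4 bytes.
 * Hypothetical helper; expects 4-byte-aligned pointers and a
 * length that is a multiple of 4. */
static void copy_words(volatile uint32_t *dst, const volatile uint32_t *src,
                       size_t len_bytes)
{
    assert(len_bytes % 4u == 0u);
    for (size_t i = 0; i < len_bytes / 4u; i++) {
        dst[i] = src[i];
    }
}
```

For the same buffer, the byte version issues four times as many accesses as the word version, so some slowdown is expected regardless of how the peripheral handles each access.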
Question:
Is that normal behavior of the peripheral, due to different byte / word / octal / quad handling inside it?
Or is there some setting I can tune for octal mode?
The one big difference I see in the settings between Hyper / quad is the "refresh rate" which the Hyper RAM needs (a high NCS every now and then to trigger its internal refresh). But even playing around with that didn't change much in byte mode.
- Labels: OctoSPI, QSPI, RAM, STM32H7 Series
‎2023-02-14 7:30 AM
Any ideas?
‎2023-02-16 8:06 AM
So, I've been playing with that HyperRam, and there are some amazing results and insights.
Still running "HyperRAM" (Infineon S70KL1281DABHI023) with 100 MHz.
I wrote a super simple test function, writing and reading the complete RAM with 32-bit pointers, writing index in one straight for loop, then another for loop with reading and comparing.
I use the cycle counter to measure the timing, and because I could not believe the (good) results, I checked also with the (1ms) SysTick.
uint32_t *pu32MemAddr = (uint32_t *)OCTOSPI2_BASE;

/* write complete HyperRAM */
u32TickStart = HAL_GetTick();
u32CycStart = DWT->CYCCNT;
for( i = 0; i < u32MaxLen; i++ )
{
    pu32MemAddr[i] = i;
}
u32Cycles = DWT->CYCCNT - u32CycStart;
u32Ticks = HAL_GetTick() - u32TickStart;

snip, some UART output...

/* read complete HyperRAM and compare */
pu32MemAddr = (uint32_t *)OCTOSPI2_BASE;
u32TickStart = HAL_GetTick();
u32CycStart = DWT->CYCCNT;
for( i = 0; i < u32MaxLen; i++ )
{
    u32Data = pu32MemAddr[i];
    if( u32Data != i ) u32Errors++;
}
u32Cycles = DWT->CYCCNT - u32CycStart;
u32Ticks = HAL_GetTick() - u32TickStart;
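A side note on the `DWT->CYCCNT - u32CycStart` idiom: the counter is 32 bits wide, and unsigned subtraction is done modulo 2^32, so the difference stays correct across one counter wrap. The sketch below shows this with plain integers. The helper names are hypothetical, and the 400 MHz core clock in the usage note is an assumption inferred from the cycles-to-microseconds ratio in the log output.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two 32-bit counter samples.
 * Unsigned arithmetic wraps modulo 2^32, so this is correct as long as
 * the counter wrapped at most once between the two samples. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Time in seconds until a free-running 32-bit cycle counter wraps. */
static double wrap_period_s(double cpu_hz)
{
    assert(cpu_hz > 0.0);
    return 4294967296.0 / cpu_hz;
}
```

With an assumed 400 MHz core clock, `wrap_period_s(400e6)` is about 10.7 s, so measurements of roughly 100 ms like the ones here are far from the wrap limit. The SysTick cross-check remains a reasonable sanity test.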
Here's the result (UART output):
****** OCTOSPI2 HyperRAM test ******
writing 0x00400000 = 4194304 32bit words ...
16777216 bytes written
35176768 CPU cycles -> 87941.9 us
-> 1526.21 Mbit/s
88 ms (ticks)
-> 1525.20 Mbit/s
reading and comparing 0x00400000 = 4194304 32bit words ...
16777216 bytes read
43689082 CPU cycles -> 109222.7 us
-> 1228.84 Mbit/s
110 ms (ticks)
-> 1220.16 Mbit/s
NULL errors
OctoSpi2HyTest() success, no errors
So... the reading for loop takes only about 2.6 CPU cycles per byte on average (about 10.4 cycles per 32-bit iteration)?
Is that even possible?
And the write loop takes about 2.1 cycles per byte (about 8.4 per iteration), reaching almost the theoretical maximum of 1600 Mbit/s (100 MHz, 8-bit, DDR)? Holy moly...
What am I doing wrong?
I checked the HyperRAM contents, and everywhere I looked, I found the correct content.
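To double-check the numbers from the log, here is the arithmetic as a small sketch. The helper names are hypothetical, and the 400 MHz core clock is an assumption derived from the log itself (35176768 cycles reported as 87941.9 us). Dividing the cycle counts by the 4194304 loop iterations gives about 8.4 cycles per written word and about 10.4 per read word. Dividing by the 16777216 bytes instead gives about 2.1 and 2.6.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in Mbit/s for a transfer of `bytes` that took `cycles`
 * CPU cycles at a core clock of `cpu_hz`. */
static double throughput_mbit_s(uint64_t bytes, uint64_t cycles, double cpu_hz)
{
    assert(cycles > 0u && cpu_hz > 0.0);
    double seconds = (double)cycles / cpu_hz;
    return (double)bytes * 8.0 / seconds / 1e6;
}

/* Average CPU cycles spent per loop iteration (one 32-bit word each). */
static double cycles_per_word(uint64_t cycles, uint64_t words)
{
    assert(words > 0u);
    return (double)cycles / (double)words;
}

/* Peak bus rate in Mbit/s: clock x bus width x 2 transfers per clock (DDR). */
static double ddr_peak_mbit_s(double bus_hz, unsigned bus_bits)
{
    return bus_hz * (double)bus_bits * 2.0 / 1e6;
}
```

Under these assumptions the logged write rate of about 1526 Mbit/s is roughly 95% of the 1600 Mbit/s peak of a 100 MHz, 8-bit DDR bus.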
‎2023-02-16 8:26 AM
So, next test was my actual application:
SAI -> DMA -> HyperRAM -> DMA -> Ethernet
And here it "failed":
- 25.6 Mbit/s (4 audio channels, 32-bit, sampling rate 200 kHz) -> (almost) no problems
but...
- 51.2 Mbit/s (8 channels) -> dropouts in audio samples, until the complete buffer management crashes!
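The two stream rates follow directly from channels x bits per sample x sampling rate. A minimal sketch of that calculation (helper names are hypothetical):

```c
#include <assert.h>

/* Raw audio stream rate in Mbit/s. */
static double sai_bitrate_mbit_s(unsigned channels, unsigned bits_per_sample,
                                 double sample_rate_hz)
{
    return (double)(channels * bits_per_sample) * sample_rate_hz / 1e6;
}

/* Time between two audio frames in microseconds. */
static double frame_period_us(double sample_rate_hz)
{
    assert(sample_rate_hz > 0.0);
    return 1e6 / sample_rate_hz;
}
```

For 4 and 8 channels of 32-bit samples at 200 kHz this gives 25.6 and 51.2 Mbit/s, with a new frame every 5 us.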
So, thinking about that, this is no big surprise:
the SAI DMA constantly pushes data into the HyperRAM (every 5 µs at 200 kHz), so the ETH DMA doesn't really have the time to get complete packets.
And the constant switching between reading and writing at completely different addresses takes even more time.
Next step:
let SAI DMA write into internal SRAM, then copy complete packet into HyperRAM, either via DMA (MEM2MEM), or as it seems now, with a simple for loop.
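A rough sketch of that staging idea, in plain C so the logic can be tried on a host: samples are collected in a buffer that would live in internal SRAM, and only a complete packet is copied to external RAM in one linear run of 32-bit writes. All names here are hypothetical. The 365-word packet size assumes the 1460-byte TCP payload mentioned later in the thread. A real implementation would need two staging buffers so the SAI DMA can keep filling one while the other is flushed, plus wrap-around handling for the external buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PACKET_WORDS 365u /* 1460 bytes / 4 (assumed packet size) */

typedef struct {
    uint32_t staging[PACKET_WORDS]; /* would be placed in internal SRAM */
    size_t fill;                    /* number of words currently staged */
    volatile uint32_t *ext;         /* next free word in external RAM */
} packet_stager_t;

static void stager_init(packet_stager_t *s, volatile uint32_t *ext_base)
{
    assert(s != NULL && ext_base != NULL);
    s->fill = 0;
    s->ext = ext_base;
}

/* Stage one 32-bit sample. Returns 1 if this call flushed a full packet
 * to external RAM, 0 otherwise. No wrap-around handling in this sketch. */
static int stager_push(packet_stager_t *s, uint32_t sample)
{
    assert(s != NULL);
    s->staging[s->fill++] = sample;
    if (s->fill < PACKET_WORDS) {
        return 0;
    }
    /* Linear burst of word writes to consecutive external addresses. */
    for (size_t i = 0; i < PACKET_WORDS; i++) {
        s->ext[i] = s->staging[i];
    }
    s->ext += PACKET_WORDS;
    s->fill = 0;
    return 1;
}
```

The point of the design is that the external RAM only ever sees long sequential writes instead of a small write every 5 us interleaved with Ethernet reads.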
‎2023-02-16 8:34 AM
Right now the results for 1460 bytes (TCP's payload maximum) are:
DMA MEM2MEM (best settings: burst = 16 beats, size = byte)
- read from HyperRAM to AXI RAM: 12666 CPU cycles -> 31.7 us -> 368 Mbit/s
- write HyperRAM from AXI RAM: 9965 CPU cycles -> 24.9 us -> 468 Mbit/s
for loop with 32-bit pointer:
- read from HyperRAM to AXI RAM: 12251 CPU cycles -> 31.6 us -> 392 Mbit/s
- write HyperRAM from AXI RAM: 3059 CPU cycles -> 7.6 us -> 1573 Mbit/s
Again, the simple for loop for writing is insanely fast - unless there's something I'm doing wrong...
I had never expected that this simple copying loop would take less time.
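To put these copy times in relation to the data stream, here is a back-of-the-envelope sketch. It assumes that each 1460-byte packet is written to HyperRAM once and read back once by a for loop, and it ignores any other bus traffic. The helper names are hypothetical.

```c
#include <assert.h>

/* Time in microseconds to accumulate one packet at a given stream rate.
 * bits / (Mbit/s) yields microseconds. */
static double packet_interval_us(unsigned payload_bytes, double stream_mbit_s)
{
    assert(stream_mbit_s > 0.0);
    return (double)payload_bytes * 8.0 / stream_mbit_s;
}

/* Share of each packet interval spent copying, in percent. */
static double bus_share_percent(double busy_us, double interval_us)
{
    assert(interval_us > 0.0);
    return 100.0 * busy_us / interval_us;
}
```

At 51.2 Mbit/s a packet fills up about every 228 us. With the for-loop figures above (7.6 us write plus 31.6 us read), the copies would occupy roughly 17% of that interval under these assumptions.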
‎2023-02-16 9:11 AM
I'm not super invested in this, but you probably want to look at the MPU and caching settings, and at how the core writes back or merges accesses.
These types of memory have a lot of latency when used for random access rather than page-level bursts.
DMA doesn't have write buffering or caching, just the minimum number of transistors to accomplish the task.
Up vote any posts that you find helpful, it shows what's working..
‎2023-02-16 9:16 AM
Thanks for your input!
Actually, because I didn't want to worry about cache management for a start, all caches are OFF.
What I don't get is why the for loop is that fast: how can the write loop take only about 2 CPU cycles per byte (roughly 8 per 32-bit iteration)?
Is that some "internal ARM core specialty"?
‎2023-02-16 11:21 PM
And here comes the problem, which I forgot to mention yesterday because it didn't happen then, although I had seen it before:
The HyperRAM speed depends heavily on the build. I changed the code a little in a place completely unrelated to the HyperRAM, and after recompiling I get only:
- writing: 490 Mbit/s
- reading: 456 Mbit/s
- compared to > 1000 Mbit/s yesterday.
What is going on there?
Okay, I'll check the list files to see if I can get it back to the higher speed.
