2020-02-17 03:10 PM
I am using STM32F7508-DK and comparing the memory performance of its internal SRAM vs. the external SDRAM connected to FMC (starting from 0xC000_0000).
When I turn on the cache, the performance behavior is unexpected. In particular, the external SDRAM is slower than expected and I cannot understand why.
I am simply running something like this:
for (i = 0; i < SIZE; i++)
arr[i]++;
where arr is placed either in internal SRAM or in external SDRAM.
From my understanding, the first 64 KB of the internal SRAM region starting at 0x2000_0000 (RAM_CPU) is somehow faster than the rest, so I am placing arr at 0x2001_0000.
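(For this kind of test, a plain pointer cast is enough to pin the array to that address; this is just an illustration:)
volatile uint32_t *arr = (volatile uint32_t *)0x20010000; /* internal SRAM, past the first 64 KB */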
Because the region starting at 0xC000_0000 is not cacheable by default, I am remapping the SDRAM to 0x6000_0000 with HAL_EnableFMCMemorySwapping().
From what I see in the datasheet, the regions at 0x2001_0000 and 0x6000_0000 have the same default cache policy (write-back, write-allocate), so I thought this should be fine.
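(For reference, the remap and cache-enable part boils down to something like the following sketch; the SYSCFG clock enable and the pointer name are just illustrative, and the SDRAM controller itself is assumed to be initialized already:)
__HAL_RCC_SYSCFG_CLK_ENABLE();       /* SYSCFG clock, needed to write the remap register */
HAL_EnableFMCMemorySwapping();       /* SDRAM bank is now visible at 0x6000_0000 */
SCB_EnableICache();
SCB_EnableDCache();
volatile uint32_t *sdram_arr = (volatile uint32_t *)0x60000000; /* cacheable (WBWA) by default */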
The graph below shows the results of my runs while varying the size of arr.
As you can see, all curves show a linear increase in execution time as the array size grows. This is expected: if the array size doubles, the execution time also doubles. Also, for both SRAM and SDRAM, the cache-ON case is faster, as expected.
However, SDRAM with the cache ON (orange line) shows a weird non-linear behavior. As you can see, doubling the array size increases the execution time by more than 2x, especially when the array is small. Once the array gets larger, it is mostly 2x.
Because of this non-linearity, SDRAM takes a bigger relative penalty as the array grows when the cache is ON.
The graph below shows the execution time of SDRAM divided by that of SRAM for various array sizes, with the cache both ON and OFF.
As you can see, when the cache is off, the ratio is roughly constant. With the cache on, the slowdown grows with the array size, eventually even exceeding the cache-off case.
From my understanding of how a cache works, this should not happen.
My processor has a 4-way, 4 KB D-cache with a 32 B cache line. With that, on a streaming access pattern one miss should be followed by 7 hits (the elements of arr are 4 B each), and the total execution time should be roughly
ARR_SIZE / 8 * (1*MISS_LATENCY + 7*HIT_LATENCY)
and the execution time ratio must be
(1*MISS_LATENCY_SDRAM + 7*HIT_LATENCY) / (1*MISS_LATENCY_SRAM + 7*HIT_LATENCY),
which is not a function of ARR_SIZE.
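(Spelling that back-of-the-envelope model out, just to show that the array size cancels in the ratio; the latencies are placeholders:)
/* Toy model: a 32 B cache line holds 8 uint32_t elements, so a streaming
   pass sees 1 miss followed by 7 hits per line. */
double expected_cycles(double arr_size, double miss, double hit)
{
    return arr_size / 8.0 * (1.0 * miss + 7.0 * hit);
}
/* expected_cycles(N, MISS_SDRAM, HIT) / expected_cycles(N, MISS_SRAM, HIT)
   = (MISS_SDRAM + 7*HIT) / (MISS_SRAM + 7*HIT)  -- independent of N */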
So why does my SDRAM become relatively slower as ARR_SIZE gets larger?
Is there some implementation-specific behavior in the FMC that makes it less cache-friendly than the internal SRAM? Or am I possibly configuring the cache incorrectly?
Any ideas or help appreciated.
Thank you!
2020-02-18 12:17 AM
As long as the array size is smaller than the cache size, you can expect the whole array to stay in the cache.
If the array is in initialized memory, or you are making repeated runs, that will affect the execution time.
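To rule that out, you could clean and invalidate the D-cache before each timed run, e.g. with the CMSIS helpers (sketch):
SCB_CleanInvalidateDCache();   /* write back dirty lines and start each run from a cold cache */
__DSB(); __ISB();              /* make sure the maintenance completes before timing */
/* ... timed loop here ... */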
2020-02-18 07:04 AM
I am initializing the array in the code, before the loop. So it is even weirder, because, as you said, when the array is smaller than the cache it should stay in the cache.
As you can see, when the array is smaller than the cache, SDRAM is still slower than SRAM (not equal, even though both should be served from the cache). Also, as the array grows, SDRAM gets relatively slower than SRAM.
2020-02-18 08:06 AM
Oh, I interpreted the second graph the wrong way. Strange indeed.
Is SIZE a const/#define or a runtime parameter? I've seen the compiler generate slightly different code when I change #define SIZE.
2020-02-19 07:45 PM
I checked and the generated code is not changing. This document: https://www.st.com/content/ccc/resource/technical/document/application_note/group0/bc/2d/f7/bd/fb/3f/48/47/DM00272912/files/DM00272912.pdf/jcr:content/translations/en.DM00272912.pdf hints at the existence of a "speculative prefetch feature". I wonder if this is currently on and causing the unpredictable behavior. Does anyone know if I can turn it off?