2021-04-22 10:57 AM
I am using the STM32F7508-DK with external Flash connected over QSPI. It is mapped to the 0x90000000~ region, which is cacheable by default, but for some reason it does not seem to benefit from the cache as much as on-chip SRAM or SDRAM connected through the FMC. I think the QSPI region is either seeing a smaller portion of the cache or getting kicked out of the cache more often, for a reason I do not know.
I will explain briefly why I think it is weird. I am running the following code:
int A[1024];                  // 4KB array; placed in SRAM, SDRAM, or Ext-Flash
volatile int a = 0;

for (int i = 0; i < 10000; ++i)
    for (int j = 0; j < 1024; j += 8)   // 8 ints = 32 bytes = one cache line
        a += A[j];
A is 4KB, and the STM32F7508-DK also has a 4KB D-cache, so the entire array A should fit nicely inside the D-cache.
What I am doing here is bringing the entire array into the cache (by touching every cache line with j += 8) and then accessing it repeatedly, 10,000 times, to amortize the cost of the initial cold misses. No matter whether A is in SRAM, SDRAM, or Ext-Flash, if this code hits the cache most of the time, the execution time should be the same for all three. I am using memory-mapped mode for both the SDRAM and the QSPI Flash.
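(For anyone who wants to reproduce this: a minimal sketch of the measurement using the CMSIS DWT cycle counter. This is illustrative only, not my exact project code, and it assumes SystemCoreClock has been set up by the usual clock init.)

#include "stm32f7xx.h"   // CMSIS device header: DWT, CoreDebug, SystemCoreClock

uint32_t time_loop_ms(void)   // A and a declared as above
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   // enable the trace block
    DWT->LAR    = 0xC5ACCE55;                         // unlock DWT (needed on some M7 parts)
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             // start the cycle counter

    uint32_t start = DWT->CYCCNT;
    for (int i = 0; i < 10000; ++i)
        for (int j = 0; j < 1024; j += 8)             // same loop as above
            a += A[j];
    return (DWT->CYCCNT - start) / (SystemCoreClock / 1000u);   // elapsed time in ms
}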
This is the resulting execution time:
(With D-cache enabled)
SRAM: 32ms
SDRAM: 32ms
QSPI Ext-Flash: 43ms
Just for reference, these are the numbers with the D-cache disabled:
SRAM: 33ms
SDRAM: 133ms
QSPI Ext-Flash: 602ms
As you can see, SRAM and SDRAM performance becomes identical with the cache on, which means the entire array A is residing in the cache for SRAM/SDRAM.
However, as you can also see, the QSPI Ext-Flash does not do as well. It is definitely benefiting from the cache (602ms down to 43ms), but it does not reach 32ms.
If the entire array A were brought into the cache, this should also go down to 32ms, so the only explanation is that it is not benefiting as much from the cache.
(Is the cache for QSPI limited to less than 4KB? Is some part of the QSPI memory space (0x90000000~0x90000400) not covered by the cache? Is something like a prefetcher kicking the data out of the cache?)
So my question is: why is the QSPI region not benefiting from the cache as much? I think the cache design is not publicly documented, and I searched a lot but could not make sense of this. If anyone has any idea why this might be happening, I would very much appreciate it. Thank you!
[EDIT: Changed the application to be read-only and changed the numbers correspondingly]
2021-04-22 12:20 PM
Check the MPU configuration for the QSPI region
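Something like this with the HAL, just as a sketch (the region number, size, and attributes below are placeholders, not values taken from your project):

MPU_Region_InitTypeDef mpu = {0};

HAL_MPU_Disable();
mpu.Enable           = MPU_REGION_ENABLE;
mpu.Number           = MPU_REGION_NUMBER0;            // placeholder region number
mpu.BaseAddress      = 0x90000000;
mpu.Size             = MPU_REGION_SIZE_16MB;           // cover the memory-mapped QSPI range
mpu.AccessPermission = MPU_REGION_FULL_ACCESS;
mpu.TypeExtField     = MPU_TEX_LEVEL0;
mpu.IsCacheable      = MPU_ACCESS_CACHEABLE;
mpu.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;      // TEX=0, C=1, B=0 -> write-through
mpu.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;       // shareable normal memory is not cached on the M7
mpu.DisableExec      = MPU_INSTRUCTION_ACCESS_ENABLE;
HAL_MPU_ConfigRegion(&mpu);
HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);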
2021-04-22 02:04 PM
I left the MPU at its default settings, and according to this: https://iq.direct/datasheets/STM32x7%20Cache%20Appnote.pdf the QSPI region has the cache enabled by default.
2021-04-22 02:37 PM
The F72x/73x had a newer M7 core and a larger cache.
Perhaps put the array in DTCMRAM; that way it's not consuming cache space and not contending on the AHB/APB buses.
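For example (the section name is just an example; it assumes your linker script places a .dtcm_data section into DTCM RAM at 0x20000000):

/* Keep the test array out of the cacheable memories entirely */
__attribute__((section(".dtcm_data"))) int A[1024];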
2021-04-22 09:41 PM
Update to this: I spent some time today with the logic analyzer to see what was going on with the QSPI Flash.
There was indeed a prefetcher involved. (If anyone's interested: when streaming data, it first fetches two cache lines, and on each access it prefetches the next cache line in the stream access pattern. So at a row boundary it unnecessarily prefetches one additional cache line that it won't use.) However, it was prefetching at most about 20% unnecessary data in my case, which should not cause the problem above (my array size is 1KB and the cache size is 4KB).
Since Flash memory-mapped mode is read-only, I wasn't sure when the data was getting evicted. (By probing the SDRAM and watching when dirty data gets written back, I might have had more luck understanding this behavior, but the SDRAM signals were not exposed for probing.)
Anyway, I confirmed that data that should already be in the cache kept getting fetched again, which means there are many more cache misses than I would expect.
At first, I thought QSPI only uses a small fraction of the cache (e.g., 1KB). However, this weird behavior only happens when mixing row and column accesses; if I am just streaming in one dimension, both SDRAM and Flash fully utilize the 4KB cache. So the only explanation is that the QSPI Flash data gets kicked out of the cache a lot, for some unknown reason other than the prefetcher...
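(To be concrete about what I mean by mixing row-column accesses, it is something along these lines; the 16x16 dimensions are only an illustration, not my exact test:)

int B[16][16];                        // 1KB 2-D array, placed in SDRAM or QSPI Flash

for (int r = 0; r < 16; ++r)          // row-major pass: sequential addresses ("1-dim streaming")
    for (int c = 0; c < 16; ++c)
        a += B[r][c];

for (int c = 0; c < 16; ++c)          // column-major pass: strided, jumps one 64-byte row per access
    for (int r = 0; r < 16; ++r)
        a += B[r][c];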
Maybe on a prefetcher misprediction it mass-evicts the data in the cache? I think that would be a bad design, so I don't know...