External flash on QSPI does not seem to benefit from the D-cache as much as SRAM or SDRAM
I am using an STM32F7508-DK with external flash connected over QSPI. It is mapped to the 0x90000000~ region, which is cacheable by default, but it does not seem to benefit from the cache as much as on-chip SRAM or SDRAM connected through the FMC. I suspect that QSPI accesses either see a smaller portion of the cache or get evicted from the cache more often, for a reason I do not know.
I will explain briefly why I think it is weird. I am running the following code:
int A[1024]; // placed in SRAM, SDRAM, or external flash
volatile int a = 0;
int i, j;
for (i = 0; i < 10000; ++i)
    for (j = 0; j < 1024; j += 8) // one read per 32-byte cache line
        a += A[j];
A is 4KB, and the STM32F7508-DK also has a 4KB D-cache. That means the entire A should fit inside the D-cache.
What I am doing here is bringing the entire array into the cache (by touching every cache line with j+=8) and then walking the array 10,000 times. The 10,000 passes amortize the cost of the initial cold misses. Whether A is in SRAM, SDRAM, or external flash, if this code is served from the cache most of the time, the execution time should be the same in all three cases. I am using memory-mapped mode for both SDRAM and QSPI flash.
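As a sanity check on the "every cache line" claim, here is a minimal host-side sketch of the arithmetic. It is not the code run on the board. It assumes a 32-byte D-cache line (the Cortex-M7 line size), a 4-byte int, and an array that starts on a cache-line boundary; the helper name lines_touched and the two constants are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 32-byte D-cache lines, as on the Cortex-M7. */
#define CACHE_LINE_BYTES 32u
/* 4KB D-cache, as stated in the post. */
#define DCACHE_BYTES 4096u

/* Hypothetical helper: counts the distinct cache lines touched by a strided
   walk over an int array whose first element is cache-line aligned. */
static size_t lines_touched(size_t n_elems, size_t stride_elems)
{
    size_t last_line = (size_t)-1; /* sentinel: no line seen yet */
    size_t count = 0;

    assert(stride_elems > 0); /* a zero stride would never terminate */

    for (size_t j = 0; j < n_elems; j += stride_elems) {
        size_t line = (j * sizeof(int)) / CACHE_LINE_BYTES;
        if (line != last_line) {
            ++count;
            last_line = line;
        }
    }
    return count;
}
```

Under those assumptions, lines_touched(1024, 8) comes out to 128, and 128 lines of 32 bytes is exactly DCACHE_BYTES, so one pass touches as many lines as the D-cache can hold.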
This is the resulting execution time:
(With D-cache enabled)
SRAM: 32ms
SDRAM: 32ms
QSPI Ext-Flash: 43ms
Just for reference, these are the numbers with the D-cache disabled:
SRAM: 33ms
SDRAM: 133ms
QSPI Ext-Flash: 602ms
As you can see, SRAM and SDRAM perform the same with the cache on. That suggests the entire A is resident in the cache for SRAM/SDRAM.
However, QSPI external flash performance is not as good. It is definitely benefitting from the cache (602ms down to 43ms), but it does not go down to 32ms.
If the entire A were held in the cache, this should also go down to 32ms. So the only explanation I can think of is that QSPI accesses are not benefitting as much from the cache.
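To put the gap in per-access terms, here is a rough back-of-the-envelope sketch in C using only the timings above. The function names are hypothetical, and the miss-fraction estimate rests on a loud assumption: that a cache miss to QSPI costs about as much as one access with the D-cache disabled. A real miss fetches a whole cache line, so treat the result as an order-of-magnitude figure only.

```c
#include <assert.h>

/* Loop shape from the post: 10,000 passes, 128 reads per pass
   (1024 ints, stride 8). */
#define OUTER_ITERS    10000.0
#define READS_PER_PASS 128.0

/* Extra time per read, in nanoseconds, between two measured totals (ms). */
static double extra_ns_per_access(double slow_ms, double fast_ms)
{
    return (slow_ms - fast_ms) * 1e6 / (OUTER_ITERS * READS_PER_PASS);
}

/* Rough estimate of the fraction of reads that miss, ASSUMING a miss costs
   about the same as an uncached access and a hit costs the baseline. */
static double implied_miss_fraction(double cached_ms, double baseline_ms,
                                    double uncached_ms)
{
    assert(uncached_ms > baseline_ms); /* avoid dividing by zero or less */
    return (cached_ms - baseline_ms) / (uncached_ms - baseline_ms);
}
```

With the numbers above, extra_ns_per_access(43, 32) is about 8.6 ns per read, and implied_miss_fraction(43, 32, 602) is about 0.019, i.e. on the order of 2% of the reads going out to QSPI under that assumption.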
(Is cache for QSPI limited to less than 4KB? Is some part of the QSPI memory space (0x90000000~0x90000400) not covered by cache? Is something like a prefetcher kicking out the data in the cache?)
So my questions are:
- Does anyone see the same effect on their board? Is this somehow specific to my QSPI configuration or cache configuration? (I mostly used the default setup and the BSP libraries.)
- Does anyone have an explanation for why this is happening?
I do not think the cache design is publicly documented in detail, so although I searched a lot, I could not make sense of this. If anyone has any idea why this might be happening, I would very much appreciate it. Thank you!
[EDIT: Changed the application to be read-only and changed the numbers correspondingly]