2021-04-22 10:57 AM
I am using the STM32F7508-DK with external Flash connected over QSPI. It is mapped to the 0x90000000~ region, which is cacheable by default, but for some reason it does not seem to benefit from the cache as much as on-chip SRAM or SDRAM connected through the FMC. I think the QSPI region is either seeing a smaller portion of the cache or getting kicked out of the cache more often, for a reason I do not know.
I will explain briefly why I think it is weird. I am running the following code:
int A[1024];                  // 4KB array; placed in SRAM, SDRAM, or Ext-Flash
volatile int a = 0;

for (int i = 0; i < 10000; ++i)
    for (int j = 0; j < 1024; j += 8)   // 8 ints = 32 bytes = one cache line
        a += A[j];
A is 4KB, and the STM32F7508-DK also has a 4KB D-cache, so the entire array A should fit nicely inside the D-cache.
What I am doing here is bringing the entire array into the cache (by touching every cache line with j += 8) and then accessing it repeatedly, 10,000 times, to amortize the cost of the initial cold misses. No matter whether A is in SRAM, SDRAM, or Ext-Flash, if this code hits the cache most of the time, the execution time should be the same for all three. I am using memory-mapped mode for both the SDRAM and the QSPI Flash.
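(For anyone who wants to reproduce this: a minimal sketch of the measurement using the CMSIS DWT cycle counter. This is illustrative only, not my exact project code, and it assumes SystemCoreClock has been set up by the usual clock init.)

#include "stm32f7xx.h"   // CMSIS device header: DWT, CoreDebug, SystemCoreClock

uint32_t time_loop_ms(void)   // A and a declared as above
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   // enable the trace block
    DWT->LAR    = 0xC5ACCE55;                         // unlock DWT (needed on some M7 parts)
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             // start the cycle counter

    uint32_t start = DWT->CYCCNT;
    for (int i = 0; i < 10000; ++i)
        for (int j = 0; j < 1024; j += 8)             // same loop as above
            a += A[j];
    return (DWT->CYCCNT - start) / (SystemCoreClock / 1000u);   // elapsed time in ms
}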
This is the resulting execution time:
(With D-cache enabled)
SRAM: 32ms
SDRAM: 32ms
QSPI Ext-Flash: 43ms
Just for reference, these are the numbers with the D-cache disabled:
SRAM: 33ms
SDRAM: 133ms
QSPI Ext-Flash: 602ms
As you can see, SRAM and SDRAM performance becomes identical with the cache on, which means the entire array A is residing in the cache for SRAM/SDRAM.
However, as you can also see, the QSPI Ext-Flash does not do as well. It is definitely benefiting from the cache (602ms down to 43ms), but it does not reach 32ms.
If the entire array A were brought into the cache, this should also go down to 32ms, so the only explanation is that it is not benefiting as much from the cache.
(Is the cache for QSPI limited to less than 4KB? Is some part of the QSPI memory space (0x90000000~0x90000400) not covered by the cache? Is something like a prefetcher kicking the data out of the cache?)
So my question is: why is the QSPI region not benefiting from the cache as much? I think the cache design is not publicly documented, and I searched a lot but could not make sense of this. If anyone has any idea why this might be happening, I would very much appreciate it. Thank you!
[EDIT: Changed the application to be read-only and changed the numbers correspondingly]
2021-04-22 12:20 PM
Check the MPU configuration for the QSPI region
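Something like this with the HAL, just as a sketch (the region number, size, and attributes below are placeholders, not values taken from your project):

MPU_Region_InitTypeDef mpu = {0};

HAL_MPU_Disable();
mpu.Enable           = MPU_REGION_ENABLE;
mpu.Number           = MPU_REGION_NUMBER0;            // placeholder region number
mpu.BaseAddress      = 0x90000000;
mpu.Size             = MPU_REGION_SIZE_16MB;           // cover the memory-mapped QSPI range
mpu.AccessPermission = MPU_REGION_FULL_ACCESS;
mpu.TypeExtField     = MPU_TEX_LEVEL0;
mpu.IsCacheable      = MPU_ACCESS_CACHEABLE;
mpu.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;      // TEX=0, C=1, B=0 -> write-through
mpu.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;       // shareable normal memory is not cached on the M7
mpu.DisableExec      = MPU_INSTRUCTION_ACCESS_ENABLE;
HAL_MPU_ConfigRegion(&mpu);
HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);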
2021-04-22 02:04 PM
I left the MPU at its default settings, and according to this: https://iq.direct/datasheets/STM32x7%20Cache%20Appnote.pdf the QSPI region has the cache enabled by default.
2021-04-22 02:37 PM
The F72x/73x had a newer M7 core and a larger cache.
Perhaps put the array in DTCMRAM; that way it's not consuming cache space and not contending on the AHB/APB buses.
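For example (the section name is just an example; it assumes your linker script places a .dtcm_data section into DTCM RAM at 0x20000000):

/* Keep the test array out of the cacheable memories entirely */
__attribute__((section(".dtcm_data"))) int A[1024];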
2021-04-22 09:41 PM
Update to this: I spent some time today with the logic analyzer to see what was going on with the QSPI Flash.
There was indeed a prefetcher involved. (If anyone's interested: when streaming data, it first fetches two cache lines, and on each access it prefetches the next cache line in the stream access pattern. So at a row boundary it unnecessarily prefetches one additional cache line that it won't use.) However, it was prefetching at most about 20% unnecessary data in my case, which should not cause the problem above (my array size is 1KB and the cache size is 4KB).
Since Flash memory-mapped mode is read-only, I wasn't sure when the data was getting evicted. (By probing the SDRAM and watching when dirty data gets written back, I might have had more luck understanding this behavior, but the SDRAM signals were not exposed for probing.)
Anyway, I confirmed that data that should already be in the cache kept getting fetched again, which means there are many more cache misses than I would expect.
At first, I thought QSPI only uses a small fraction of the cache (e.g., 1KB). However, this weird behavior only happens when mixing row and column accesses; if I am just streaming in one dimension, both SDRAM and Flash fully utilize the 4KB cache. So the only explanation is that the QSPI Flash data gets kicked out of the cache a lot, for some unknown reason other than the prefetcher...
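(To be concrete about what I mean by mixing row-column accesses, it is something along these lines; the 16x16 dimensions are only an illustration, not my exact test:)

int B[16][16];                        // 1KB 2-D array, placed in SDRAM or QSPI Flash

for (int r = 0; r < 16; ++r)          // row-major pass: sequential addresses ("1-dim streaming")
    for (int c = 0; c < 16; ++c)
        a += B[r][c];

for (int c = 0; c < 16; ++c)          // column-major pass: strided, jumps one 64-byte row per access
    for (int r = 0; r < 16; ++r)
        a += B[r][c];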
Maybe on a prefetcher misprediction it mass-evicts the data in the cache? I think that would be a bad design, so I don't know...