Weird cache behavior in SDRAM+FMC. Is the prefetcher doing something weird?

KMaen
Associate III

I was playing with SDRAM and found a very strange cache behavior. Can anyone help me understand what is happening?

I am using the STM32F7508-DK board. The SDRAM is connected through the FMC.

I am basically going over a small 2-D tile inside a 2-D array. The tile is 4*8*8 bytes, or 8 cache lines.

I am accessing every word (4 bytes) inside the tile, 64 elements in total.

The tile is much smaller than the L1 cache size (4KB), so no eviction should occur, and the code should experience 8 cache misses (cold misses) + 56 cache hits.
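To make that expectation concrete, here is the arithmetic as a small host-side C sketch (not code from my project, just the numbers above: 32-byte cache lines and an 8x8 tile of 4-byte words):

#include <stdio.h>
#include <stdint.h>

#define CACHE_LINE 32u   /* Cortex-M7 D-cache line size in bytes */
#define TILE       8u    /* 8x8 tile of 4-byte words             */

int main(void)
{
        unsigned tile_bytes = TILE * TILE * sizeof(uint32_t);  /* 256 bytes   */
        unsigned tile_lines = tile_bytes / CACHE_LINE;         /* 8 lines     */
        unsigned accesses   = TILE * TILE;                     /* 64 accesses */

        /* One cold miss the first time each line is touched; the rest should hit. */
        printf("tile = %u bytes = %u lines -> expect %u misses + %u hits\n",
               tile_bytes, tile_lines, tile_lines, accesses - tile_lines);
        return 0;
}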

This is exactly what I see if I visit the memory consecutively.

If I access with a stride, it works as expected when the total array size is small, but it suddenly stops working as expected with a larger array (even though I still only access the same-sized tile!).

Below is my simplified code:

#define ROW_SIZE 8
#define COL_SIZE 128 // 256
#define TILE 8
slow_mem uint32_t A1[ROW_SIZE][COL_SIZE];
 
int main(void)
{
        init();
        HAL_EnableFMCMemorySwapping();
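        /* First pass: row-major walk over the 8x8 tile; the GPIO pulse around
           each access makes the hit/miss latency visible on a scope. */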
        for (int i = 0; i < TILE; ++i) {
                for (int j = 0; j < TILE; ++j) {
                        gpioSet();
                        A1[i][j]++;
                        gpioReset();
                }
        }
        SCB_CleanInvalidateDCache();
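        /* Second pass: column-major walk over the same tile, starting from a
           cleaned and invalidated D-cache so every line is cold again. */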
        for (int i = 0; i < TILE; ++i) {
                for (int j = 0; j < TILE; ++j) {
                        gpioSet();
                        A1[j][i]++;
                        gpioReset();
                }
        }
 
        return 0;
}

slow_mem places the A1 array in SDRAM. The SDRAM is remapped to 0x60000000 so that it is cacheable; the cache is enabled and SDRAM RBURST is turned off.
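(For completeness, slow_mem is just a placement macro; a minimal sketch of how it could be defined is below. The section name .sdram_data and the linker-script mapping are placeholders, not the exact names from my project.)

/* Hypothetical placement macro: put the variable in a section that the linker
 * script maps to the FMC SDRAM bank, which HAL_EnableFMCMemorySwapping()
 * remaps to 0x60000000 so that the region is cacheable. */
#define slow_mem __attribute__((section(".sdram_data")))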

As you can see, I only ever access A[0][0] -- A[7][7].

The first loop accesses in the order of A[0][0]->A[0][1]->... and the second loop in A[0][0]->A[1][0]->...

Because the cache line size is 32B, each row of the tile (e.g., A[i][0] -- A[i][7]) fits in exactly one cache line.

Again, according to what I learned in my architecture class, both loops should see 8 cache misses and 56 cache hits, and this is indeed what I see when COL_SIZE=128.

Figure 1. Memory access for the first loop, with COL_SIZE=128, 256.

As expected, accessing in the order of A[0][0]->A[0][1]->... results in a cache miss (slightly wider pulse, red) followed by 7 consecutive cache hits (narrower pulses, blue).

Figure 2. Memory access for the second loop, with COL_SIZE=128.

Again, as expected, accessing in the order of A[0][0]->A[1][0]->... results in 8 consecutive cache misses (A[0][0] -- A[7][0]) followed by 56 consecutive cache hits. Just like the textbook says!

However, when I increase the COL_SIZE, the result does not make sense anymore.

Figure 3. Memory access for the second loop, with COL_SIZE=256

Now there is no clear cache hit/miss pattern anymore, and each pulse is even wider than the cache misses in Figures 1 and 2. For reference, a cache miss in Figure 1 was ~0.2us, a cache hit was ~0.1us, and these pulses in Figure 3 are ~0.38us.

Just to reiterate, I only ever access A[0][0]--A[7][7]! Why would COL_SIZE matter at all, whether it is 128 or 256? Also, what is the device doing in Figure 3? Why is each pulse even wider than a cache miss? This behavior is very interesting and very counter-intuitive to me.

I am digging into this behavior because it leads to a slowdown whenever I do column-major accesses on a large array stored in SDRAM, so it has a real performance implication. Also, I have become somewhat obsessed with this problem, so I really want to know the answer :).

Changing COL_SIZE from 128 to 256 makes A1 larger than the L1 cache size. I wonder if that fact is somehow relevant (still, I only access a tiny tile of it, so I don't see how it could matter).
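One way to see why the total array size could matter even for a tiny tile is to look at which cache set each A[i][0] falls into. Below is a minimal host-side sketch, assuming the Cortex-M7 L1 D-cache geometry on this part (4KB, 4-way set-associative, 32-byte lines, i.e. 32 sets); the geometry is my assumption, not something I have verified from the trace:

#include <stdio.h>
#include <stdint.h>

/* Assumed D-cache geometry: 4KB, 4-way, 32-byte lines -> 32 sets. */
#define LINE_SIZE 32u
#define NUM_SETS  32u

int main(void)
{
        unsigned col_sizes[] = { 128u, 256u };
        for (unsigned k = 0; k < 2; ++k) {
                unsigned row_bytes = col_sizes[k] * sizeof(uint32_t); /* stride from A[i][0] to A[i+1][0] */
                printf("COL_SIZE=%u (row stride %u bytes):\n", col_sizes[k], row_bytes);
                for (unsigned i = 0; i < 8; ++i) {
                        /* Offset of A[i][0] from A[0][0]; the array base only shifts
                           every set index by the same constant. */
                        uint32_t offset = i * row_bytes;
                        printf("  A[%u][0] -> set %u\n", i,
                               (unsigned)((offset / LINE_SIZE) % NUM_SETS));
                }
        }
        return 0;
}

Under that assumed geometry, COL_SIZE=128 (512-byte row stride) spreads the eight tile rows over two sets, four lines each, which exactly fits the four ways; COL_SIZE=256 (1024-byte row stride) puts all eight rows into the same set, so at most four of them can be resident at once even though the tile itself is tiny. I am not sure this is the whole story, but it would at least make the array size relevant.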

Another theory is that the hardware prefetcher is somehow messing things up. However, I still do not understand how a prefetcher would degrade performance in such a simple and common case of accessing a tile of a matrix column-major. Also, I thought prefetchers do not directly evict the cache (and instead write data into a dedicated hardware structure such as a stream buffer).

Can anyone help me explain this phenomenon? Can anyone confirm or deny if it is the prefetcher's problem?

Any discussion is highly appreciated.

Thank you!

1 REPLY
KMaen
Associate III

Update to this: I spent some time today with a logic analyzer to see what was going on with the QSPI flash. (The SDRAM's signals were not exposed for probing, and the QSPI flash showed a similar cache problem, so I looked at the QSPI flash instead.)

There was indeed a prefetcher involved. (If anyone is interested: when streaming data, it first fetches 2 cache lines, and on each access it prefetches the next cache line in the stream access pattern. So at a row boundary, it unnecessarily prefetches one additional cache line that it will not use.) However, it was prefetching at most 20% unnecessary data in my case, which should not cause the problem above (my array size is 1KB and the cache size is 4KB).

Since flash memory-mapped mode is read-only, I wasn't sure when the data was getting evicted (by probing the SDRAM and seeing when dirty data gets evicted, maybe I'll have more luck understanding this behavior).

Anyway, I confirmed that data that should already be in the cache kept getting fetched again, which means there are many more cache misses than I would expect.

At first, I thought QSPI only gets a small fraction of the cache (e.g., 1KB). However, this weird behavior only happens when I mix row and column accesses. If I am just streaming in one dimension, both SDRAM and flash fully utilize the 4KB cache. So the only explanation is that the QSPI flash data gets kicked out of the cache a lot, for some unknown reason other than the prefetcher...

Maybe on a prefetcher misprediction it mass-evicts data in the cache? I think that would be a bad design, so I don't know...
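For anyone who wants to reproduce this: the actual D-cache geometry (sets, ways, line size) can be read back from the standard ARMv7-M cache ID registers, which is a quick way to check whether plain conflict misses between addresses that share a set could explain data getting kicked out, independent of any prefetcher. A minimal on-target sketch (CMSIS register names; printf assumed to be retargeted to a UART or SWO):

#include <stdio.h>
#include "stm32f7xx.h"   /* device header; pulls in core_cm7.h (SCB, __DSB) */

/* Read the level-1 data-cache geometry from CSSELR/CCSIDR (ARMv7-M). */
static void print_dcache_geometry(void)
{
        SCB->CSSELR = 0u;                 /* select the L1 data cache */
        __DSB();
        uint32_t ccsidr = SCB->CCSIDR;

        uint32_t line_bytes = (1u << ((ccsidr & 0x7u) + 2u)) * 4u;  /* bits [2:0]   */
        uint32_t ways       = ((ccsidr >> 3) & 0x3FFu) + 1u;        /* bits [12:3]  */
        uint32_t sets       = ((ccsidr >> 13) & 0x7FFFu) + 1u;      /* bits [27:13] */

        printf("D-cache: %u sets x %u ways x %u-byte lines = %u bytes\n",
               (unsigned)sets, (unsigned)ways, (unsigned)line_bytes,
               (unsigned)(sets * ways * line_bytes));
}

If this reports 32 sets of 32-byte lines with 4 ways, then any group of addresses whose offsets differ by a multiple of 1KB competes for only 4 lines, which could be enough to re-fetch data that "should" still be cached when mixing row and column accesses.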