Weird cache write-back behavior for STM32F7508

KMaen · ‎2020-02-20

I am doing some characterization study for the cache behavior on STM32F7508. The result is hard to understand.

Below is a simple code that I use. I align the array A1 to the cacheline (aligned(32)), and it is placed on SRAM1 above TCM (0x20010000~ ) where the cache is by default WBWA.

Especially, A1 starts from 0x200109c0.

In my code, there is a function (void loop), and a commented out for loop that is functionally equivalent.

I have both versions because using one or the other changes the behavior (described below).

I compiled my code with -O0 (no optimizations).

#include <stdio.h>
#include "printf.h"
#include "clock.h"
#include "stdio.h"
 
#define SIZE (128)
 
__attribute__ ((aligned (32)))
uint32_t A1[SIZE] = {0};
 
void loop(unsigned val)
{
	for (int i = 0; i < SIZE; i++)
		A1[i] = val;
}
 
int main(void)
{
	/* Enable I-Cache */
	SCB_InvalidateICache();
	SCB_EnableICache();
	/* Enable D-Cache */
	SCB_InvalidateDCache();
	SCB_EnableDCache();
 
	HAL_Init();
	init_clock();
	UART_INIT();
 
	// Try toggling PG_6 (D2)
	__HAL_RCC_GPIOG_CLK_ENABLE();
	GPIO_InitTypeDef gpio_init_structure;
	gpio_init_structure.Pin   = GPIO_PIN_6;
	gpio_init_structure.Mode      = GPIO_MODE_OUTPUT_PP;
	gpio_init_structure.Pull      = GPIO_PULLDOWN;
	gpio_init_structure.Speed     = GPIO_SPEED_HIGH;
	HAL_GPIO_Init(GPIOG, &gpio_init_structure);
	HAL_GPIO_WritePin(GPIOG, GPIO_PIN_6, GPIO_PIN_RESET);
 
 
	printf("Start\r\n");
	loop(42);
	//for (int i = 0; i < SIZE; ++i)
	//	A1[i] = 42;
	SCB_CleanInvalidateDCache();
	gpioSet();
	loop(79);
	//for (int i = 0; i < SIZE; ++i)
	//	A1[i] = 79;
	gpioReset();
	SCB_InvalidateDCache();
	for (int i = 0; i < SIZE; ++i) {
		printf("%u ", A1[i]);
		if (i % 16 == 15)
			printf("\r\n");
	}
	printf("\r\n");
 
	return 0;
}

What I expect as an output:

Either 8 words being like

79 42 42 42 42 42 42 42 ...

or maybe because of a prefetcher (whose behavior I haven't fully understood yet),

42 42 42 42 42 42 42 42 ...

Reality: sometimes I see

42 42 42 42 42 42 42 42, or

79 79 79 79 79 79 79 79,

but never 79 42 42 42 42 42 42 42 42.

Also the result depends on whether I use the function call or the (commented out) loop.

I will explain why I think this is weird:

After writing 42 to the array, I clean and invalidate the cache.

So, the cache should not contain any data (or maybe the prefetcher filled some of the cache), and memory should contain 42.

Now when I try to write 79 from the second loop, one of the two must happen.

1) Cache miss - If the prefetcher did not fill the cacheline, it must be a cache miss because I invalidated the entire cache. Because it is WBWA, 79 will be written to the memory and the cacheline will be filled. The following 7 accesses will be a cache hit and will only update the cache and will not update the memory.

When this happens, after the second invalidate, the data in the cache will be deleted so it should print 79 42 42 42 42 42 42 42 (only the first write went to the memory because it was a miss).

2) Cache hit - I originally thought cache hit was impossible for aligned accessed, but if a prefetcher filled the cacheline (I'm not sure about how the prefetcher works), I guess it can be a hit. In this case, the code will only update the cache, so when invalidated, all the writes will be deleted and would print: 42 42 42 42 42 42 42 42.

However, the printed result is different depending on whether I use the function call or the (commented out loop). For expressing the combination as (before the first invalidate)-(before the second invalidate):

call-call: all 42 printed

for-call: all 79 printed

call-for: either 42 42 42 42 42 42 42 42 or 79 79 79 79 79 79 79 79 printed (mixed)

for-for: all 79 printed.

So first, it is weird that the behavior changes. Also, the result is deterministic.

(Maybe because the prefetcher changes its behavior?)

Also, what is more weird for me is that 79 42 42 42 42 42 42 42 never prints and instead 79 79 79 79 79 79 79 79 is printed! This is very unexpected from my understanding of WBWA.

I think this may be more understandable if the cache was no allocate on write. However, from the document it says default behavior for 0x20000000 region ins WBWA (I also explicitly tried setting the region into WBWA with the MPU and it was the same).

When I timed the second part after the first cache clean, for-for and call-for runs roughly the same; call-call was slightly faster then for-call.

So, (1) why is it not printing what is expected from WBWA, and (2) why is the result different each time when I use a (functionally equivalent) for loop or a call?

I confirmed by looking at the binary that no other memory access is generated, i.e., nothing is evicting the cache.

This is very confusing for me... Is my cache setting somehow messed up? Or is it because I am misunderstanding something or are there a prefetcher doing something that I do not understand?? Is there something like a write buffer hardware that aggregates the write and updates the entire cacheline at once??

I appreciate any thoughts or comments!

hs2 · ‎2020-02-20

Seems you got confused with Write Back vs. Write Thru. I think loop(79) just writes to the cache at the 1st place and invalidating the cache right after loop(79) will provide varying results because you didn‘t flush or clean the cache after loop(79).

KMaen · ‎2020-02-20

No I understand write back vs. write through. The reason I think it is weird is because the default cache behavior is WBWA (write-back with read and write allocate). On WBWA, the memory is updated on a cache miss write. loop(79)'s write should be a cache miss write because I invalidated the cache before the loop.

"Write-back with write and read allocate: on hits it writes to the cache setting dirty bit for the block, the main memory is not updated. On misses it updates the block in the main memory and brings the block to the cache." from (https://www.st.com/content/ccc/resource/technical/document/application_note/group0/bc/2d/f7/bd/fb/3f/48/47/DM00272912/files/DM00272912.pdf/jcr:content/translations/en.DM00272912.pdf#%5B%7B%22num%22%3A16%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D%2Cnull%2C725.66931%2Cnull%5D)

Edit: On different arm document, WBWA has a different behavior description: "For Write-Back, Write-Allocate stores that miss in the data cache, a linefill is started using either of the two linefill buffers. When the linefill data is returned from the external memory system, the data in the store buffer is merged into the linefill buffer and subsequently written into the cache." From this description it seems like WBWA does not write to the memory directly on a write miss. Now I am getting more confused..

KMaen · ‎2020-02-20

Also, I cannot understand why the behavior differs when only the code changed and the data access pattern is essentially the same.

hs2 · ‎2020-02-21

I try to keep things simple. Hence I don‘t consider processor cache as deterministic memory. It usually isn‘t. At least in a system with interrupts, multiple tasks etc.

I just don‘t care and rely on the processor doing things right. Dealing with cache behavior is hard work and a corner case e.g. doing low-level HW driver programming.

Again for me writing to a WB cached memory location is a write to the cache at the 1st place. If I need to ensure the data is really stored in RAM (e.g. to be read by a DMA later on) I ‘d flush the cache of the touched address range afterwards (taking care about the cacheline granularity resp. alignment) and use a full memory and instruction barrier before continuing the code e.g. starting the DMA.. (mind the prefetcher !)

You see there are a number of constraints you MUST follow when doing manual cache management! Otherwise anything might happen including nasty, non obvious bugs or data corruptions. Therefore it’s sometimes better to simply avoid caching at all e.g. by using non-cached memory for DMAing if there is no big benefit like doing heavy calculations directly on DMA’d data.

KMaen · ‎2020-02-21

I am not trying to manually control the cache but I am doing this because I figured that the performance is unexpectedly slow in some cases (https://community.st.com/s/question/0D50X0000C8gKSkSQM/unexpected-performance-when-enabling-cache-with-sram-external-sdram). I tried to track down the reason and came all the way to here. My theory is sometimes cache was getting more miss than it should be and that was why I was inspecting this. As you can see in my experiment, changing the function call to the equivalent loop is suddenly dramatically changing the cache behavior, and I have no idea why.

Because I am inspecting this for performance reason, turning off cache, etc., is not something I can consider.

berendi · ‎2020-02-21

You have empirically proved that the first of the two conflicting explanations is wrong.
If you refactor a piece of inline code to a function call, it will definitely change the data access pattern. A couple of registers are pushed to the stack, possibly touching two cache lines. Unoptimized code uses the stack a lot, unoptimized function calls even more.

KMaen · ‎2020-02-24

For 2, I checked the assembly code and the function call does not push anything to the stack (because it is small). From what I know ARM does not automatically push anything to the stack on function call unless explicitly pushed with a push instruction, right? (c.f., some TI chips uses the stack to store the return address by default, always using the stack on a function call; I'm not aware of any such behavior for ARM).

Also if the function call is the problem than it is even more weird, because from my result it seems like cache misses occur when there is no function call!! (for-for case is printing all 79)

berendi · ‎2020-02-27

> I checked the assembly code and the function call does not push anything to the stack (because it is small).

It is the called function that uses the stack.

https://godbolt.org/z/r6XbqW

With my abridged test code, it pushes one register, then allocates 20 more bytes to store just two variables.

KMaen · ‎2020-02-27

Hmm... I checked the assembly I generate and it does not push or use stack. The resulting code is quite different, even if it is -O0.

I am using clang instead of gcc, and maybe that is the difference. Still weird... (is clang -O0 somehow much more optimized than gcc -O0?).