Associate III

Question

Weird cache write-back behavior for STM32F7508

February 20, 2020
15 replies
5195 views

I am running a characterization study of the cache behavior on the STM32F7508, and the results are hard to understand.

Below is the simple code I use. The array A1 is aligned to the cache line (aligned(32)) and placed in SRAM1 above the TCM (0x20010000 onward), where the cache policy is WBWA by default.

Specifically, A1 starts at 0x200109c0.
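As a quick sanity check on the alignment claim, here is a minimal host-side sketch of the cache-line arithmetic. It assumes a 32-byte D-cache line (matching the aligned(32) attribute); the helper names are made up for illustration and are not part of any ST or CMSIS API.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u  /* assumed D-cache line size, matching aligned(32) */

/* Hypothetical helper: nonzero if addr sits on a cache-line boundary. */
static int is_line_aligned(uint32_t addr)
{
	return (addr % CACHE_LINE_BYTES) == 0u;
}

/* Hypothetical helper: how many uint32_t words fit in one cache line. */
static uint32_t words_per_line(void)
{
	return CACHE_LINE_BYTES / (uint32_t)sizeof(uint32_t);
}

/* Hypothetical helper: number of cache lines touched by [addr, addr + bytes). */
static uint32_t lines_spanned(uint32_t addr, uint32_t bytes)
{
	/* The range must be non-empty and must not wrap around the address space. */
	assert(bytes > 0u && bytes - 1u <= UINT32_MAX - addr);

	uint32_t first = addr / CACHE_LINE_BYTES;
	uint32_t last = (addr + (bytes - 1u)) / CACHE_LINE_BYTES;
	return last - first + 1u;
}
```

With the numbers from this post, 0x200109c0 is line-aligned, one line holds 8 words, and the 128-word array covers exactly 16 lines, which is why the output is discussed in groups of 8 words.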

My code has a function (void loop) and a commented-out for loop that is functionally equivalent.

I have both versions because using one or the other changes the behavior (described below).

I compiled my code with -O0 (no optimizations).

#include <stdint.h>
#include <stdio.h>
#include "printf.h"
#include "clock.h"
 
#define SIZE (128)
 
__attribute__ ((aligned (32)))
uint32_t A1[SIZE] = {0};
 
void loop(unsigned val)
{
	for (int i = 0; i < SIZE; i++)
		A1[i] = val;
}
 
int main(void)
{
	/* Enable I-Cache */
	SCB_InvalidateICache();
	SCB_EnableICache();
	/* Enable D-Cache */
	SCB_InvalidateDCache();
	SCB_EnableDCache();
 
	HAL_Init();
	init_clock();
	UART_INIT();
 
	// Try toggling PG_6 (D2)
	__HAL_RCC_GPIOG_CLK_ENABLE();
	GPIO_InitTypeDef gpio_init_structure;
	gpio_init_structure.Pin = GPIO_PIN_6;
	gpio_init_structure.Mode = GPIO_MODE_OUTPUT_PP;
	gpio_init_structure.Pull = GPIO_PULLDOWN;
	gpio_init_structure.Speed = GPIO_SPEED_HIGH;
	HAL_GPIO_Init(GPIOG, &gpio_init_structure);
	HAL_GPIO_WritePin(GPIOG, GPIO_PIN_6, GPIO_PIN_RESET);
 
 
	printf("Start\r\n");
	loop(42);
	//for (int i = 0; i < SIZE; ++i)
	//	A1[i] = 42;
	SCB_CleanInvalidateDCache();
	gpioSet();
	loop(79);
	//for (int i = 0; i < SIZE; ++i)
	//	A1[i] = 79;
	gpioReset();
	SCB_InvalidateDCache();
	for (int i = 0; i < SIZE; ++i) {
		printf("%u ", (unsigned)A1[i]);
		if (i % 16 == 15)
			printf("\r\n");
	}
	printf("\r\n");
 
	return 0;
}

What I expect as an output:

Either 8 words being like

79 42 42 42 42 42 42 42 ...

or maybe because of a prefetcher (whose behavior I haven't fully understood yet),

42 42 42 42 42 42 42 42 ...

Reality: sometimes I see

42 42 42 42 42 42 42 42, or

79 79 79 79 79 79 79 79,

but never 79 42 42 42 42 42 42 42.

Also the result depends on whether I use the function call or the (commented out) loop.

I will explain why I think this is weird:

After writing 42 to the array, I clean and invalidate the cache.

So, the cache should not contain any data (or maybe the prefetcher filled some of the cache), and memory should contain 42.

Now, when the second loop writes 79, one of two things must happen.

1) Cache miss - If the prefetcher did not fill the cacheline, it must be a cache miss because I invalidated the entire cache. Because it is WBWA, 79 will be written to the memory and the cacheline will be filled. The following 7 accesses will be a cache hit and will only update the cache and will not update the memory.

When this happens, after the second invalidate, the data in the cache will be deleted so it should print 79 42 42 42 42 42 42 42 (only the first write went to the memory because it was a miss).

2) Cache hit - I originally thought a cache hit was impossible for aligned accesses, but if a prefetcher filled the cacheline (I'm not sure how the prefetcher works), I guess it can be a hit. In this case, the code will only update the cache, so when the cache is invalidated, all the writes will be discarded and it would print: 42 42 42 42 42 42 42 42.

However, the printed result differs depending on whether I use the function call or the (commented-out) loop. Writing each combination as (code before the first invalidate)-(code before the second invalidate):

call-call: all 42 printed

for-call: all 79 printed

call-for: either 42 42 42 42 42 42 42 42 or 79 79 79 79 79 79 79 79 printed (mixed)

for-for: all 79 printed.

So first, it is weird that the behavior changes. Also, the result is deterministic.

(Maybe because the prefetcher changes its behavior?)

Also, what is even weirder to me is that 79 42 42 42 42 42 42 42 never prints, and instead 79 79 79 79 79 79 79 79 is printed! This is very unexpected given my understanding of WBWA.

I think this would be more understandable if the cache did not allocate on write. However, the document says the default behavior for the 0x20000000 region is WBWA (I also tried explicitly setting the region to WBWA with the MPU, and the result was the same).

When I timed the second part, after the first cache clean, for-for and call-for ran in roughly the same time; call-call was slightly faster than for-call.

So, (1) why is it not printing what is expected from WBWA, and (2) why is the result different each time when I use a (functionally equivalent) for loop or a call?

I confirmed by looking at the binary that no other memory access is generated, i.e., nothing is evicting the cache.

This is very confusing to me... Is my cache setting somehow messed up? Am I misunderstanding something, or is there a prefetcher doing something that I do not understand? Is there something like a hardware write buffer that aggregates the writes and updates the entire cacheline at once?

I appreciate any thoughts or comments!


hs2

Visitor II

It seems you have confused write-back with write-through. I think loop(79) just writes to the cache in the first place, and invalidating the cache right after loop(79) will give varying results because you didn't flush or clean the cache after loop(79).

KMaen (Author)

Associate III

No, I understand write-back vs. write-through. I think it is weird because the default cache policy is WBWA (write-back with read and write allocate). Under WBWA, memory is updated on a write that misses the cache. The writes in loop(79) should miss because I invalidated the cache before the loop.

"Write-back with write and read allocate: on hits it writes to the cache setting dirty bit for the block, the main memory is not updated. On misses it updates the block in the main memory and brings the block to the cache." from (https://www.st.com/content/ccc/resource/technical/document/application_note/group0/bc/2d/f7/bd/fb/3f/48/47/DM00272912/files/DM00272912.pdf/jcr:content/translations/en.DM00272912.pdf#%5B%7B%22num%22%3A16%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D%2Cnull%2C725.66931%2Cnull%5D)

Edit: A different Arm document describes WBWA differently: "For Write-Back, Write-Allocate stores that miss in the data cache, a linefill is started using either of the two linefill buffers. When the linefill data is returned from the external memory system, the data in the store buffer is merged into the linefill buffer and subsequently written into the cache." From this description, it seems that WBWA does not write to memory directly on a write miss. Now I am getting more confused.

hs2

Visitor II

I try to keep things simple. Hence I don't consider the processor cache to be deterministic memory. It usually isn't, at least in a system with interrupts, multiple tasks, etc.

I just don't care and rely on the processor doing things right. Dealing with cache behavior is hard work and a corner case, e.g. when doing low-level HW driver programming.

Again, for me, writing to a WB-cached memory location is a write to the cache in the first place. If I need to ensure the data is really stored in RAM (e.g. to be read by a DMA later on), I'd flush the cache for the touched address range afterwards (taking care of cacheline granularity and alignment) and use a full memory and instruction barrier before continuing, e.g. before starting the DMA. (Mind the prefetcher!)
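The "cacheline granularity and alignment" step can be sketched as below. This is a host-side illustration only, assuming a 32-byte line; struct line_range and line_range_for() are hypothetical names, not CMSIS functions. On the target, the resulting start and size would be the arguments to a by-address clean such as the CMSIS-Core SCB_CleanDCache_by_Addr() (its exact parameter types differ between CMSIS versions).

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u  /* assumed Cortex-M7 D-cache line size */

/* Hypothetical type: a byte range widened to whole cache lines. */
struct line_range {
	uint32_t start;  /* rounded down to a line boundary */
	uint32_t size;   /* multiple of CACHE_LINE_BYTES */
};

/* Widen [addr, addr + bytes) so that it starts and ends on line boundaries. */
static struct line_range line_range_for(uint32_t addr, uint32_t bytes)
{
	assert(bytes > 0u);

	struct line_range r;
	r.start = addr & ~(CACHE_LINE_BYTES - 1u);

	/* Round the end up in 64-bit arithmetic so the addition cannot wrap. */
	uint64_t end = ((uint64_t)addr + bytes + CACHE_LINE_BYTES - 1u)
	               & ~(uint64_t)(CACHE_LINE_BYTES - 1u);
	assert(end - r.start <= UINT32_MAX);

	r.size = (uint32_t)(end - r.start);
	return r;
}
```

For example, a 40-byte buffer at 0x200109c4 widens to 64 bytes starting at 0x200109c0. Any unrelated variables that share those two lines are cleaned as well, which is harmless for a clean but matters for an invalidate.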

You see, there are a number of constraints you MUST follow when doing manual cache management! Otherwise anything might happen, including nasty, non-obvious bugs or data corruption. Therefore it’s sometimes better to simply avoid caching altogether, e.g. by using non-cached memory for DMA, if caching brings no big benefit such as doing heavy calculations directly on DMA’d data.

KMaen (Author)

Associate III

I am not trying to control the cache manually. I am doing this because I found that performance is unexpectedly slow in some cases (https://community.st.com/s/question/0D50X0000C8gKSkSQM/unexpected-performance-when-enabling-cache-with-sram-external-sdram). I tried to track down the reason and ended up here. My theory is that the cache was sometimes missing more often than it should, which is why I was inspecting this. As you can see in my experiment, changing the function call to the equivalent loop suddenly and dramatically changes the cache behavior, and I have no idea why.

Because I am inspecting this for performance reason, turning off cache, etc., is not something I can consider.

berendi

Principal

Calling or not calling a function can rather affect the branch prediction at the end of the loop.

Piranha

Principal III

Using SCB_InvalidateDCache() is dangerous and can corrupt internal variables of the C runtime, C libraries, HAL or other code. That is most likely not the case here, because of the preceding call to SCB_CleanInvalidateDCache() and the limited, controlled code between the two calls, but be careful with whole-cache operations.

You have misunderstood the write-back vs write-through.

http://www.emcu.it/STM32F7/Slide/Cache2.png

http://www.emcu.it/STM32F7/STM32F7xx.html

AN4839 page 5:

Write-back: the cache does not write the cache contents to the memory until a clean operation is done.
Write-through: triggers a write to the memory as soon as the contents on the cache line are written to. This is safer for the data coherency, but it requires more bus accesses. In practice, the write to the memory is done in the background and has a little effect unless the same cache set is being accessed repeatedly and very quickly. It is always a tradeoff.

AN4839 page 4:

If all the lines are allocated, the cache controller runs the line eviction process, where a line is selected (depending on replacement algorithm) cleaned/invalidated, and reallocated. The data cache and Instruction cache implement a pseudo-random replacement algorithm.

So output kind-of "should" be 42 42 42 42 42 42 42 42, but output of 79 79 79 79 79 79 79 79 shows that eviction happened to that line before SCB_InvalidateDCache(). The dependency of a result on a function call vs local loop also "confirms" that.
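To make the two observed outcomes concrete, here is a toy host-side model of a single cache line. It is a sketch under stated assumptions, not the real Cortex-M7 hardware: a store that misses triggers a linefill and is then merged into the cached line (as in the Arm description quoted earlier in the thread), and an eviction is modeled as a clean of the dirty line. All names (toy_line, toy_store, toy_experiment, ...) are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WORDS_PER_LINE 8  /* 32-byte line / 4-byte words */

/* Toy model of ONE cache line and the memory behind it. */
struct toy_line {
	uint32_t mem[WORDS_PER_LINE];   /* backing SRAM */
	uint32_t data[WORDS_PER_LINE];  /* cached copy */
	int valid;
	int dirty;
};

/* Write-back, write-allocate store: a miss fills the line from memory,
 * then the store is merged into the cache. Memory itself is not written. */
static void toy_store(struct toy_line *l, int idx, uint32_t val)
{
	assert(idx >= 0 && idx < WORDS_PER_LINE);
	if (!l->valid) {
		memcpy(l->data, l->mem, sizeof l->data);  /* linefill */
		l->valid = 1;
		l->dirty = 0;
	}
	l->data[idx] = val;
	l->dirty = 1;
}

/* Clean: write a dirty line back to memory (an eviction does the same). */
static void toy_clean(struct toy_line *l)
{
	if (l->valid && l->dirty) {
		memcpy(l->mem, l->data, sizeof l->mem);
		l->dirty = 0;
	}
}

/* Invalidate: drop the line without writing it back. */
static void toy_invalidate(struct toy_line *l)
{
	l->valid = 0;
	l->dirty = 0;
}

/* Replays the experiment on one line. If evicted is nonzero, the line is
 * written back before the final invalidate. out receives what memory holds. */
static void toy_experiment(int evicted, uint32_t out[WORDS_PER_LINE])
{
	struct toy_line l = {0};

	for (int i = 0; i < WORDS_PER_LINE; i++)  /* loop(42) */
		toy_store(&l, i, 42u);
	toy_clean(&l);                            /* SCB_CleanInvalidateDCache() */
	toy_invalidate(&l);

	for (int i = 0; i < WORDS_PER_LINE; i++)  /* loop(79) */
		toy_store(&l, i, 79u);
	if (evicted)
		toy_clean(&l);                        /* eviction writes the line back */
	toy_invalidate(&l);                       /* SCB_InvalidateDCache() */

	memcpy(out, l.mem, sizeof l.mem);         /* what the print loop reads */
}
```

In this model a line ends up either all 42 (never evicted) or all 79 (evicted after the stores). The pattern 79 42 42 42 42 42 42 42 cannot occur, because a write miss never updates memory on its own.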

berendi

Principal

Looks like the explanation of write-back in AN4838 is misleading, OP is not the first one to fall for it.

I thought there was at least an LRU-based eviction algorithm, but pseudo-random explains a lot.
