Weird cache write-back behavior for STM32F7508
I am characterizing the cache behavior of the STM32F7508, and the results are hard to understand.
Below is the simple code that I use. I align the array A1 to the cache line size (aligned(32)), and it is placed in SRAM1 above the TCM (0x20010000~), where the cache policy is WBWA (write-back, write-allocate) by default.
Specifically, A1 starts at 0x200109c0.
In my code, there is a function (void loop) and a commented-out for loop that is functionally equivalent.
I keep both versions because switching between them changes the behavior (described below).
I compiled my code with -O0 (no optimizations).
#include <stdint.h>
#include <stdio.h>
#include "printf.h"
#include "clock.h"
#define SIZE (128)
__attribute__ ((aligned (32)))
uint32_t A1[SIZE] = {0};
void loop(unsigned val)
{
    for (int i = 0; i < SIZE; i++)
        A1[i] = val;
}

int main(void)
{
    /* Enable I-Cache */
    SCB_InvalidateICache();
    SCB_EnableICache();
    /* Enable D-Cache */
    SCB_InvalidateDCache();
    SCB_EnableDCache();

    HAL_Init();
    init_clock();
    UART_INIT();

    // Try toggling PG_6 (D2)
    __HAL_RCC_GPIOG_CLK_ENABLE();
    GPIO_InitTypeDef gpio_init_structure;
    gpio_init_structure.Pin = GPIO_PIN_6;
    gpio_init_structure.Mode = GPIO_MODE_OUTPUT_PP;
    gpio_init_structure.Pull = GPIO_PULLDOWN;
    gpio_init_structure.Speed = GPIO_SPEED_HIGH;
    HAL_GPIO_Init(GPIOG, &gpio_init_structure);
    HAL_GPIO_WritePin(GPIOG, GPIO_PIN_6, GPIO_PIN_RESET);

    printf("Start\r\n");

    loop(42);
    //for (int i = 0; i < SIZE; ++i)
    //    A1[i] = 42;
    SCB_CleanInvalidateDCache();

    gpioSet();
    loop(79);
    //for (int i = 0; i < SIZE; ++i)
    //    A1[i] = 79;
    gpioReset();
    SCB_InvalidateDCache();

    for (int i = 0; i < SIZE; ++i) {
        printf("%u ", A1[i]);
        if (i % 16 == 15)
            printf("\r\n");
    }
    printf("\r\n");
    return 0;
}

What I expect as output:
Either every 8 words look like
79 42 42 42 42 42 42 42 ...
or maybe because of a prefetcher (whose behavior I haven't fully understood yet),
42 42 42 42 42 42 42 42 ...
Reality: sometimes I see
42 42 42 42 42 42 42 42, or
79 79 79 79 79 79 79 79,
but never 79 42 42 42 42 42 42 42.
Also the result depends on whether I use the function call or the (commented out) loop.
I will explain why I think this is weird:
After writing 42 to the array, I clean and invalidate the cache.
So, the cache should not contain any data (or maybe the prefetcher filled some of the cache), and memory should contain 42.
Now when I try to write 79 from the second loop, one of the two must happen.
1) Cache miss - If the prefetcher did not fill the cache line, the first write must miss, because I invalidated the entire cache. Because the policy is WBWA, my understanding is that 79 will be written to memory and the cache line will be filled. The following 7 accesses will then hit, updating only the cache and not memory.
When this happens, the second invalidate discards the data in the cache, so it should print 79 42 42 42 42 42 42 42 (only the first write went to memory because it was a miss).
2) Cache hit - I originally thought a cache hit was impossible for aligned accesses, but if a prefetcher filled the cache line (I'm not sure how the prefetcher works), I guess it can hit. In this case, the code will only update the cache, so after the invalidate all the writes are discarded and it would print: 42 42 42 42 42 42 42 42.
However, the printed result differs depending on whether I use the function call or the commented-out loop. Writing each combination as (before the first invalidate)-(before the second invalidate):
call-call: all 42 printed
for-call: all 79 printed
call-for: either 42 42 42 42 42 42 42 42 or 79 79 79 79 79 79 79 79 printed (mixed)
for-for: all 79 printed.
So first, it is weird that the behavior changes. Also, the result is deterministic.
(Maybe because the prefetcher changes its behavior?)
Also, what is even weirder to me is that 79 42 42 42 42 42 42 42 is never printed, and 79 79 79 79 79 79 79 79 is printed instead! This is very unexpected given my understanding of WBWA.
I think this would be more understandable if the cache were no-write-allocate. However, the documentation says the default behavior for the 0x20000000 region is WBWA (I also explicitly tried setting the region to WBWA with the MPU, and the result was the same).
When I timed the second part (after the first cache clean), for-for and call-for ran in roughly the same time; call-call was slightly faster than for-call.
So, (1) why is it not printing what I expect from WBWA, and (2) why does the result differ depending on whether I use a (functionally equivalent) for loop or a call?
I confirmed by looking at the binary that no other memory access is generated, i.e., nothing is evicting the cache.
This is very confusing to me... Is my cache setting somehow messed up? Am I misunderstanding something, or is there a prefetcher doing something that I do not understand? Is there something like a hardware write buffer that aggregates the writes and updates the entire cache line at once?
I appreciate any thoughts or comments!