2020-02-20 01:28 PM
I am doing a characterization study of the cache behavior on the STM32F7508. The results are hard to understand.
Below is the simple code I use. The array A1 is aligned to the cache line size (aligned(32)) and is placed in SRAM1, above the TCM (0x20010000~), where the default cache policy is WBWA.
Specifically, A1 starts at 0x200109c0.
The code contains a function (void loop) and a commented-out for loop that is functionally equivalent.
I have both versions because using one or the other changes the behavior (described below).
I compiled my code with -O0 (no optimizations).
#include <stdint.h>
#include <stdio.h>
#include "printf.h"
#include "clock.h"
#define SIZE (128)
__attribute__ ((aligned (32)))
uint32_t A1[SIZE] = {0};
void loop(unsigned val)
{
    for (int i = 0; i < SIZE; i++)
        A1[i] = val;
}
int main(void)
{
    /* Enable I-Cache */
    SCB_InvalidateICache();
    SCB_EnableICache();
    /* Enable D-Cache */
    SCB_InvalidateDCache();
    SCB_EnableDCache();
    HAL_Init();
    init_clock();
    UART_INIT();
    // Try toggling PG_6 (D2)
    __HAL_RCC_GPIOG_CLK_ENABLE();
    GPIO_InitTypeDef gpio_init_structure;
    gpio_init_structure.Pin = GPIO_PIN_6;
    gpio_init_structure.Mode = GPIO_MODE_OUTPUT_PP;
    gpio_init_structure.Pull = GPIO_PULLDOWN;
    gpio_init_structure.Speed = GPIO_SPEED_HIGH;
    HAL_GPIO_Init(GPIOG, &gpio_init_structure);
    HAL_GPIO_WritePin(GPIOG, GPIO_PIN_6, GPIO_PIN_RESET);
    printf("Start\r\n");
    loop(42);
    //for (int i = 0; i < SIZE; ++i)
    //    A1[i] = 42;
    SCB_CleanInvalidateDCache();
    gpioSet();
    loop(79);
    //for (int i = 0; i < SIZE; ++i)
    //    A1[i] = 79;
    gpioReset();
    SCB_InvalidateDCache();
    for (int i = 0; i < SIZE; ++i) {
        printf("%u ", A1[i]);
        if (i % 16 == 15)
            printf("\r\n");
    }
    printf("\r\n");
    return 0;
}
What I expect as an output:
Either each group of 8 words (one cache line) looks like
79 42 42 42 42 42 42 42 ...
or maybe because of a prefetcher (whose behavior I haven't fully understood yet),
42 42 42 42 42 42 42 42 ...
Reality: sometimes I see
42 42 42 42 42 42 42 42, or
79 79 79 79 79 79 79 79,
but never 79 42 42 42 42 42 42 42.
Also the result depends on whether I use the function call or the (commented out) loop.
I will explain why I think this is weird:
After writing 42 to the array, I clean and invalidate the cache.
So, the cache should not contain any data (or maybe the prefetcher filled some of the cache), and memory should contain 42.
Now when I try to write 79 from the second loop, one of the two must happen.
1) Cache miss - If the prefetcher did not fill the cacheline, it must be a cache miss because I invalidated the entire cache. Because it is WBWA, 79 will be written to the memory and the cacheline will be filled. The following 7 accesses will be a cache hit and will only update the cache and will not update the memory.
When this happens, after the second invalidate, the data in the cache will be deleted so it should print 79 42 42 42 42 42 42 42 (only the first write went to the memory because it was a miss).
2) Cache hit - I originally thought a cache hit was impossible for aligned accesses, but if a prefetcher filled the cacheline (I'm not sure how the prefetcher works), I guess it can be a hit. In this case, the code will only update the cache, so when it is invalidated, all the writes will be discarded and it would print: 42 42 42 42 42 42 42 42.
However, the printed result differs depending on whether I use the function call or the (commented-out) loop. Writing each combination as (code used before the first invalidate)-(code used before the second invalidate):
call-call: all 42 printed
for-call: all 79 printed
call-for: either 42 42 42 42 42 42 42 42 or 79 79 79 79 79 79 79 79 printed (mixed)
for-for: all 79 printed.
So first, it is weird that the behavior changes. Also, the result is deterministic.
(Maybe because the prefetcher changes its behavior?)
Also, what is even weirder to me is that 79 42 42 42 42 42 42 42 never prints and instead 79 79 79 79 79 79 79 79 is printed! This is very unexpected given my understanding of WBWA.
I think this would be more understandable if the cache were no-allocate on write. However, the document says the default behavior for the 0x20000000 region is WBWA (I also explicitly tried setting the region to WBWA with the MPU and the result was the same).
When I timed the second part after the first cache clean, for-for and call-for ran in roughly the same time; call-call was slightly faster than for-call.
So, (1) why is it not printing what is expected from WBWA, and (2) why does the result differ depending on whether I use a (functionally equivalent) for loop or a call?
I confirmed by looking at the binary that no other memory access is generated, i.e., nothing is evicting the cache.
This is very confusing to me... Is my cache setting somehow messed up? Or am I misunderstanding something, or is there a prefetcher doing something that I do not understand? Is there something like write-buffer hardware that aggregates the writes and updates the entire cache line at once?
I appreciate any thoughts or comments!
2020-02-27 06:46 AM
Oh, I guess my Makefile was not completely correct: part of the LLVM toolchain, specifically the llc step, wasn't getting the -O0 flag, so the backend was optimizing away some code. I'll rerun with -O0 everywhere and see if it makes any difference. I still think the previous result is weird, because I was looking at the assembly and it was doing things that weren't obvious... but maybe -Os was doing something unexpected.
Thanks!
2020-02-28 05:33 AM
Calling or not calling a function can rather affect the branch prediction at the end of the loop.
2020-02-29 10:29 AM
Using SCB_InvalidateDCache() is dangerous and can corrupt internal variables of the C runtime, C libraries, the HAL or other code. That is most likely not the case here, because of the previous call to SCB_CleanInvalidateDCache() and the limited, controlled code between the two calls, but be careful with whole-cache operations.
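As a hedged illustration of a narrower alternative to whole-cache operations: the by-address CMSIS routines (for example SCB_InvalidateDCache_by_Addr()) work on whole cache lines, so the range handed to them is normally rounded out to line boundaries. The sketch below only computes that rounded range and runs on any host. The names cache_range, cache_line_range and DCACHE_LINE_SIZE are hypothetical, and the 32-byte line size is an assumption taken from the aligned(32) in the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte D-cache lines, matching aligned(32) in the posted code. */
#define DCACHE_LINE_SIZE 32u

/* Hypothetical helper type: a line-aligned address range. */
typedef struct {
    uintptr_t start;   /* rounded down to a line boundary */
    size_t    length;  /* multiple of DCACHE_LINE_SIZE */
} cache_range;

/* Round [addr, addr + size) outward to whole cache lines. */
static cache_range cache_line_range(uintptr_t addr, size_t size)
{
    const uintptr_t mask = ~(uintptr_t)(DCACHE_LINE_SIZE - 1u);
    assert(size > 0);

    uintptr_t start = addr & mask;
    uintptr_t end   = (addr + size + DCACHE_LINE_SIZE - 1u) & mask;

    cache_range r = { start, (size_t)(end - start) };
    return r;
}
```

For A1 at 0x200109c0 with 512 bytes the range comes back unchanged, because the array is already line-aligned. A 4-byte object at 0x200109c4 expands to the full 32-byte line starting at 0x200109c0, which is why invalidating a buffer that shares a line with other data can discard that other data's dirty contents.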
You have misunderstood write-back vs. write-through.
http://www.emcu.it/STM32F7/Slide/Cache2.png
http://www.emcu.it/STM32F7/STM32F7xx.html
AN4839 page 5:
AN4839 page 4:
So output kind-of "should" be 42 42 42 42 42 42 42 42, but output of 79 79 79 79 79 79 79 79 shows that eviction happened to that line before SCB_InvalidateDCache(). The dependency of a result on a function call vs local loop also "confirms" that.
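The reasoning above can be sketched as a toy model of a single write-back, write-allocate cache line. This is a host-side illustration under stated assumptions, not the Cortex-M7 implementation: wbwa_model, model_write, model_clean, model_invalidate and run_experiment are hypothetical names, and an eviction is modeled simply as a write-back of the dirty line before the final invalidate.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_WORDS 8  /* one 32-byte line holds eight 32-bit words */

/* Toy model: one line of backing SRAM and the single cache line covering it. */
typedef struct {
    uint32_t mem[LINE_WORDS];   /* what main memory holds */
    uint32_t line[LINE_WORDS];  /* what the cache line holds */
    int valid;
    int dirty;
} wbwa_model;

/* Write-allocate: a miss first fills the line from memory,
 * then the store lands in the cache only and marks the line dirty. */
static void model_write(wbwa_model *m, int idx, uint32_t val)
{
    assert(idx >= 0 && idx < LINE_WORDS);
    if (!m->valid) {
        memcpy(m->line, m->mem, sizeof m->line);
        m->valid = 1;
    }
    m->line[idx] = val;
    m->dirty = 1;
}

/* Write a dirty line back to memory (explicit clean or eviction). */
static void model_clean(wbwa_model *m)
{
    if (m->valid && m->dirty) {
        memcpy(m->mem, m->line, sizeof m->mem);
        m->dirty = 0;
    }
}

/* Drop the line without writing it back. */
static void model_invalidate(wbwa_model *m)
{
    m->valid = 0;
    m->dirty = 0;
}

/* Replays the sequence from the posted code for one cache line.
 * evict_before_invalidate != 0 models the line being evicted
 * (written back) before the final invalidate. */
static void run_experiment(int evict_before_invalidate, uint32_t out[LINE_WORDS])
{
    wbwa_model m;
    memset(&m, 0, sizeof m);

    for (int i = 0; i < LINE_WORDS; i++)
        model_write(&m, i, 42);
    model_clean(&m);            /* SCB_CleanInvalidateDCache(): clean ... */
    model_invalidate(&m);       /* ... then invalidate */

    for (int i = 0; i < LINE_WORDS; i++)
        model_write(&m, i, 79);
    if (evict_before_invalidate)
        model_clean(&m);
    model_invalidate(&m);       /* SCB_InvalidateDCache() */

    memcpy(out, m.mem, sizeof m.mem);
}
```

Calling run_experiment(0, out) leaves eight 42s in the modeled memory, and run_experiment(1, out) leaves eight 79s. In this model a store never updates a single word of main memory on its own, so a 79 42 42 42 ... pattern cannot occur.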
2020-02-29 10:43 AM
Looks like the explanation of write-back in AN4838 is misleading, OP is not the first one to fall for it.
I thought there was at least an LRU-based eviction algorithm, but pseudo-random explains a lot.
2020-03-04 08:04 AM
I know what a write-through vs. write-back is.
I thought the output "should" be 79 42 42 42 42 42 42 42 because write-back comes in two types: WBWA (allocate on write miss) and regular write-back (no allocate on write miss). ST's document says (emphasis mine) "Write-back with write and read allocate: on hits it writes to the cache setting dirty bit for the block, the main memory is not updated. On misses it updates the block in the main memory and brings the block to the cache." (from https://www.st.com/content/ccc/resource/technical/document/application_note/group0/bc/2d/f7/bd/fb/3f/48/47/DM00272912/files/DM00272912.pdf/jcr:content/translations/en.DM00272912.pdf#%5B%7B%22num%22%3A16%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D%2Cnull%2C725.66931%2Cnull%5D). See how it says "on misses it updates the block in the main memory"?
(Although from the experiment it seems that the document is wrong)
Even so, printing 79 79 79 79 79 79 79 79 is weird, because I confirmed by reading the assembly that, with or without the function call, nothing else touches the stack. So unless ST is using some weird hash function for cache mapping, consecutive accesses to that small amount of data should not evict anything from the cache, as far as I understand.
Pseudo-random is definitely weird... though 512 bytes of data access is nowhere near the 4 KB cache size, so no eviction of any kind should be happening to begin with.
The only explanation I can think of is that ST is implementing some hash in mapping the memory address, so collision happens even within accessing such a small array. However, if that is true that is a very bad hash function not even worth implementing.
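The no-conflict argument can be checked with a few lines of arithmetic. The sketch below is an illustration under assumptions not stated in this thread: 32-byte lines, a 4-way set-associative D-cache (as ARM describes for the Cortex-M7), and a conventional set index taken directly from the address bits with no hashing. All names (set_index, max_lines_per_set and the macros) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_SIZE 4096u  /* 4 KB D-cache, as mentioned in the thread */
#define LINE_SIZE  32u    /* assumption: 32-byte cache lines */
#define NUM_WAYS   4u     /* assumption: 4-way set associative */
#define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))  /* 32 sets */

/* Assumption: the set index is the line number modulo the set count. */
static unsigned set_index(uint32_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

/* Largest number of lines in [base, base + size) that share one set.
 * A value above NUM_WAYS would mean the buffer conflicts with itself. */
static unsigned max_lines_per_set(uint32_t base, uint32_t size)
{
    unsigned count[NUM_SETS] = {0};
    unsigned max = 0;
    uint32_t first = base & ~(LINE_SIZE - 1u);

    assert(size > 0);
    for (uint32_t addr = first; addr < base + size; addr += LINE_SIZE) {
        unsigned n = ++count[set_index(addr)];
        if (n > max)
            max = n;
    }
    return max;
}
```

Under these assumptions, A1 at 0x200109c0 spans 16 lines that map to sets 14 through 29, one line per set, so max_lines_per_set(0x200109c0, 512) is 1. The array alone cannot force an eviction in such a cache; any eviction would have to come from other accesses or from a mapping that differs from the one assumed here.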
Another explanation is that maybe there is a data prefetcher. This would kind of explain the different behavior on calls vs. loops, because the predictor would predict differently. However, the document confirms there is an I-cache prefetcher, and nothing indicates a data prefetcher. Does a data prefetcher exist in my chip?
*I also tried invalidating cache for only the memory address for the array instead of the entire memory. The result was still the same.
2020-03-04 10:20 AM
> Although from the experiment it seems that the document is wrong
I too think that the document is simply wrong, it's actually WBWA. I recall seeing a similarly confused post about it the other day. The cache belongs to the licensed ARM core, so ARM documents might be more trustworthy on the matter.
To prove it experimentally, you can configure some external memory, mark it as WBWA in the MPU, and watch the read/write enable pins with a scope.
> nobody additionally touches the stack
Not even the attached debugger probe? Everything it does goes through the ARM core.
> Does a data prefetcher exist in my chip?
Does the cache itself count, prefetching a whole cache line on read?
Then the instruction pipeline might contain one. Or it might not, a quick googling indicates it's not vulnerable to Spectre/Meltdown, but there is only the manufacturer's word for it (which comes from the PR department, so not worth the bits it's published on). It would be interesting to check one day.
But look carefully at Figure 1 and Figure 2 in chapter 2 of the reference manual. Note the two possible paths from the M7 core to the flash. Which one is taken depends on the address your program is linked at: based at 0x08000000, instruction fetches go through the same L1 cache as SRAM accesses; based at 0x00200000, they bypass the L1 cache and go through the ART cache sitting in front of the flash. (The same flash is mapped at both addresses.) Looking at a few randomly selected F7 projects, they all seem to be linked at 0x08000000. Poor ART cache, I'd bet it sits unused for its whole life in more than 99.9% of the STM32F7 units sold.
You might want to repeat the experiment with all data (stack and variables) in DTCM except the array, and the function executed from ITCM RAM. Unroll the whole loop to negate any possible effect of the branch predictor, which can act weirdly too.