STM32F746: Instruction Cache

jerry2 · ‎2017-07-08

Posted on July 09, 2017 at 06:38

I've got a simple test that runs under an RTOS. There are two identical tasks that blink separate LEDs using a software delay loop. The two tasks are identical except for the LED that each blinks. Both tasks have identical priorities, so the RTOS runs them in strict round robin.

If I have task1 first in the C file, then the LED associated with that task blinks at twice the rate of the LED associated with task2. If I swap the positions of the two tasks in the C file, then task2's LED blinks twice as fast as task1's LED. Swapping the tasks in the C file swaps their positions in FLASH.

The only thing that changes when I swap the positions of task1 and task2 in memory is the code addresses. The data and stack stay in the same place.

The only explanation I can think of for this behavior is that the relative positions of these two tasks in memory affect how the code is cached by the CPU. This is where I'm puzzled, however. According to the datasheet, only FLASH on the ITCM interface uses the ART accelerator. My code is in the FLASH on the AXIM interface.

Is there another instruction cache different from the ART accelerator? Is there any way to disable it to check whether it's the cause of the effect I'm seeing? I didn't find anything in either the datasheet or the reference manual.

This isn't a subtle effect--the software delay loop runs twice as fast in one task relative to the other.

waclawek.jan · ‎2017-07-09

Posted on July 09, 2017 at 12:08

Read AN4839 and AN4838.

JW

jerry2 · ‎2017-07-09

Posted on July 09, 2017 at 23:39

Interesting. Thank you. I missed that in my reading of the manual.

I read those two app notes and better understand the relationship between the L1 cache and FLASH.

Now I'm seeing behavior that's even more puzzling. There are two bits in the CCR (Configuration and Control Register) on a Cortex-M7, CCR.IC and CCR.DC, that control the enabling of the instruction and data caches. In my case, both of these bits were set to zero when I ran my test yesterday. So with the instruction cache off, two identical tasks would run at different speeds depending on where they were placed in memory (FLASH). With certain placements, one task ran about twice as fast as the other.
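As a rough illustration of the two bits described above, the sketch below decodes the cache-enable bits from a CCR value. The bit positions (16 for DC, 17 for IC) match `SCB_CCR_DC_Pos` and `SCB_CCR_IC_Pos` in the CMSIS `core_cm7.h` header; the helper names are hypothetical and are not part of CMSIS. The functions take a plain value so they can be tried on a host; on the target one would presumably pass `SCB->CCR`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cache-enable bits in the Cortex-M7 CCR.
 * Positions mirror SCB_CCR_DC_Pos (16) and SCB_CCR_IC_Pos (17) in CMSIS. */
#define CCR_DC_MSK (UINT32_C(1) << 16)
#define CCR_IC_MSK (UINT32_C(1) << 17)

/* Hypothetical helper: true if the instruction cache enable bit is set. */
static bool ccr_icache_enabled(uint32_t ccr)
{
    return (ccr & CCR_IC_MSK) != 0u;
}

/* Hypothetical helper: true if the data cache enable bit is set. */
static bool ccr_dcache_enabled(uint32_t ccr)
{
    return (ccr & CCR_DC_MSK) != 0u;
}
```

On hardware, a call such as `ccr_icache_enabled(SCB->CCR)` early in `main()` would confirm which state a given test run was actually in.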

I ran the test again this afternoon, and confirmed that one task runs about twice as fast as the other with CCR.IC cleared. The really puzzling thing is when I set CCR.IC to '1' to enable the instruction cache, both tasks run at exactly the same speed! I'm at a loss to explain this. My previous assumption was that one task ran faster than the other because it was preferentially cached in the instruction cache. Now I find that I had the instruction cache turned off yesterday, but now when I turn it on the issue goes away.

Anyone have any idea what may be going on here?

jerry2 · ‎2017-07-17

Posted on July 18, 2017 at 00:09

What is the relationship between the CCR.IC bit and the memory protection unit (MPU) on an STM32F7? The ARM documentation says 'It is IMPLEMENTATION DEFINED whether the CCR.DC and CCR.IC bits affect the memory attributes generated by an enabled MPU.'

AN4839 says that when the MPU is disabled (which it is in my case), a default memory map is used, and according to the app note, the code region's cache policy is write-through. Does that mean that the instruction cache is on regardless of the state of CCR.IC?

STOne-32 · ‎2017-07-17

Posted on July 18, 2017 at 00:21

Dears,

In addition, see

http://www.st.com/content/ccc/resource/technical/document/application_note/0e/53/06/68/ef/2f/4a/cd/DM00169764.pdf/files/DM00169764.pdf/jcr:content/translations/en.DM00169764.pdf

for the global picture of our STM32F7 performance. Then go to Section 5: Software memory partitioning and tips.

Happy reading.

Cheers,

STONe-32.

jerry2 · ‎2017-07-17

Posted on July 18, 2017 at 01:34

I've read AN4667 and it doesn't address the issue of MPU versus CCR enabling of the instruction cache.

I'm still puzzled why one section of code executes faster than another section of code when the instruction cache is turned off. This is counterintuitive behavior because neither piece of code is cached and hence both should run at the same speed. Moving the two pieces of code around in FLASH relative to each other affects their speed. This effect would imply some kind of caching is happening.

Is there perhaps some other mechanism at play here? Perhaps contention on the AXI/AHB interface?

waclawek.jan · ‎2017-07-18

Posted on July 18, 2017 at 11:21

I believe you have too many variables in play.

I'd start with simple and single code (no RTOS, no 'multiple instances'), trying to absolutely position its bulk at various offsets within the FLASH data width (64 bit of buswidth, and 256? bit of the actual FLASH, while instruction granularity is 16 bit), and try the influence of that. Changing latency (running it at low system clock) should show what's the real influence of FLASH access. I'd also try to estimate the impact of RAM/data accesses, stack, whatever. I'd then try to play with the caches.
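To make the alignment experiment above easier to reason about, here is a minimal host-side sketch that computes where a piece of code sits inside an aligned block and how many blocks it touches. The widths used in the usage note (8 bytes for the 64-bit bus, 32 bytes for a 256-bit FLASH line) follow the figures quoted tentatively in this post and are assumptions, as are the function names and the example addresses. Whether line crossing is the actual cause of the speed difference is not established anywhere in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of addr inside an aligned block of width_bytes.
 * width_bytes must be a non-zero power of two. */
static uint32_t offset_in_block(uint32_t addr, uint32_t width_bytes)
{
    assert(width_bytes != 0u && (width_bytes & (width_bytes - 1u)) == 0u);
    return addr & (width_bytes - 1u);
}

/* Number of aligned blocks of width_bytes touched by size_bytes of
 * code starting at addr. */
static uint32_t blocks_spanned(uint32_t addr, uint32_t size_bytes,
                               uint32_t width_bytes)
{
    assert(size_bytes != 0u);
    assert(width_bytes != 0u && (width_bytes & (width_bytes - 1u)) == 0u);
    uint32_t first = addr / width_bytes;
    uint32_t last = (addr + size_bytes - 1u) / width_bytes;
    return last - first + 1u;
}
```

For example, under these assumptions an 8-byte loop at 0x08000100 fits in one 32-byte line, while the same loop at 0x0800011C straddles two lines. Comparing such numbers against the addresses in the map file for the fast and slow placements would show whether the two cases differ in this respect.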

JW

jerry2 · ‎2017-07-18

Posted on July 18, 2017 at 18:28

I've simplified things by removing the RTOS and running very simple software delay loops with interrupts disabled.

The code continues to exhibit a difference in speed depending on where it's loaded in FLASH. I've ruled out data accesses for two reasons: moving the code around did not change the location of the counter variable in SRAM, and I recoded the delay loop in assembly and kept the counter in a CPU register (so there was no data access at all).

When I invalidate the I cache and turn it on (CCR.IC = 1), the loop runs at the same speed no matter where it's placed in FLASH.
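The invalidate-then-enable step mentioned above can be sketched as follows. This is not the poster's code. It mirrors the general order used by the CMSIS function `SCB_EnableICache()` (invalidate via ICIALLU, then set CCR.IC, with barriers in between), but it takes register pointers as parameters so that it can be exercised on a host. The function name and the `BARRIER` macro are hypothetical; on the target one would presumably pass `&SCB->CCR` and `&SCB->ICIALLU` and define `BARRIER()` as `__DSB(); __ISB();`, or simply call the CMSIS function.

```c
#include <assert.h>
#include <stdint.h>

/* CCR.IC, bit 17 (same position as SCB_CCR_IC_Pos in CMSIS). */
#define ICACHE_ENABLE_BIT (UINT32_C(1) << 17)

/* Host build: barriers are no-ops. On a Cortex-M7 target this would be
 * defined as the CMSIS intrinsics __DSB(); __ISB(); (assumption). */
#ifndef BARRIER
#define BARRIER() ((void)0)
#endif

/* Hypothetical helper: invalidate the whole I-cache, then enable it. */
static void icache_invalidate_and_enable(volatile uint32_t *ccr,
                                         volatile uint32_t *iciallu)
{
    assert(ccr != 0 && iciallu != 0);
    BARRIER();
    *iciallu = 0u;              /* a write to ICIALLU invalidates the I-cache */
    BARRIER();
    *ccr |= ICACHE_ENABLE_BIT;  /* set CCR.IC, leave other bits untouched */
    BARRIER();
}
```

Invalidating before enabling matters because the cache contents are not defined after reset, so enabling it first could execute stale lines.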

Running at 216 MHz or 108 MHz doesn't make a difference.

waclawek.jan · ‎2017-07-18

Posted on July 18, 2017 at 21:48

Hummm then...

JW

David Littell · ‎2017-07-18

Posted on July 19, 2017 at 00:10

gardner.jerry wrote:

...

The code continues to exhibit a difference in speed depending on where it's loaded in FLASH.

...

This is probably an excellent question for your friendly, local FAE. Please let us know if you actually get an answer.

gardner.jerry wrote:

...

When I invalidate the I cache and turn it on (CCR.IC = 1), the loop runs at the same speed no matter where it's placed in FLASH.

...

I should hope so: you're running exclusively from the cache at this point. You won't see the difference again until your code starts missing and forcing line loads from Flash.

Dave