STM32F746: Instruction Cache

jerry2 · ‎2017-07-08

Posted on July 09, 2017 at 06:38

I've got a simple test that runs under an RTOS. Two tasks blink separate LEDs using a software delay loop; they are identical except for the LED that each blinks. Both tasks have the same priority, so the RTOS runs them in strict round robin.

If I have task1 first in the C file, then the LED associated with that task blinks at twice the rate of the LED associated with task2. If I swap the positions of the two tasks in the C file, then task2's LED blinks twice as fast as task1's LED. Swapping the tasks in the C file swaps their positions in FLASH.

The only thing that changes when I swap the positions of task1 and task2 in memory is the code addresses. The data and stack stay in the same place.

The only explanation for this behavior that I can think of is that the relative positions of these two tasks in memory affect how the code is cached by the CPU. This is where I'm puzzled, however. According to the datasheet, only FLASH on the ITCM interface uses the ART accelerator. My code is in the FLASH on the AXIM interface.

Is there another instruction cache different from the ART accelerator? Is there any way to disable it to check whether it's the cause of the effect I'm seeing? I didn't find anything in either the datasheet or the reference manual.

This isn't a subtle effect--the software delay loop runs twice as fast in one task relative to the other.

waclawek.jan · ‎2017-07-19

Posted on July 19, 2017 at 13:29

Ah so. I misunderstood the 'When I invalidate the I cache and turn it on (CCR.IC = 1), the loop runs at the same speed' part, reading it as 'runs at the same speed as when the cache is turned off'. As David says, this makes sense: running from cache should result in uniform execution time, because the cache provides instructions to the processor with no latency, as long as the bulk of the code fits into the cache and FLASH fetches are avoided entirely.

When FLASH fetches are involved, as when the cache is off, execution is significantly impacted by the number of FLASH reads needed to execute the bulk of the code (one pass through the timing loop in this case), and of course by the FLASH read latency. That latency is one of the variables I suggested playing with; note that you are free to use a higher latency than required at a lower system clock, but not the other way round.

The FLASH row is 128 or 256 bits wide (see the FLASH chapter in the RM). While there's no prefetch/ART on the AHB (i.e. non-TCM) port, I believe the whole row is latched when any part of it is read, and is then served from the latch if a portion of the same row is read subsequently. This is not documented, but it would be rather foolish to repeat a FLASH row read. One never knows, of course; ST may be willing to comment on this.

Instruction fetches are then transferred over a 64-bit AHB bus into the AXIM-to-AHB interface and on to the AXIM port of the processor. The Cortex-M7 features a 64-bit-wide prefetch unit with a 4x64-bit pipeline, and there's also a speculative jump. The Thumb2 instructions are 16 or 2x16 bits wide, and the processor core is dual-issue superscalar. There may also be data embedded in the code, such as 32-bit literals. So there are plenty of opportunities for caching-like behaviour at various stages between FLASH and processor, and all of it depends on the particular alignment of things.
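To make the alignment argument concrete, here is a toy model in C, not a description of the actual STM32F746 FLASH interface. It assumes, as speculated above, that a whole FLASH row is latched per read, and simply counts how many rows a block of code spans. The function name, the addresses, and the 20-byte loop size are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: number of distinct FLASH rows covered by a code block that
   starts at byte address `start` and is `length` bytes long, for a row
   width of `row_bytes` (16 for a 128-bit row, 32 for a 256-bit row).
   Assumption: one FLASH read latches one whole row. */
static uint32_t flash_rows_touched(uint32_t start, uint32_t length,
                                   uint32_t row_bytes)
{
    assert(length > 0u && row_bytes > 0u);

    uint32_t first_row = start / row_bytes;
    uint32_t last_row  = (start + length - 1u) / row_bytes;

    return last_row - first_row + 1u;
}
```

With 32-byte rows, a hypothetical 20-byte loop placed at 0x08000200 fits in one row, while the same loop placed at 0x08000214 straddles two. Under this model the second placement needs twice as many row reads per pass, which is the kind of position-dependent difference described in the original post.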

The details are thus too numerous to be easily understood or captured in a simple model. There's a lot of handwaving when it comes to the details of the 32-bitters.

I always say the 32-bitters are NOT microcontrollers, they are SoCs; but from a marketing point of view, SoC is not a sexy enough word...

JW

jerry2 · ‎2017-07-19

Posted on July 19, 2017 at 19:50

Yeah, there's obviously a lot going on even with the instruction cache turned off. Since everything runs without variance (and much faster too) with the instruction cache turned on, I think I'll let sleeping dogs lie and not pursue this any further.

Thanks for all your comments.

STOne-32 · ‎2017-07-20

Posted on July 20, 2017 at 19:51

Dear gentlemen,

The CCR register belongs to the Cortex-M7 core itself and is described in our Cortex-M7 programming manual, PM0253, section 4.3.7, "Configuration and Control Register". The DC and IC bits enable or disable the data cache and the instruction cache, respectively. Enabling and disabling the caches is handled by two macros in the CMSIS library: SCB_EnableDCache() and SCB_EnableICache().

It's not recommended to play with these bits unless you know what you are doing and follow the complete cache-enabling procedure (including the cache invalidate), which the two macros provided above already implement.
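For anyone who only wants to check which caches are currently on, a minimal sketch in C follows. It decodes a CCR value using the Cortex-M7 bit positions (DC is bit 16, IC is bit 17; CMSIS exposes the same masks as SCB_CCR_DC_Msk and SCB_CCR_IC_Msk). The helper names are hypothetical and not part of CMSIS, and the code only inspects a value; it does not touch the hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cache-enable bit positions in the Cortex-M7 SCB->CCR register. */
#define CCR_DC_BIT 16u   /* data cache enable        */
#define CCR_IC_BIT 17u   /* instruction cache enable */

static_assert(CCR_IC_BIT == CCR_DC_BIT + 1u, "IC sits directly above DC");

/* Hypothetical helpers: decode a CCR value that was read elsewhere. */
static bool icache_enabled(uint32_t ccr)
{
    return ((ccr >> CCR_IC_BIT) & 1u) != 0u;
}

static bool dcache_enabled(uint32_t ccr)
{
    return ((ccr >> CCR_DC_BIT) & 1u) != 0u;
}
```

On the target you would pass in the value read from SCB->CCR, for example icache_enabled(SCB->CCR). To actually switch the instruction cache for an experiment like the one in this thread, use the CMSIS calls SCB_EnableICache() and SCB_DisableICache() rather than writing the bits directly, since they take care of the invalidate and barrier steps.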

Cheers,

STOne-32.