jerry2
Associate III
July 9, 2017
Question

STM32F746: Instruction Cache

Posted on July 09, 2017 at 06:38

I've got a simple test that runs under an RTOS. There are two identical tasks that blink separate LEDs using a software delay loop. The two tasks are identical except for the LED that each blinks. Both tasks have identical priorities, so the RTOS runs them in strict round robin.

If I have task1 first in the C file, then the LED associated with that task blinks at twice the rate of the LED associated with task2. If I swap the positions of the two tasks in the C file, then task2's LED blinks twice as fast as task1's LED. Swapping the tasks in the C file swaps their positions in FLASH.

The only thing that changes when I swap the positions of task1 and task2 in memory is the code addresses. The data and stack stay in the same place.

The only explanation for this behavior that I can think of is that the relative positions of these two tasks in memory affect how the code is cached by the CPU. This is where I'm puzzled, however. According to the datasheet, only FLASH on the ITCM interface uses the ART accelerator. My code is in the FLASH on the AXIM interface.

Is there another instruction cache different from the ART accelerator? Is there any way to disable it to check whether it's the cause of the effect I'm seeing? I didn't find anything in either the datasheet or the reference manual.

This isn't a subtle effect--the software delay loop runs twice as fast in one task relative to the other.

    waclawek.jan
    Super User
    July 9, 2017
    Posted on July 09, 2017 at 12:08

    [Two image attachments, not preserved: 0690X00000607ToQAI.png, 0690X00000607U2QAI.png]

    Read AN4839 and AN4838.

    JW

    jerry2 (Author)
    Associate III
    July 9, 2017
    Posted on July 09, 2017 at 23:39

    Interesting. Thank you. I missed that in my reading of the manual.

    I read those two app notes and better understand the relationship between the L1 cache and FLASH.

    Now I'm seeing behavior that's even more puzzling. There are two bits in the CCR (Configuration and Control Register) on a Cortex-M7, CCR.IC and CCR.DC, that control the enabling of the instruction and data caches. In my case, both of these bits were set to zero when I ran my test yesterday. So with the instruction cache off, two identical tasks would run at different speeds depending on where they were placed in memory (FLASH). With certain placements, one task ran about twice as fast as the other.

    I ran the test again this afternoon, and confirmed that one task runs about twice as fast as the other with CCR.IC cleared. The really puzzling thing is when I set CCR.IC to '1' to enable the instruction cache, both tasks run at exactly the same speed! I'm at a loss to explain this. My previous assumption was that one task ran faster than the other because it was preferentially cached in the instruction cache. Now I find that I had the instruction cache turned off yesterday, but now when I turn it on the issue goes away.

    Anyone have any idea what may be going on here?
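A quick way to confirm which caches are actually enabled is to read the CCR and test the two bits mentioned above. This is a minimal sketch, not code from the thread: the helper name `decode_ccr` and the `cache_state_t` type are made up for illustration. The bit positions (DC = bit 16, IC = bit 17) are the ones the Cortex-M7 defines for the CCR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cache-enable bits in the Cortex-M7 CCR (SCB->CCR at 0xE000ED14). */
#define CCR_DC_MASK (1u << 16) /* data cache enable */
#define CCR_IC_MASK (1u << 17) /* instruction cache enable */

/* Hypothetical helper type: decoded cache-enable state. */
typedef struct {
    bool icache_on;
    bool dcache_on;
} cache_state_t;

/* Decode a raw CCR value. Taking the value as a parameter keeps the
 * function testable off-target; on the device, pass SCB->CCR. */
static inline cache_state_t decode_ccr(uint32_t ccr)
{
    cache_state_t state;
    state.icache_on = (ccr & CCR_IC_MASK) != 0u;
    state.dcache_on = (ccr & CCR_DC_MASK) != 0u;
    return state;
}
```

On the target, assuming the CMSIS device header is included, this would be called as `decode_ccr(SCB->CCR)` right before starting the tasks, so the result reflects the state the test actually runs in.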

    jerry2 (Author)
    Associate III
    July 17, 2017
    Posted on July 18, 2017 at 00:09

    What is the relationship between the CCR.IC bit and the memory protection unit (MPU) on an STM32F7? The ARM documentation says 'It is IMPLEMENTATION DEFINED whether the CCR.DC and CCR.IC bits affect the memory attributes generated by an enabled MPU.'

    AN4839 says that when the MPU is disabled (which it is in my case), a default memory map is used, and according to the app note, the code region's cache policy is write-through. Does that mean that the instruction cache is on regardless of the state of CCR.IC?

    STOne-32
    ST Technical Moderator
    July 20, 2017
    Posted on July 20, 2017 at 19:51

    Dear gentleman,

    The CCR register belongs to the Cortex-M7 core itself and is described in our Cortex-M7 programming manual: PM0253, section 4.3.7, Configuration and Control Register. The DC and IC bits enable or disable the data cache and the instruction cache, respectively. Enabling the caches is handled by two functions in the CMSIS library: SCB_EnableDCache() and SCB_EnableICache().

    It’s not recommended to change these bits directly unless you know what you are doing and follow the complete cache-enabling procedure (including cache invalidation), which the two CMSIS functions above already implement.

    Cheers,

    STOne-32.

    STOne-32
    ST Technical Moderator
    July 17, 2017
    Posted on July 18, 2017 at 00:21

    Dears,

    In addition, see

    http://www.st.com/content/ccc/resource/technical/document/application_note/0e/53/06/68/ef/2f/4a/cd/DM00169764.pdf/files/DM00169764.pdf/jcr:content/translations/en.DM00169764.pdf

    for the global picture of our STM32F7 performance. Then go to Section 5: Software memory partitioning and tips.

    Happy reading.

    Cheers,

    STOne-32.

    jerry2 (Author)
    Associate III
    July 18, 2017
    Posted on July 18, 2017 at 01:34

    I've read AN4667 and it doesn't address the issue of MPU versus CCR enabling of the instruction cache.

    I'm still puzzled why one section of code executes faster than another section of code when the instruction cache is turned off. This is counterintuitive behavior because neither piece of code is cached and hence both should run at the same speed. Moving the two pieces of code around in FLASH relative to each other affects their speed. This effect would imply some kind of caching is happening.

    Is there perhaps some other mechanism at play here? Perhaps contention on the AXI/AHB interface?

    waclawek.jan
    Super User
    July 18, 2017
    Posted on July 18, 2017 at 11:21

    I believe you have too many variables in play.

    I'd start with a single, simple piece of code (no RTOS, no 'multiple instances'), positioning its bulk at various absolute offsets within the FLASH data width (64 bits of bus width and 256? bits of the actual FLASH, while instruction granularity is 16 bits), and observe the influence of that. Changing the latency (running at a low system clock) should show the real influence of FLASH access. I'd also try to estimate the impact of RAM/data accesses, stack, whatever. I'd then try to play with the caches.
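To make the alignment experiment concrete, a small helper can report where a loop sits relative to the 64-bit bus word and the FLASH row, and how many rows it touches per pass. This is only a sketch under stated assumptions: the 32-byte (256-bit) row size is the uncertain figure from the post and should be checked against the FLASH chapter of the reference manual, and the function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BUS_WORD_BYTES  8u   /* 64-bit bus width */
#define FLASH_ROW_BYTES 32u  /* ASSUMPTION: 256-bit FLASH row; verify in the RM */

/* Thumb function pointers have bit 0 set; clear it to get the code address. */
static inline uint32_t code_address(uint32_t fn_ptr)
{
    return fn_ptr & ~1u;
}

/* Byte offset of addr within a unit (unit_bytes must be a power of two). */
static inline uint32_t offset_in_unit(uint32_t addr, uint32_t unit_bytes)
{
    assert(unit_bytes != 0u && (unit_bytes & (unit_bytes - 1u)) == 0u);
    return addr & (unit_bytes - 1u);
}

/* Number of unit_bytes-sized units touched by the range [addr, addr + len). */
static inline uint32_t units_spanned(uint32_t addr, uint32_t len,
                                     uint32_t unit_bytes)
{
    assert(unit_bytes != 0u);
    if (len == 0u) {
        return 0u;
    }
    return (addr + len - 1u) / unit_bytes - addr / unit_bytes + 1u;
}
```

For example, a 12-byte loop body at 0x08000100 fits in one assumed 32-byte row, while the same loop at 0x0800011C straddles two rows. With the cache off, that difference in FLASH reads per pass is one plausible source of the placement-dependent timing, though it is a simplified model and not a confirmed explanation.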

    JW

    waclawek.jan
    Super User
    July 19, 2017
    Posted on July 19, 2017 at 13:29

    Ah so. I misunderstood the 'When I invalidate the I cache and turn it on (CCR.IC = 1), the loop runs at the same speed' part, reading it as 'runs at the same speed as when cache turned off'. As David says, this makes sense - running from cache should result in uniform execution time as cache provides instructions for the processor with no latency, as long as the bulk of the code fits into the cache and FLASH fetches are avoided entirely.

    When FLASH fetches are involved, as when the cache is off, execution is significantly impacted by the number of FLASH reads needed to execute the bulk (one pass through the timing loop in this case), and of course by the FLASH read latency (which is what I suggested as one of the variables to play with; note that you can freely use higher latency with a lower system clock, but not the other way round). The FLASH row is 128 or 256 bits in width (see the FLASH chapter in the RM); and while there's no prefetch/ART on the AHB (i.e. non-TCM) port of it, I believe this whole row is latched upon a read into any part of it, and then served if a portion of the same row is read subsequently (even though this is not documented, it would be rather foolish to repeat a FLASH row read... but one never knows of course -- ST may be willing to comment on this). Subsequently, instruction fetches are transferred through a 64-bit AHB bus into the AXIM-to-AHB interface and then to the AXIM port of the processor. The Cortex-M7 features a prefetch unit, 64 bits wide, with a 4x64-bit pipeline; and there's also a speculative jump. The Thumb-2 instructions are 16 or 2x16 bits wide... and the processor core is dual-issue superscalar... And then there may be data embedded in the code, such as 32-bit literals... so there are plenty of opportunities for caching-like behaviour at various stages between FLASH and processor; and this of course all depends on the particular alignment of things...

    The details are thus too numerous to be easily understood or captured in a simple model. There's a lot of handwaving when it comes to the details in the 32-bitters.

    I always say the 32-bitters are NOT microcontrollers, they are SoCs; but from a marketing point of view, SoC is not a sexy enough word...

    JW

    jerry2 (Author)
    Associate III
    July 19, 2017
    Posted on July 19, 2017 at 19:50

    Yeah, there's obviously a lot going on even with the instruction cache turned off. Since everything runs without variance (and much faster too) with the instruction cache turned on, I think I'll let sleeping dogs lie and not pursue this any further.

    Thanks for all your comments.