STM32F4: How can I move code into fast RAM?

Lars Beiderbecke · ‎2017-12-31

Posted on December 31, 2017 at 10:23

So far I've been using the STM32F722, but I'm planning to switch to either STM32F446 or STM32F429.

On the F722, I moved speed-critical code to ITCM RAM. The F4s don't have ITCM RAM, but they do have an ICache and/or CCM. How do I move code to any fast RAM and execute it with 0 wait states on the F4? (It seems like the caches are meant to ensure 0 wait state execution, but I'd like to be sure.)

#fast-ram

Tesla DeLorean · ‎2017-12-31

Posted on December 31, 2017 at 14:11

The ART cache on the F4 delivers 0-cycle (i.e. same-cycle) access, which is not the same thing as zero wait states. The SRAM and CCM are 1-cycle.

If you want to run from RAM, copy the code into the SRAM based at 0x20000000 and jump to it there. You could use pragmas or attributes to get the linker to place functions there automatically.
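
A minimal sketch of the attribute route with GCC, in case it helps. Everything named here is an assumption for illustration: the `RAMFUNC` macro, the `.ramfunc` section name, the `_siramfunc`/`_sramfunc`/`_eramfunc` linker symbols, and the `RAM`/`FLASH` region names must match your own linker script. Some ST-generated linker scripts already collect a RAM-function section into `.data`, in which case the startup code does the copy for you, so check yours first. Also verify in the reference manual whether CCM on your part can be used for code at all; as far as I know it is on the data bus only on the F4, which is why this targets main SRAM.

```c
#include <stdint.h>

/* Assumed linker script fragment (GNU ld), shown only as a comment:
 *
 *   .ramfunc :
 *   {
 *       . = ALIGN(4);
 *       _sramfunc = .;
 *       *(.ramfunc*)
 *       . = ALIGN(4);
 *       _eramfunc = .;
 *   } >RAM AT> FLASH
 *   _siramfunc = LOADADDR(.ramfunc);
 */

/* RAMFUNC and ".ramfunc" are illustrative names, not an ST or GCC convention. */
#if defined(__GNUC__) && defined(__arm__)
#define RAMFUNC __attribute__((section(".ramfunc"), noinline, long_call))
#else
#define RAMFUNC /* non-ARM build: normal placement, so the sketch still compiles */
#endif

/* Example of a speed-critical routine tagged for placement in SRAM. */
RAMFUNC uint32_t sum_u32(const uint32_t *data, uint32_t n)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; ++i) {
        acc += data[i];
    }
    return acc;
}

/* Word-wise copy of the load image in flash to its run address in SRAM.
 * On target, call it early in startup, before the first RAMFUNC call, e.g.
 *
 *   extern uint32_t _siramfunc, _sramfunc, _eramfunc;  // assumed symbols
 *   copy_words(&_sramfunc, &_siramfunc, (uint32_t)(&_eramfunc - &_sramfunc));
 */
void copy_words(uint32_t *dst, const uint32_t *src, uint32_t n_words)
{
    while (n_words--) {
        *dst++ = *src++;
    }
}
```

The `long_call` attribute is there because flash and SRAM are too far apart for a plain branch-with-link; `noinline` keeps the compiler from folding the function back into a caller that lives in flash.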

Lars Beiderbecke · ‎2017-12-31

Posted on December 31, 2017 at 15:09

Thanks for your clarification. Is there a way to pre-populate the cache with some functions, or 'pin' them so that they won't be evicted from the cache?

Tesla DeLorean · ‎2017-12-31

Posted on December 31, 2017 at 15:47

No, there is no way to pre-load or pin lines; the ART is bolted onto the side of the CM3/CM4. Code tends to run marginally faster from FLASH: the ~35 ns cost of loading a 128-bit cache line is charged to the first word you read within the line, and the rest of the line (four 32-bit words in total, or up to eight 16-bit Thumb instructions) is then available in the same cycle. That can outpace SRAM, especially if there is DMA contention.

The ART gives slightly variable execution speed, but it is usually as fast as SRAM and often better. The variability with SRAM would come from DMA contention, and from reads of slower APB-based peripherals, i.e. something that takes 4 cycles and will stall the pipeline.

For speed, use assembler, keep data in registers, and write long runs of linear code, i.e. unroll loops, use IT instructions, and don't call subroutines.
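
As a rough sanity check on the ~35 ns figure: if a flash line fetch is modeled as (wait states + 1) HCLK cycles, then 5 wait states at 168 MHz gives about 35.7 ns. The clock, the wait-state count, and the model itself are assumptions made for this sketch, not numbers stated in the post, and `flash_line_fetch_ns` is a hypothetical helper name.

```c
#include <stdint.h>

/* Time in nanoseconds to fetch one flash line, modeled as
 * (wait_states + 1) HCLK cycles. A simplified model for illustration only. */
double flash_line_fetch_ns(uint32_t hclk_hz, uint32_t wait_states)
{
    return (double)(wait_states + 1u) * 1e9 / (double)hclk_hz;
}
```

For example, `flash_line_fetch_ns(168000000u, 5u)` is 6 cycles of roughly 5.95 ns each.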
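
To illustrate just the unrolling part of that advice, here is a sketch in C rather than assembler. `sum_u32_unrolled` is a made-up example routine, and the unroll factor of four is arbitrary; whether it actually beats what the compiler generates at `-O2` or `-O3` is something to measure on the target.

```c
#include <stdint.h>

/* Sum n words, four per loop iteration, so there are fewer branches and
 * longer straight-line runs. A scalar tail loop handles the leftovers. */
uint32_t sum_u32_unrolled(const uint32_t *data, uint32_t n)
{
    uint32_t acc = 0;
    uint32_t i = 0;

    for (; i + 4u <= n; i += 4u) {
        acc += data[i];
        acc += data[i + 1u];
        acc += data[i + 2u];
        acc += data[i + 3u];
    }
    for (; i < n; ++i) {
        acc += data[i];
    }
    return acc;
}
```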
