uClinux and STM32F4 - performance?

robert23
Associate
Posted on December 09, 2013 at 15:50

Hi,

I am interested in running Linux (uClinux) on the Cortex-M4, namely on one of the MCUs in the STM32F4 series. These MCUs have I-code/D-code caches for flash access as part of the ART accelerator, but the internal flash is clearly not nearly large enough for uClinux; executing from external RAM is a must. However, I am a bit worried about the performance implications of doing so, since external RAM appears to be completely uncached.

That said, external RAM can be remapped to the code space (< 0x20000000), and the RM0090 reference manual states that "In remap mode, the CPU can access the external memory via ICode bus instead of System bus which boosts up the performance." Does this mean it allows cached instruction fetches from the external RAM (through the I-code bus)? Or is it simply meant to avoid arbitration on the S-bus when, e.g., the internal SRAM is in use at the same time? I guess the latter.

#uclinux #stm32f4 #cache
3 REPLIES
Posted on December 09, 2013 at 17:27

There is no intrinsic caching in the M3/M4; the ART is a kludge in front of the wide/slow flash array that delivers data to the prefetcher quickly. It has no utility for other data sources.

On the 429 with SDRAM mapped into executable space the execution is 6x slower than from internal RAM.

Consider if you can use a 2MB flash variant, and execute-in-place, or if some other architecture would be better suited to the goal.

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..
beedward
Associate
Posted on February 15, 2014 at 02:08

Clive,

There are two different remapping possibilities for the SDRAM on the STM32F429, both set via SYSCFG_MEMRMP. Bits 11:10, when set to 01, swap the SDRAM banks and NAND Bank 2/PCCARD. SDRAM Bank 1 is mapped at NAND Bank 2 (0x8000 0000) and NAND Bank 2 is mapped at 0xC000 0000.

Bits 2:0, when set to 100, map the SDRAM to 0x0000 0000.
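For anyone who wants to try both remaps, here is a minimal sketch of how the two fields combine into a register value. It only uses the bit positions and values given above; the macro and function names are made up for illustration and are not the ones in ST's headers. It works on a plain integer so it can be checked on a PC.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout as described in this post for SYSCFG_MEMRMP on the STM32F429.
 * All names below are hypothetical, chosen for this sketch only. */
#define MEMRMP_MODE_SHIFT       0u
#define MEMRMP_MODE_MASK        (0x7u << MEMRMP_MODE_SHIFT)  /* bits 2:0   */
#define MEMRMP_MODE_SDRAM       0x4u   /* 0b100: SDRAM mapped at 0x0000 0000 */

#define MEMRMP_SWP_SHIFT        10u
#define MEMRMP_SWP_MASK         (0x3u << MEMRMP_SWP_SHIFT)   /* bits 11:10 */
#define MEMRMP_SWP_SDRAM_NAND   0x1u   /* 0b01: swap SDRAM and NAND Bank 2/PCCARD */

/* Replace one field of reg with value, leaving all other bits untouched. */
static uint32_t memrmp_set_field(uint32_t reg, uint32_t mask,
                                 uint32_t shift, uint32_t value)
{
    /* The value must fit inside the field it is written to. */
    assert(((value << shift) & ~mask) == 0u);
    return (reg & ~mask) | (value << shift);
}

/* Bits 2:0 = 100: SDRAM appears at 0x0000 0000. */
static uint32_t memrmp_map_sdram_at_zero(uint32_t reg)
{
    return memrmp_set_field(reg, MEMRMP_MODE_MASK,
                            MEMRMP_MODE_SHIFT, MEMRMP_MODE_SDRAM);
}

/* Bits 11:10 = 01: SDRAM banks swapped with NAND Bank 2/PCCARD. */
static uint32_t memrmp_swap_sdram_nand(uint32_t reg)
{
    return memrmp_set_field(reg, MEMRMP_SWP_MASK,
                            MEMRMP_SWP_SHIFT, MEMRMP_SWP_SDRAM_NAND);
}
```

On the target this would be a read-modify-write of the register (SYSCFG->MEMRMP in the CMSIS headers) with the SYSCFG clock enabled first. Applying both helpers to a cleared register gives 0x404.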

Looking at the SDRAM controller and the M4 architecture, code executing out of SDRAM mapped to 0x0000 0000 (with the memory bus clock at 1/2 the CPU clock, and CAS latency 2) ought to run about 1/3 to 1/4 as fast as code from the internal Flash (1/2 for purely sequential, all 16-bit instructions...). You're seeing 1/6, and I'm wondering why. I know there aren't any caches, but sequential accesses to the SDRAM ought to be happening every memory bus clock cycle. Maybe you have to do both remapping steps for some reason? I don't have hardware to try this on yet, but I'm curious to see what happens if you apply the SDRAM/NAND remap along with the boot memory remap.

Posted on February 15, 2014 at 06:20

Pretty sure the SDRAM access is optimized for the display function, not code execution.

What board are you planning to try this on?

Do you have some code that would be a representative benchmark of code execution in various memory types? Can you furnish that in 16-bit word format?

    static const uint16_t Code[] = { 0xF04F, 0x007B, 0x4770 }; // payload: MOV.W R0, #0x7B; BX LR
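A payload in that form could be dropped into whichever memory is under test and called through a function pointer. The following is only a rough sketch of that idea, not a finished benchmark: the helper names (payload_copy, thumb_entry, payload_run) are hypothetical, the destination buffer is assumed to be halfword-aligned and executable, and the part that actually jumps into the copy is only compiled for Thumb targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same halfwords as above: MOV.W R0, #0x7B ; BX LR */
static const uint16_t Code[] = { 0xF04F, 0x007B, 0x4770 };

typedef int (*payload_fn)(void);

/* Copy the payload into dst (e.g. a buffer in SDRAM or internal SRAM).
 * Returns the number of bytes copied. */
static size_t payload_copy(void *dst, size_t dst_size)
{
    assert(dst != NULL && dst_size >= sizeof Code);
    memcpy(dst, Code, sizeof Code);
    return sizeof Code;
}

/* Cortex-M only executes Thumb code, so a branch target must have bit 0 set. */
static uintptr_t thumb_entry(const void *code)
{
    return (uintptr_t)code | (uintptr_t)1;
}

#if defined(__thumb__)
/* Target-only part: copy the payload, then call it in place. */
static int payload_run(void *dst, size_t dst_size)
{
    payload_copy(dst, dst_size);
    /* Make sure the copy has completed before fetching from it. */
    __asm volatile ("dsb\n\tisb" ::: "memory");
    return ((payload_fn)thumb_entry(dst))();
}
#endif
```

Timing repeated calls to payload_run() with the buffer placed in each memory region would give a first comparison; a real benchmark would need a much longer, representative payload.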
