
The CPU performance difference between running in the flash and running in the RAM for STM32F407

robin-xiaoyong
Associate II
Posted on June 03, 2014 at 04:19

Hello,

I am trying to measure the performance difference between running from flash and running from RAM on the STM32F407.

I wrote some test code in which the measured function runs either from flash or from RAM.

I find that performance in RAM is about 20% worse than in flash when the CPU frequency is 168 MHz.

The datasheet describes that ''the performance achieved thanks to the ART accelerator is equivalent to 0 wait state program execution from Flash memory at a CPU frequency up to 168 MHz''.

And the datasheet also describes that ''RAM memory is accessed (read/write) at CPU clock speed with 0 wait states''.

Both the RAM and flash memory are accessed with 0 wait states.

Why is the performance in the RAM poorer than in the flash? Is it reasonable?

Remark:

The compiler is IAR Embedded Workbench for ARM 7.10.1.6735, optimization = high.

My source code:

void code_to_be_measured(void);   /* defined below; placed in RAM when PLACE_IN_RAM is set */

int main(void)
{
  /* Initialize LEDs mounted on the STM32F4-Discovery board */
  STM_EVAL_LEDInit(LED4);
  STM_EVAL_LEDInit(LED3);
  STM_EVAL_LEDInit(LED5);
  STM_EVAL_LEDInit(LED6);

  GPIO_PORT[1]->BSRRL = GPIO_PIN[1];   /* pin high: start of measurement */
  code_to_be_measured();
  GPIO_PORT[1]->BSRRH = GPIO_PIN[1];   /* pin low: end of measurement */

  return 0;
}

#ifdef PLACE_IN_RAM
__ramfunc void code_to_be_measured(void)
#else
void code_to_be_measured(void)
#endif
{
  volatile unsigned long l_count, l_count_max = 1000;
  volatile int a, b, s;

  a = 1000;
  b = 2000;
  for (l_count = 0; l_count < l_count_max; ++l_count)
  {
    s = a + b + l_count;
  }
}
Posted on June 03, 2014 at 05:30

The ART has an accelerated prefetch path to the processor; the width of the cache line lets it provide data to the processor with no added delay, while RAM still takes its cycle. The cache is generally faster, but can be unpredictable due to line eviction. Fetching a line from flash takes a hit (based on the wait states), but subsequent words within that line are essentially delay free.

Tips, buy me a coffee, or three.. PayPal Venmo Up vote any posts that you find helpful, it shows what's working..
robin-xiaoyong
Associate II
Posted on June 03, 2014 at 05:44

Hello,

Thank you for your answer.

But the datasheet describes that ''RAM memory is accessed (read/write) at CPU clock speed with 0 wait states''.

Does it mean that RAM can also provide the data to the processor in zero time?

The ART has an accelerated prefetch path to the processor, the width of the cache line can provide the data to the processor in zero time, RAM will take its cycle. The cache is generally faster, but can be unpredictable due to line eviction. Getting the line from flash will take a hit (based on the wait states), but subsequent words within the line are essentially delay free.

stm322399
Senior
Posted on June 03, 2014 at 07:49

My guess is that when you run from flash, you allow concurrent accesses: the I-bus to flash and the D-bus to SRAM. This gives a slight advantage over the situation where the I and D buses compete for the same SRAM.

To make this clear, it would be interesting to compare flash vs. SRAM execution while the data lies in CCM.
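Since the original poster uses IAR, the experiment above could be set up roughly as follows. This is a sketch, not verified against a specific IAR version: the section name `.ccmram` and the region definition are illustrative and must match the project's actual `.icf` linker configuration.

```c
/* Place the benchmark's data in CCM RAM (0x10000000 on the F407) so the
 * D-bus fetches it there while the I-bus fetches code from flash or SRAM.
 *
 * Requires a matching region in the .icf linker configuration, e.g.:
 *   define region CCMRAM = mem:[from 0x10000000 size 64K];
 *   place in CCMRAM { section .ccmram };
 */
#pragma location = ".ccmram"
static volatile int bench_data[3];   /* would hold a, b, s from the test loop */
```

Note that CCM on the F407 is reachable only by the CPU's D-bus, not by DMA or the I-bus, which is exactly what isolates data traffic from instruction fetches in this comparison.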

Posted on June 03, 2014 at 09:19

<LINK NO LONGER ACTIVE>

Points to the benchmark on pages 41 onward in 'STM32 Technical Updates Issue 1' from the sticky post on top of this forum. [EDIT] see https://community.st.com/s/question/0D50X00009Xkba4SAB/stm32-technical-updates and attachments there [/EDIT] JW

[EDIT 2018 linkfix]

https://community.st.com/0D50X00009XkaPCSAZ

 [/EDIT]

[EDIT another linkfix later in 2018]

https://community.st.com/s/question/0D50X00009XkaPCSAZ/stm32f40x-168mhz-wait-states-and-execution-from-ram

[/EDIT]

Posted on June 03, 2014 at 12:30

But the datasheet describes that ''RAM memory is accessed (read/write) at CPU clock speed with 0 wait states''. Does it mean that RAM can also provide the data to the processor in zero time?

No, it means that no additional cycles beyond those normally required are needed. The CPU still has to access the RAM array via the normal access sequence. The cache, on the other hand, already has the data locally and just requires some gating logic to direct it to the core: an expedited path that saves at least one cycle over a fetch from physical memory (RAM, ROM, flash), and one that is not contended by DMA operations.