Load multiple instruction timing

gioacchino03 · ‎2009-03-10

Posted on March 10, 2009 at 13:57

gioacchino03 · ‎2011-05-17

Posted on May 17, 2011 at 13:05

Hello!

I have noticed some differences between the Cortex-M3 theoretical performance and the STM32's actual performance when executing load instructions...

The application note about the STM32 DMA, available at this link:

http://www.st.com/stonline/products/literature/anp/13529.htm

says that the CPU has to pay one AHB cycle when executing a load instruction, due to bus arbitration with the DMA, so it seems that every load instruction takes a minimum of three CPU cycles:

1 AHB cycle (arbitration) + 2 CPU cycles (execution)

The Cortex-M3 can execute a load multiple instruction in (nReg + 1) cycles, where nReg is the number of registers loaded. In the STM32, due to bus arbitration, this value would be (2 x nReg + 1) cycles instead.

Can anyone confirm my idea?

16-32micros · ‎2011-05-17

Posted on May 17, 2011 at 13:05

Hi gioacchino03,

Quote:

.... so it seems that every load instruction takes a minimum of three CPU cycles: 1 AHB cycle (arbitration) + 2 CPU cycles (execution)

The Cortex-M3 can execute a load multiple instruction in (nReg + 1) cycles, where nReg is the number of registers loaded. In the STM32, due to bus arbitration, this value would be (2 x nReg + 1) cycles instead.

Could you let me know how you did the calculation? I suggest you check it on real STM32 silicon: you can run a small piece of assembly code containing some load multiple instructions and then share the cycle timings with us.

Cheers,

STOne-32.

gioacchino03 · ‎2011-05-17

Posted on May 17, 2011 at 13:05

Hello,

Thanks for your interest... My previous post was based on a simulation with Keil uVision. Now I have done some tests on real hardware:

My settings are:

* external crystal 4 MHz

* PllMul = 16

* Pre-fetch buffer with 2 wait states

* AHBPRE = 1

This leads to:

* core cycle time = 1 / (4 MHz * 16) = 15.6 ns

The STM32 took about 11 us to execute 100 instructions like this:

LDMIA SP,{R2,R3,R4,R5,R6,R7}

so each LDM instruction takes (11 us / 100) = 110 ns, which is about 7 x 15.6 ns!

Conclusion: the minimum number of cycles (DMA disabled) to transfer n words from memory to registers is n + 1! :D

16-32micros · ‎2011-05-17

Posted on May 17, 2011 at 13:05

Hi,

Excellent job! So you have confirmed that the STM32 behaves exactly as stated in the instruction timing theory defined for the ARM Cortex-M3 core :) even with the flash at 2 wait states and the CPU running at 64 MHz. Let's try now with the DMA enabled.

Cheers,

STOne-32.