True performance of STM32?

lspr35 · ‎2008-11-17

Posted on November 17, 2008 at 13:39

#stm32

lspr35 · ‎2011-05-17

Posted on May 17, 2011 at 12:19

Hi

Your result is very interesting. Do you know whether the STR in your previous example is a 16-bit or a 32-bit instruction?

Regards

Squonk

lspr35 · ‎2011-05-17

Posted on May 17, 2011 at 12:19

Hello,

I made a similar test with 40 consecutive asm() instructions with the following results:

asm("adds r1, r2, #1") with 2 wait states: approx. 16 ns

asm("adds r1, r2, #1") with 1 wait state: approx. 14 ns

asm("mul r1, r2, r3") with 2 wait states: approx. 23 ns

asm("mul r1, r2, r3") with 1 wait state: approx. 16 ns

Evidently the 14 ns can be achieved with 16-bit instructions (adds), whereas the 32-bit instruction (mul) takes 1.5 times 14 ns. My time measurements are not very precise, since I do not have an oscilloscope available at the moment.
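As a rough cross-check of these figures, the measured times can be converted into core clock cycles. The sketch below assumes the 72 MHz core clock mentioned elsewhere in this thread; the function names are made up for illustration and are not part of any ST or ARM library.

```c
#include <assert.h>

/* Core clock period in nanoseconds for a clock given in MHz.
 * Assumption: the core runs at the 72 MHz quoted in this thread. */
double cycle_ns(double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return 1000.0 / clock_mhz;
}

/* Convert a measured per-instruction time (ns) into core clock cycles. */
double ns_to_cycles(double measured_ns, double clock_mhz)
{
    return measured_ns / cycle_ns(clock_mhz);
}
```

With a 72 MHz clock one cycle is about 13.9 ns, so ns_to_cycles(14.0, 72.0) comes out at roughly 1.0 cycle, 16 ns at roughly 1.15 cycles and 23 ns at roughly 1.66 cycles.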

I also think that other effects have an impact on instruction execution times (e.g. RAM access, ...).

Detailed information about instruction execution times would be very interesting. The ST documentation is not very verbose on this topic. Is it possible to find better information in the Cortex documentation?

Regards

Squonk

ivan239955_stm1_st · ‎2011-05-17

Posted on May 17, 2011 at 12:19

I think that it is difficult to measure instruction timing unless you use assembler. With a C compiler you don't know what you get unless you disassemble the output.

Most Cortex-M3 instructions are executed in 1 cycle.

STR and LDR usually take 2 cycles (1 when pipelined, e.g. several LDR or STR instructions in sequence).

STR Rx,[Ry,#imm] is always one cycle.
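To make the pipelining rule concrete, here is a minimal sketch of the cycle count for a run of back-to-back LDR/STR instructions. It is a simplification of the rule stated above (2 cycles for the first access, 1 for each pipelined follower) and ignores the one-cycle STR Rx,[Ry,#imm] case, wait states and bus contention; the function name is hypothetical.

```c
#include <assert.h>

/* Estimated cycles for a run of n consecutive LDR/STR instructions,
 * assuming the first takes 2 cycles and each following one pipelines
 * behind it in 1 cycle. This is a model of the rule quoted in the post,
 * not a measured result. */
unsigned ldr_str_run_cycles(unsigned n)
{
    assert(n > 0); /* a run has at least one instruction */
    return n + 1;
}
```

Under this model a single LDR costs 2 cycles, while 20 consecutive ones cost 21 cycles, i.e. close to 1 cycle each.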

A few notes on documents with useful instruction timing and descriptions:

CortexM3_TechRefMan_r1p1_trm.pdf

(infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/DDI0337E_cortex_m3_r1p1_trm.pdf )

The most useful thing here is the instruction timing: pages 18-1 to 18-8.

DDI0405B_arm_v7m_architecture_app_level_reference_manual.pdf

(infocenter.arm.com/help/topic/com.arm.doc.ddi0405b/index.html )

It contains a detailed description of every instruction: pages A6-1 to A6-276.

Hopefully this helps...

Ivan

cosmapa · ‎2011-05-17

Posted on May 17, 2011 at 12:19

GPIOA->BSRR = GPIO_Pin_10;

GPIOA->BRR = GPIO_Pin_10;

compile to 3 instructions each using the IAR toolset.

Which compiler/toolset are you using to get each statement compiled to a single ARM instruction, as stated in the earlier post?

Thanks

per3 · ‎2011-05-17

Posted on May 17, 2011 at 12:19

Hi Mr Swiss.

Remember that it was one STR instruction per C line when *repeatedly* used (20 lines or so), as I described earlier in this thread.

Guess which compiler/toolchain I use!

(the one recommended by ST)

per3 · ‎2011-05-17

Posted on May 17, 2011 at 12:19

Actually, on a second look, it seems ST recommends a whole bundle of toolchains.

I use ST/Raisonance RLink with RIDE 7 and gcc.

16-32micros · ‎2011-05-17

Posted on May 17, 2011 at 12:19

Dear all,

Today, the best compiler for STM32 (Cortex-M3) in terms of execution speed is the ARM/Keil one, using aggressive time optimization (-Otime with -O2 or -O3).

For GPIO toggle speed, we can reach a pin toggle rate of up to 18 MHz when running from Flash with 2 wait states at 72 MHz, with code like:

STR (2 CPU cycles)

...

If you insert a jump or a branch, it will of course take about 3 cycles in a row and stall/flush the 3-stage pipeline of the core and the Flash accelerator.

To increase the performance of such code, you can force inlining and loop unrolling using compiler options. Thank you :)
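The 18 MHz figure follows from the numbers above: one full pin period needs two stores (one to set, one to clear), and at 2 cycles per STR that is 4 core cycles per period. A minimal sketch of that arithmetic, assuming a fully unrolled sequence with no branches (the function name is hypothetical):

```c
#include <assert.h>

/* Pin toggle frequency in MHz for an unrolled run of alternating
 * set/clear stores. One pin period needs two stores, so the period
 * is 2 * cycles_per_store core cycles. Branch overhead is ignored. */
double toggle_mhz(double clock_mhz, unsigned cycles_per_store)
{
    assert(clock_mhz > 0.0);
    assert(cycles_per_store > 0);
    return clock_mhz / (2.0 * cycles_per_store);
}
```

For a 72 MHz clock and 2 cycles per store, toggle_mhz(72.0, 2) gives 18 MHz.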

STOne-32

jas · ‎2011-05-17

Posted on May 17, 2011 at 12:19

Quote:

I use ST/Raisonance RLink with RIDE 7 and gcc.

You're from Sweden and you don't use IAR's EWARM ?

Shame on you ! :-]

16-32micros · ‎2011-05-17

Posted on May 17, 2011 at 12:19

Hi Arm_Wrestler, all

IAR EWARM also produces very good code in terms of execution speed versus size.

For GNU, here is an interesting discussion:

http://www.st.com/mcu/forums-cat-5951-23.html

:)

cosmapa · ‎2011-05-17

Posted on May 17, 2011 at 12:19

I am still at a loss regarding the slow GPIO toggle speed that I see.

If I set and reset bit 7 of port C with the code below

GPIOC->BSRR = GPIO_Pin_7; ( = 3 assembly instructions)

GPIOC->BRR = GPIO_Pin_7; ( = 3 assembly instructions)

I measure a pulse that stays high for precisely 124.4ns (+/- 0.2ns)

If I now insert 10 x asm("cmp r1,r2"); between the set and the reset, I measure precisely 235.5 ns (+/- 0.2 ns). This means that each cmp instruction executed in about 11 ns. This is a bit faster than expected for single-cycle execution at 72 MHz, but it is within range.

The part that confuses me is the 124 ns needed to execute the 3 instructions that reset the GPIO bit. This is a huge discrepancy. Where could it come from?
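For reference, the arithmetic behind these numbers can be written out as follows. This is only a sketch using the measured values from this post and the 72 MHz clock; the helper names are hypothetical.

```c
#include <assert.h>

/* Time per inserted instruction, from the pulse width measured with and
 * without the extra instructions between the set and the reset. */
double per_instruction_ns(double with_ns, double baseline_ns, unsigned inserted)
{
    assert(inserted > 0);
    return (with_ns - baseline_ns) / inserted;
}

/* Number of core clock cycles that fit into a time span given in ns. */
double cycles_at(double span_ns, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return span_ns * clock_mhz / 1000.0;
}
```

per_instruction_ns(235.5, 124.4, 10) gives about 11.1 ns per cmp, while cycles_at(124.4, 72.0) shows that the bare pulse lasts roughly 9 core cycles, which is the gap in question for 3 instructions.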