True performance of STM32?

lspr35
Associate II
Posted on November 17, 2008 at 13:39

#stm32
53 REPLIES
16-32micros
Associate III
Posted on May 17, 2011 at 12:19

The original post was too long to process during our migration. Please click on the provided URL to read the original post. https://st--c.eu10.content.force.com/sfc/dist/version/download/?oid=00Db0000000YtG6&ids=0680X000006I6Xu&d=%2Fa%2F0X0000000bqL%2Fka7Fxmu0ALYtcn0p.PkTV6tiiJGwzgqA4OXngir5xIw&asPdf=false
lspr35
Associate II
Posted on May 17, 2011 at 12:19

I am interested in the true performance (e.g. in MIPS) of the STM32. The datasheet indicates:

- "72 MHz maximum frequency"

- "1.25 DMIPS/MHz at 0 wait state memory access"

This suggests that the STM32 delivers approx. 90 MIPS. I am not sure if this is really true. Page 32 of the Reference Manual indicates that the internal flash must be operated with 2 wait states when running at 72 MHz. On the other hand, the internal flash seems to have a data bus width of 64 bits. Since many Thumb instructions are 16 bits wide, four instructions could be fetched in one memory access.
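The 90 figure is simply the product of the two datasheet numbers. A minimal sketch of that arithmetic follows; the function name `peak_dmips` is made up for illustration, and the result only applies under the datasheet's stated condition of 0 wait state memory access.

```c
#include <assert.h>

/* Peak Dhrystone throughput: clock in MHz times the datasheet DMIPS/MHz
   figure. Hypothetical helper; valid only for 0 wait state memory access. */
double peak_dmips(double clock_mhz, double dmips_per_mhz)
{
    assert(clock_mhz > 0.0 && dmips_per_mhz > 0.0);
    return clock_mhz * dmips_per_mhz;
}
```

Calling `peak_dmips(72.0, 1.25)` reproduces the 90 quoted above. It is a best-case number for the quoted condition, not a measurement.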

So, what is the resulting performance of the STM32? Does anybody have a direct benchmark against other controllers? Are there any other issues which might affect the performance?

Regards

Squonk

cosmapa
Associate II
Posted on May 17, 2011 at 12:19

I have written code to toggle GPIO bits as fast as possible. The disassembly shows that it is done using 3 instructions to set, plus 3 to reset, an output on port C. Since the toggling takes place in a while(1) loop, there is an extra branch instruction in the disassembled code.

Writing a bit on the GPIO port is done using bit banding, so it is a simple direct write to a fixed memory location.
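For readers unfamiliar with bit banding, here is a minimal sketch of how such a fixed alias address is derived on a Cortex-M3. The helper name `bitband_alias` is made up for illustration. Reading the literal 0x4222019C from the disassembly as bit 7 of the register at 0x4001100C (GPIOC ODR) assumes the STM32F10x memory map; it is not something the post states.

```c
#include <assert.h>
#include <stdint.h>

/* Cortex-M3 peripheral bit-band region (1 MB) and its alias region. */
#define BB_PERIPH_BASE   0x40000000u
#define BB_PERIPH_ALIAS  0x42000000u

/* Return the alias word address that maps to a single bit of a peripheral
   register. Each bit gets its own 32-bit word in the alias region, so a
   byte offset is scaled by 32 and a bit number by 4. */
uint32_t bitband_alias(uint32_t reg_addr, uint32_t bit)
{
    assert(reg_addr >= BB_PERIPH_BASE && reg_addr < BB_PERIPH_BASE + 0x100000u);
    assert(bit < 32u);
    return BB_PERIPH_ALIAS + (reg_addr - BB_PERIPH_BASE) * 32u + bit * 4u;
}
```

Under that assumption, `bitband_alias(0x4001100Cu, 7u)` yields 0x4222019C, the value loaded by the LDR instructions in the listing.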

I would have thought that at 72 MHz, with an instruction period of 14 ns, 3 instructions would give a bit change every 42 ns when operating at one instruction per clock cycle.

On the scope I measure 166 ns (output high) and 181 ns (output low). The difference between the two is the added branch instruction, which takes 15 ns (exactly one clock cycle, as advertised).

Where does the 4x difference come from in the bit toggling part? I can see that the APB is running slower, but the APB access only occurs during one of the 3 instructions.
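To put the scope readings in terms of core clock cycles, a small conversion helps. This is only a sketch: `ns_to_cycles` is a made-up helper name, and the result is no more accurate than the scope readings themselves.

```c
#include <assert.h>

/* Convert a measured duration in nanoseconds into clock cycles at the
   given clock frequency in Hz. Hypothetical helper for illustration. */
double ns_to_cycles(double duration_ns, double clock_hz)
{
    assert(duration_ns >= 0.0 && clock_hz > 0.0);
    return duration_ns * 1e-9 * clock_hz;
}
```

At 72 MHz, 166 ns comes out at just under 12 cycles and 181 ns at about 13 cycles, i.e. roughly 4 cycles for each of the 3 instructions in the set phase.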

Assembly code is below:

SET_ENC_SEL;

08000738 4803 LDR R0, [PC,#0x00C] ; [0x8000748] =0x4222019C

0800073A 2101 MOVS R1, #0x1

0800073C 6001 STR R1, [R0, #0]

RES_ENC_SEL;

0800073E 4802 LDR R0, [PC,#0x008] ; [0x8000748] =0x4222019C

08000740 2100 MOVS R1, #0x0

08000742 6001 STR R1, [R0, #0]

08000744 E7F8 B 0x8000738

lspr35
Associate II
Posted on May 17, 2011 at 12:19

I made a similar experiment:

I wrote a program which executes 28 instructions: 18 two-word instructions and 10 single-word instructions. This code takes approximately 1us to execute, which is an average execution time of 38ns per instruction.

If I set the number of wait states to 1 (which is not allowed for 72MHz), the execution time drops to approximately 0.85us. This is an average execution time of approximately 30ns, so it is still far away from 14ns.

IMO this experiment shows that the flash access time is limiting the performance of the controller. But I still don't understand why: the flash data bus is 64 bits wide, so two 32-bit instructions are fetched at a time. With two wait states (as necessary for 72MHz) this should only increase the execution time of 32-bit instructions by a factor of 1.5, and for 16-bit instructions there should be no increase.
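The expectation in the last paragraph can be written down as a simple fetch-bandwidth model. This is a sketch of that reasoning only, not a description of how the STM32 flash interface actually behaves. It assumes every flash read takes 1 + wait-state cycles and delivers 64 bits, that code runs sequentially with no branches or data accesses, and that the core never exceeds one instruction per cycle. The function names are made up for illustration.

```c
#include <assert.h>

/* Cycles per instruction if flash fetch bandwidth were the only limit.
   One flash read takes (1 + wait_states) cycles and delivers bus_bits. */
double fetch_cycles_per_instr(unsigned bus_bits, unsigned wait_states,
                              unsigned instr_bits)
{
    assert(instr_bits > 0u && bus_bits >= instr_bits);
    double instrs_per_read = (double)bus_bits / (double)instr_bits;
    double cycles_per_read = 1.0 + (double)wait_states;
    double cpi = cycles_per_read / instrs_per_read;
    return (cpi < 1.0) ? 1.0 : cpi;  /* assumed cap: one instruction/cycle */
}

/* Predicted run time in ns for a mix of 32-bit and 16-bit instructions
   fetched over an assumed 64-bit flash bus. */
double predicted_ns(unsigned n32, unsigned n16, unsigned wait_states,
                    double clock_hz)
{
    assert(clock_hz > 0.0);
    double cycles = n32 * fetch_cycles_per_instr(64u, wait_states, 32u)
                  + n16 * fetch_cycles_per_instr(64u, wait_states, 16u);
    return cycles / clock_hz * 1e9;
}
```

With 2 wait states the model gives 1.5 cycles per 32-bit instruction and 1 cycle per 16-bit instruction, matching the figures above. For the 18 + 10 instruction mix at 72 MHz it predicts about 0.51us, roughly half of the measured 1us, so under these assumptions fetch bandwidth alone does not account for the measurement.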

Is it possible that I made a mistake in initializing the RCC?

Regards

Squonk

per3
Associate II
Posted on May 17, 2011 at 12:19

I noticed that with a short loop at 72 MHz and 2 wait states I get about 50 ns per instruction. If I make a long loop with repeated on/off writes,

GPIOx = GPIOA;

while(1){

GPIOx->BSRR = GPIO_Pin_10;

GPIOx->BRR = GPIO_Pin_10;

GPIOx->BSRR = GPIO_Pin_10;

GPIOx->BRR = GPIO_Pin_10;

GPIOx->BSRR = GPIO_Pin_10;

GPIOx->BRR = GPIO_Pin_10;

GPIOx->BSRR = GPIO_Pin_10;

GPIOx->BRR = GPIO_Pin_10;

GPIOx->BSRR = GPIO_Pin_10;

GPIOx->BRR = GPIO_Pin_10;

GPIOx->BSRR = GPIO_Pin_10;

GPIOx->BRR = GPIO_Pin_10;

GPIOx->BSRR = GPIO_Pin_10;

GPIOx->BRR = GPIO_Pin_10;

GPIOx->BSRR = GPIO_Pin_10;

GPIOx->BRR = GPIO_Pin_10;

GPIOx->BSRR = GPIO_Pin_10;

GPIOx->BRR = GPIO_Pin_10;

GPIOx->BSRR = GPIO_Pin_10;

GPIOx->BRR = GPIO_Pin_10;

GPIOx->BSRR = GPIO_Pin_10;

GPIOx->BRR = GPIO_Pin_10;

GPIOx->BSRR = GPIO_Pin_10;

GPIOx->BRR = GPIO_Pin_10;

GPIOx->BSRR = GPIO_Pin_10;

GPIOx->BRR = GPIO_Pin_10;

}

then the low and high times are each 25 ns.

Each GPIOx->BRR write results in a single STR instruction.

The strange part, which I have not figured out, is that when I run from HSI at 8 MHz with zero wait states I get 250 ns, as if the clock were running at 4 MHz???

lspr35
Associate II
Posted on May 17, 2011 at 12:19

Hi espresso_solo,

the 25ns is still not the execution time I would expect. Setting and resetting a GPIO pin is probably a 16-bit instruction, so these instructions should execute in 13.888ns (=1/72MHz) and not 25ns. Maybe the value you measured was 27.777ns (=2/72MHz)?

But anyhow, IMO the 16-bit instructions should execute within one cycle, because the flash has a 64-bit data bus.

Concerning your question: what is your SYSCLK frequency? Do you use the PLL multiplier (PLLMUL)?

Regards

Squonk

lspr35
Associate II
Posted on May 17, 2011 at 12:19

Hi,

so in this case SYSCLK will be 8MHz and I would expect an instruction execution time of 125ns.

On the other hand: with HSI the clock frequency is lower than 72MHz by a factor of 9, and 9 x 27.777ns = 250ns! Voila! But I don't understand it. This would mean that the Cortex is a two-cycle architecture, but it is not! Something is strange here!

Regards

Squonk

per3
Associate II
Posted on May 17, 2011 at 12:19

Hi,

Regarding the RCC setup, I leave everything as it is after reset at startup, i.e. HSI should be selected and running, SYSCLK should be HSI at full speed, and the PLL is not selected.

Try it yourself, and see if you get similar results.

per3
Associate II
Posted on May 17, 2011 at 12:19

Hi,

I tried this,

GPIOx = GPIOA;

while(1){

GPIOx->BSRR = GPIO_Pin_10;

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

asm (''cmp r1,r2'');

GPIOx->BRR = GPIO_Pin_10;

}

The 8 MHz HSI try gave 2600 ns high

The 72 MHz try gave 320 ns high

21 instructions:

2600/21 = approx 125 ns

320/21 = approx 14 ns

It seems the STR takes 2 cycles for some reason, while the CMP takes one cycle.

It also seems that the 14 ns speed can be achieved with 2 wait states.