True performance of STM32?

lspr35
Associate II
Posted on November 17, 2008 at 13:39

True performance of STM32?

#stm32
53 REPLIES 53
disirio
Associate II
Posted on May 17, 2011 at 12:19

For reference, here are the scores from two common ARM7 microcontrollers:

LPC2148

***************************************************************************

Kernel: ChibiOS/RT 0.7.2

Compiler: GCC 4.3.2 (YAGARTO 28.09.2008)

Options: -O2 -fomit-frame-pointer -mabi=apcs-gnu -falign-functions=16

Settings: CCLK=48, MAMCR=2, MAMTIM=3 (3 wait states)

***************************************************************************

-------

--- Test Case 13 (Benchmark, context switch #1, optimal)

--- Score : 144453 msgs/S, 288906 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 14 (Benchmark, context switch #2, empty ready list)

--- Score : 111980 msgs/S, 223960 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 15 (Benchmark, context switch #3, 4 threads in ready list)

--- Score : 111979 msgs/S, 223958 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 16 (Benchmark, threads creation/termination, worst case)

--- Score : 86464 threads/S

--- Result: SUCCESS

-------

--- Test Case 17 (Benchmark, threads creation/termination, optimal)

--- Score : 118939 threads/S

--- Result: SUCCESS

-------

--- Test Case 18 (Benchmark, mass reschedulation, 5 threads)

--- Score : 35870 reschedulations/S, 215220 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 19 (Benchmark, I/O Queues throughput)

--- Score : 341232 bytes/S

--- Result: SUCCESS

-------

AT91SAM7X

***************************************************************************

Kernel: ChibiOS/RT 0.7.2

Compiler: GCC 4.3.2 (YAGARTO 28.09.2008)

Options: -O2 -fomit-frame-pointer -mabi=apcs-gnu

Settings: MCK=48.054857, MC_FMR = AT91C_MC_FWS_1FWS (1 wait state)

***************************************************************************

-------

--- Test Case 13 (Benchmark, context switch #1, optimal)

--- Score : 114945 msgs/S, 229890 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 14 (Benchmark, context switch #2, empty ready list)

--- Score : 90221 msgs/S, 180442 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 15 (Benchmark, context switch #3, 4 threads in ready list)

--- Score : 90221 msgs/S, 180442 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 16 (Benchmark, threads creation/termination, worst case)

--- Score : 66878 threads/S

--- Result: SUCCESS

-------

--- Test Case 17 (Benchmark, threads creation/termination, optimal)

--- Score : 92312 threads/S

--- Result: SUCCESS

-------

--- Test Case 18 (Benchmark, mass reschedulation, 5 threads)

--- Score : 27850 reschedulations/S, 167100 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 19 (Benchmark, I/O Queues throughput)

--- Score : 240464 bytes/S

--- Result: SUCCESS

-------

Note: the tests were performed in ARM mode; the scores in THUMB mode are much worse. The wait states are those recommended by the manufacturers.

regards,

Giovanni

---

ChibiOS/RT

http://chibios.sourceforge.net

[ This message was edited by: disirio on 18-10-2008 11:24 ]

jrnore
Associate II
Posted on May 17, 2011 at 12:19

Hi Giovanni,

What is the ACR register of the STM32? What STM32 did you use?

I have benched the STM32F103RB (STM32 Primer), and adding 2 wait states significantly decreases performance. BTW, there is absolutely nothing in the stm32f103rb datasheet about the peripheral interfaces (registers, ...).

Regards,

Frank

jrnore
Associate II
Posted on May 17, 2011 at 12:19

Quote:

On 20-10-2008 at 11:05, Anonymous wrote:

Hi Giovanni,

What is the ACR register of the STM32? What STM32 did you use?

I have benched the STM32F103RB (STM32 Primer), and adding 2 wait states significantly decreases performance. BTW, there is absolutely nothing in the stm32f103rb datasheet about the peripheral interfaces (registers, ...).

Regards,

Frank

Hmmm... I have tested the 1 and 2 wait-state configurations again, but this time with the (1<<4) bit set in the ACR.

1. What is this bit exactly? Is it a kind of flash "accelerator"?

2. It is not documented in RM0008.pdf (Reference Manual for Low-, medium- and high-density STM32F101xx, STM32F102xx and STM32F103xx advanced ARM-based 32-bit MCUs). Why?

3. Are there drawbacks or limitations when using this "hidden" feature?

FYI:

. With 0 WS, my bench uses 67476 cycles

. With ACR=0x01 (1 WS), my bench uses 97728 cycles.

. With ACR=0x11 (1 WS, bit 4 set), my bench now uses 71904 cycles.

I would like to know if I can use this feature even if this is not documented...

Regards,

Frank
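To put the three cycle counts above in proportion, here is a minimal sketch that expresses each run as a percentage overhead over the 0 WS baseline. The helper name `overhead_percent` is made up for illustration; only the cycle counts come from the post.

```c
/* Hypothetical helper: extra cycles relative to a baseline run, in percent. */
static double overhead_percent(unsigned long cycles, unsigned long baseline)
{
    return 100.0 * ((double)cycles - (double)baseline) / (double)baseline;
}

/* Cycle counts reported in the post above. */
#define CYCLES_0WS       67476UL  /* 0 wait states */
#define CYCLES_1WS_0X01  97728UL  /* ACR=0x01 */
#define CYCLES_1WS_0X11  71904UL  /* ACR=0x11 */
```

With these figures, ACR=0x01 costs roughly 45% more cycles than 0 WS, while ACR=0x11 costs only about 6.6% more.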

benedwards19
Associate II
Posted on May 17, 2011 at 12:19

Quote:

Hmmm... I have tested the 1 and 2 wait-state configurations again, but this time with the (1<<4) bit set in the ACR.

1. What is this bit exactly? Is it a kind of flash "accelerator"?

2. It is not documented in RM0008.pdf (Reference Manual for Low-, medium- and high-density STM32F101xx, STM32F102xx and STM32F103xx advanced ARM-based 32-bit MCUs). Why?

3. Are there drawbacks or limitations when using this "hidden" feature?

FYI:

. With 0 WS, my bench uses 67476 cycles

. With ACR=0x01 (1 WS), my bench uses 97728 cycles.

. With ACR=0x11 (1 WS, bit 4 set), my bench now uses 71904 cycles.

I would like to know if I can use this feature even if this is not documented...

Regards,

Frank

Frank,

The ACR is documented in the Flash programming manual (13259.pdf). According to it, bit 4 in the ACR is the prefetch buffer enable bit.

Obviously turning on the prefetch will speed up sequential accesses and remove some of the hit you take by having wait states.

Good luck!

-Ben
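As a rough illustration of how the two ACR values from the earlier post decompose, here is a small sketch. It assumes the wait-state count sits in bits 2:0 (my reading of the Flash programming manual, not stated in this thread) and that bit 4 is the prefetch enable as described above. The macro and function names are hypothetical, not taken from any ST header.

```c
#include <stdint.h>

/* Assumed FLASH_ACR layout on STM32F10x (names are illustrative only). */
#define ACR_LATENCY_MASK 0x07u      /* bits 2:0: number of wait states (assumption) */
#define ACR_PRFTBE       (1u << 4)  /* bit 4: prefetch buffer enable */

/* Build an ACR value from a wait-state count and a prefetch on/off flag. */
static uint32_t acr_value(uint32_t wait_states, int prefetch_on)
{
    uint32_t v = wait_states & ACR_LATENCY_MASK;
    if (prefetch_on)
        v |= ACR_PRFTBE;
    return v;
}
```

Under that layout, 1 WS without prefetch gives 0x01 and 1 WS with prefetch gives 0x11, which matches the two settings that were benchmarked.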

lspr35
Associate II
Posted on May 17, 2011 at 12:19

Hi,

I think the way to achieve 1.25 DMips/MHz is to operate the device at 48MHz with 0 wait states. In this case you achieve the highest DMips/MHz value but not the fastest processor.

Additionally it might be necessary to use a "better" compiler. I was using the GNU tool chain. I think with Green Hills or Keil it will be possible to achieve better results.

Regards

Squonk

Quote:

On 16-10-2008 at 13:35, Anonymous wrote:

Hi,

I'm working on a STM32 product (STM32 Primer) and have ported Dhrystone 2.1.

Squonk, if I look at your best score (you said "55260 Dhrystone/s"), that means that you achieve about 31.45 DMips (I divided by 1757, the number of Dhrystones per second obtained on the VAX 11/780, nominally a 1 MIPS machine - http://en.wikipedia.org/wiki/Dhrystone)

But compared to your frequency, which is 72MHz, that gives... only 0.43 DMips/MHz!!!

The product is supposed to deliver about 1.25 DMips/MHz with no wait states. I do not know what to expect with 2 wait states, but that's a big difference.

I have tested Dhrystone on my STM32 Primer (or STM32Circle), and I get (with -O3, at 12MHz / 0 wait states): 0.70 DMips/MHz. This is far from the 1.25 number given in the datasheet.

Can anyone please explain to me how to achieve the 1.25 number?
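The conversion used in the quote above (Dhrystones/s divided by 1757, then by the clock in MHz) can be sketched as follows. The function names are made up for illustration; the 1757 constant and the input figures are the ones given in the quote.

```c
/* Dhrystones/s of the VAX 11/780 reference machine (nominally 1 MIPS). */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Convert a raw Dhrystone score to DMIPS. */
static double dmips(double dhrystones_per_sec)
{
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC;
}

/* Normalize the DMIPS figure by the core clock in MHz. */
static double dmips_per_mhz(double dhrystones_per_sec, double clock_mhz)
{
    return dmips(dhrystones_per_sec) / clock_mhz;
}

/* Dhrystones/s needed to reach a target DMIPS/MHz at a given clock. */
static double required_dhrystones(double target_dmips_per_mhz, double clock_mhz)
{
    return target_dmips_per_mhz * clock_mhz * VAX_DHRYSTONES_PER_SEC;
}
```

For 55260 Dhrystones/s at 72 MHz this gives about 31.45 DMIPS, or roughly 0.44 DMIPS/MHz. Going the other way, 1.25 DMIPS/MHz at 72 MHz would correspond to about 158130 Dhrystones/s.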

jrnore
Associate II
Posted on May 17, 2011 at 12:19

Quote:

On 20-10-2008 at 16:45, Anonymous wrote:

Frank,

The ACR is documented in the Flash programming manual (13259.pdf). According to it, bit 4 in the ACR is the prefetch buffer enable bit.

Obviously turning on the prefetch will speed up sequential accesses and remove some of the hit you take by having wait states.

Good luck!

-Ben

Thanks for the information!

jrnore
Associate II
Posted on May 17, 2011 at 12:19

Quote:

On 20-10-2008 at 17:57, Anonymous wrote:

Hi,

I think the way to achieve 1.25 DMips/MHz is to operate the device at 48MHz with 0 wait states. In this case you achieve the highest DMips/MHz value but not the fastest processor.

Additionally it might be necessary to use a "better" compiler. I was using the GNU tool chain. I think with Green Hills or Keil it will be possible to achieve better results.

Regards

Squonk

Hi Squonk,

I have ported Dhrystone2 and get 0.46 DMips/MHz at 12 MHz with 2 WS, which is almost the same result as yours.

With 0 wait-state, I have 0.70 DMips/MHz, which seems coherent (the drop with/without wait state seems reasonable).

I think that the solution to get 1.25 DMips/MHz is to have a new compiler.

What is strange is that I am using GCC (v4.2.2) to bench an STM32 competitor (I don't know if I can tell who...) and I can reproduce the 1.49 DMips/MHz figure given in its datasheet.

It would be nice if ST could explain how they got the 1.25 number, because I would like to reproduce it myself.

Anyway, many thanks for your help!

Frank

disirio
Associate II
Posted on May 17, 2011 at 12:19

Quote:

On 20-10-2008 at 11:05, Anonymous wrote:

Hi Giovanni,

What is the ACR register of the STM32? What STM32 did you use?

I have benched the STM32F103RB (STM32 Primer), and adding 2 wait states significantly decreases performance. BTW, there is absolutely nothing in the stm32f103rb datasheet about the peripheral interfaces (registers, ...).

Regards,

Frank

I ran the tests on the stm32f103rb too.

Ben already answered about the ACR. The prefetch buffers are only mentioned in the reference manual; the settings are described in the flash programming manual. Note that the prefetch buffers are enabled by default after reset: see the default value of the ACR register.

regards,

Giovanni

---

ChibiOS/RT

http://chibios.sourceforge.net
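The remark about the reset default can be checked with a quick decode. This is only a sketch: the reset value 0x00000030 is what I recall from the Flash programming manual and should be verified there, the wait-state field in bits 2:0 is likewise an assumption, and the names below are hypothetical.

```c
#include <stdint.h>

/* Assumed FLASH_ACR reset value on STM32F10x; verify against the manual. */
#define ACR_RESET_VALUE 0x00000030u

/* Wait-state count, assuming it is held in bits 2:0. */
static unsigned acr_wait_states(uint32_t acr)
{
    return (unsigned)(acr & 0x07u);
}

/* Prefetch buffer enable, bit 4 as described in the thread. */
static int acr_prefetch_enabled(uint32_t acr)
{
    return (int)((acr >> 4) & 1u);
}
```

Decoding the assumed reset value gives 0 wait states with the prefetch buffer enabled, so a write of ACR=0x01 would turn the prefetch off as a side effect.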

jrnore
Associate II
Posted on May 17, 2011 at 12:19

Quote:

With 0 wait-state, I have 0.70 DMips/MHz, which seems coherent (the drop with/without wait state seems reasonable).

This was with arm-none-eabi-gcc v4.2.1.

I have tested with v4.3.2 (the latest release) and get (only) 0.748 DMips/MHz with 0 wait-state. Still far from the 1.25 DMips/MHz expected from the datasheet...

:|

Frank

slawcus
Associate II
Posted on May 17, 2011 at 12:19

Where did you get this "MIPS test"?