
True performance of STM32?

lspr35
Associate II
Posted on November 17, 2008 at 13:39

True performance of STM32?

#stm32
53 REPLIES
16-32micros
Associate III
Posted on May 17, 2011 at 12:19

Hi ivan,

Just so I know, how did you get these results for the STR and LDR cycle counts?

The 18 MHz GPIO toggle limit comes from the data path between the core and the APB registers: even though the bus is clocked at 72 MHz, a store (STR) to an APB register takes two cycles when the AHB/APB ratio is 1. That is not the case for the internal SRAM or for the DMA connected to the System Bus, where a store takes only one cycle. For more details on DMA and bus cycles you can have a look at

http://www.st.com/stonline/products/literature/an/13529.pdf

: Using the STM32F101xx and STM32F103xx DMA controller
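A rough sketch of the kind of write sequence involved (the ODR address here is the one Ivan uses further down in the thread; in a real measurement the stores would be unrolled so the loop branch does not dominate):

#define GPIO_ODR (*(volatile unsigned int *)0x4001080C) /* GPIO output data register on APB2 */

void toggle_forever(void)
{
    for (;;) {
        GPIO_ODR = 1; /* STR to an APB register: costs extra bus cycles */
        GPIO_ODR = 0; /* a back-to-back store has to wait for the previous one */
    }
}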

Regards, STOne-32

ivan239955_stm1_st
Associate II
Posted on May 17, 2011 at 12:19

Hi STOne-32,

I was debugging with Keil uVision3 and measuring the execution time of each instruction. I cannot verify the STM32 timing on real hardware right now, as my board is damaged; I'll have a new one early next week.

However, I've just run it on the Keil simulator, which is supposed to be cycle accurate, and it shows 2 cycles for STR Rx,[Ry,#imm] (with an SRAM destination).

I'll repeat the measurements when my board is ready.

thanks, Ivan

16-32micros
Associate III
Posted on May 17, 2011 at 12:19

Hi ivan,

Are you fetching your code from SRAM or from internal Flash? With Flash at 0 wait states we should see the exact cycle counts described in the Cortex-M3 TRM from ARM:

Load-store timings :

This section describes how best to pair instructions. This achieves more reductions in timing.

• STR Rx,[Ry,#imm] is always one cycle. This is because the address generation is performed in the initial cycle, and the data store is performed at the same time as the next instruction is executing. If the store is to the store buffer, and the store buffer is full, the next instruction is delayed until the store can complete.

If the store is not to the store buffer (such as to the Code segment) and that transaction stalls, the impact on timing is only felt if another load or store operation is executed before completion.

• LDR Rx!,[any] is not normally pipelined. That is, base update load is generally at least a two-cycle operation (more if stalled). However, if the next instruction does not require to read from a register, the load is reduced to one cycle. Non register reading instructions include CMP, TST, NOP, and non-taken IT controlled instructions.

• LDR PC,[any] is always a blocking operation. This means minimally two cycles for the load, and three cycles for the pipeline reload. So at least five cycles (more if stalled on the load or the fetch).

• LDR Rx,[PC,#imm] might add a cycle because of contention with the fetch unit.

• TBB and TBH are also blocking operations. These are minimally two cycles for the load, one cycle for the add, and three cycles for the pipeline reload. This means at least six cycles (more if stalled on the load or the fetch).

• LDR any are pipelined when possible. This means that if the next instruction is an LDR or non-base updating STR, and the destination of the first LDR is not used to compute the address for the next instruction, then one cycle is removed from the cost of the next instruction. So, an LDR might be followed by an STR, so that the STR writes out what the LDR loaded. More multiple LDRs can be pipelined together.

Some optimized examples:

• LDR R0,[R1]; LDR R1,[R2] - normally three cycles total

• LDR R0,[R1,R2]; STR R0,[R3,#20] - normally three cycles total

• LDR R0,[R1,R2]; STR R1,[R3,R2] - normally three cycles total

• LDR R0,[R1,R5]; LDR R1,[R2]; LDR R2,[R3,#4] - normally four cycles total

Anyway, I will double-check your results on both the simulator and real STM32 hardware tonight, and I will let you know my findings tomorrow 🙂

Cheers, STOne-32

ivan239955_stm1_st
Associate II
Posted on May 17, 2011 at 12:19

Hi STOne-32,

I've made additional measurements with the new board and a scope, and indeed STR Rx,[Ry,#imm] is one cycle for RAM access. For GPIO access, there is an additional wait cycle if another STR follows immediately.

The interesting thing is that the wait cycle disappears if there is a non-memory-accessing instruction between two consecutive GPIO access instructions. This is a nice feature for bit-banging applications - you can add extra instructions without slowing down the generated waveform.

To demonstrate this, the code below generates an 18 MHz signal (2 cycles per GPIO state) even though I've put additional instructions into the GPIO write stream.

//sysclock, APB2 = 72 MHz, flash latency = 2, prefetch buffer enabled, GCC in CrossWorks, Wiggler

ldr r2,=0x4001080c //r2 is address of port E ODR

ldr r1,=0x1 // r1= 1

movs r0,#0 // r0= 0

str r0,[r2,#0] //GPIO pin =0

str r1,[r2,#0] //GPIO pin =1 1+1 cycles

str r0,[r2,#0] //GPIO pin =0 1+1 cycles

str r1,[r2,#0] //GPIO pin =1 1+1 cycles

mov r4,r4 // 1 cyc. ''free'' instruction - no additional cycle

str r0,[r2,#0] //GPIO pin =0 1+0 cycles

add r4,#0 //1cyc. ''free'' instruction - no additional cycle

str r1,[r2,#0] //GPIO pin =1 1+0 cycles

str r0,[r2,#0] //GPIO pin =0 1+1 cycles

str r1,[r2,#0] //GPIO pin =1 1+1 cycles

...

If you use RAM-accessing instructions inside the GPIO write stream, the situation gets more complicated and additional wait cycles appear, e.g. STR Rx,[Ry,#imm] to RAM will take 2 cycles.

Ivan

16-32micros
Associate III
Posted on May 17, 2011 at 12:19

Dear Ivan,

I'm happy to see that you found exactly what I told you before 🙂 and even more, and that you are now becoming an expert on our STM32 😉 , congratulations.

You are absolutely right. In fact, in our embedded language 😉 this is called Back-to-Back Writes to Peripherals:

Periph_registerA = x;

Periph_registerB = y;

Periph_registerC = z;

will often be less efficient than:

Periph_registerA = x;

y = Data processing ( MOV , ORR , DIV, MUL etc...)

Periph_registerB = y;

z = Data processing ( MOV , ORR , DIV, MUL etc...)

Periph_registerC = z;

The reason is that stores can complete in the background on the STM32. This means that an STR generally takes two cycles or more (depending on what precedes it and on the APB prescalers - wait states). So, if two STRs are back-to-back, the second one must wait for the first to complete. If other activity precedes the STR, it will not stall, because the store buffer drains the operation out. However, this is not the case for loads ;-( Unlike the STR case, the Cortex-M3 has to wait for the load to complete no matter what. So back-to-back LDRs (and an LDM) optimize the time by pipelining the address generation, and it is better to keep loads together when possible.

Now, if you run code from RAM, we lose the Harvard architecture ;-( and you end up with a von Neumann one, and the System Bus will be stalled more often...

PS: I will escalate the issue to the Keil µVision staff about their inaccurate simulator cycle profiling and about the gap versus the real silicon.

However, note that storing or loading data that is not aligned to 32-bit boundaries will add additional cycles... something to always keep in mind :|
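For example, a word access that straddles a 32-bit boundary is split into several bus transfers (a small sketch, assuming the core's unaligned-access support is left enabled, i.e. UNALIGN_TRP = 0):

#include <stdint.h>

uint32_t read_word_at_offset_1(const uint8_t *buf)
{
    /* buf + 1 is not 32-bit aligned: the resulting LDR is broken into
       multiple bus transfers and costs extra cycles compared to an
       aligned load */
    return *(const uint32_t *)(buf + 1);
}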

Regards, STOne-32 :o

lspr35
Associate II
Posted on May 17, 2011 at 12:19

Hello,

I set up a Dhrystone test on the STM32. My environment is the following:

Evaluation Board: STM32Circle

uP: STM32F103RBT6 Rev. A silicon (running at 72MHz, 2 flash waitstates)

Compiler: GCC 4.2.1

Tools: Raisonance

The duration of the Dhrystone loop was measured using the SYSTICK timer. Since the timer overflows after approx. 233 ms, I had to count the timer overflows in an ISR. I know this makes the measurement slightly inaccurate, but I think the error is very small.
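For reference, a minimal sketch of that measurement scheme (not my exact code; the register addresses are the standard Cortex-M3 SysTick ones):

#define SYSTICK_CTRL (*(volatile unsigned int *)0xE000E010)
#define SYSTICK_LOAD (*(volatile unsigned int *)0xE000E014)
#define SYSTICK_VAL  (*(volatile unsigned int *)0xE000E018)

static volatile unsigned int overflows;

void SysTick_Handler(void) { overflows++; } /* one wrap every 2^24 cycles, ~233 ms at 72 MHz */

void timing_start(void)
{
    overflows = 0;
    SYSTICK_LOAD = 0x00FFFFFF; /* full 24-bit reload value */
    SYSTICK_VAL  = 0;          /* clear the current counter */
    SYSTICK_CTRL = 0x7;        /* enable, interrupt on wrap, core clock source */
}

unsigned long long timing_stop_cycles(void)
{
    unsigned int val = SYSTICK_VAL; /* SysTick counts down */
    return (unsigned long long)overflows * 0x1000000ULL + (0x00FFFFFF - val);
}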

The results are as follows:

Optimize -O3: 55260 Dhrystones / s (size: 1664 Bytes)

Optimize -O2: 46790 Dhrystones / s (size: 1564 Bytes)

Optimize -O1: 45871 Dhrystones / s (size: 1684 Bytes)

Optimize none: 32808 Dhrystones / s (size: 2428 Bytes)

Optimize size: 47125 Dhrystones / s (size: 1458 Bytes)

If I set the flash access to 1 wait state (not allowed at 72 MHz) I get this result:

Optimize -O3: 67038 Dhrystones / s (size: 1664 Bytes)

(So it seems it would be a good idea for ST to use a faster Flash in order to allow 1 wait state.)

The size is the value for the modules dhry_1.c + dhry_2.c only! The optimization ''-O3'' gives 17% faster code than size optimization with 14% more code.

Here You can find the results of an LPC2129 at 60MHz.

http://www.rowley.co.uk/arm/arm_bench.htm

I don't know if these results are comparable, but I think that the result of the STM32 is quite good.

It would be interesting if someone out there can do the same test with a different compiler (e.g. Keil). So I attach my project in order to make this test easier. Results from other micros would be interesting (e.g. ARM9, ARM7, ST10, ...).

Regards

Squonk


jrnore
Associate II
Posted on May 17, 2011 at 12:19

Hi,

I'm working on a STM32 product (STM32 Primer) and have ported Dhrystone 2.1.

Squonk, if I look at your best score (you said ''55260 Dhrystones/s''), that means you achieve about 31.45 DMIPS (I divided by 1757, the number of Dhrystones per second obtained on the VAX 11/780, nominally a 1 MIPS machine - http://en.wikipedia.org/wiki/Dhrystone ).

But compared to your frequency, which is 72 MHz, that gives... only 0.43 DMIPS/MHz!!!
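The arithmetic, spelled out as a trivial check (using the numbers quoted above):

#include <stdio.h>

int main(void)
{
    double dhrystones_per_s = 55260.0;        /* Squonk's best -O3 score */
    double dmips = dhrystones_per_s / 1757.0; /* VAX 11/780 reference */
    printf("%.2f DMIPS, %.2f DMIPS/MHz at 72 MHz\n", dmips, dmips / 72.0);
    return 0; /* roughly 31.45 DMIPS, i.e. about 0.43-0.44 DMIPS/MHz */
}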

The product is supposed to deliver about 1.25 DMIPS/MHz with no wait states. I do not know what the figure is with 2 wait states, but that's a big difference.

I have tested Dhrystone on my STM32 Primer (i.e. STM32Circle), and I get (with -O3, at 12 MHz / 0 wait states): 0.70 DMIPS/MHz. This is far from the 1.25 figure given in the datasheet.

Can anyone please explain to me how to achieve the 1.25 figure?

slawcus
Associate II
Posted on May 17, 2011 at 12:19

Did anyone check the DSP lib for STM32? Unless this lib is rather poorly written, I'd say the STM32 is not such a great performer. Also, there is close to no performance gain when code is executed at 72 MHz compared to 48 MHz. Not worth the extra mA.

Now compare this to STR9 benchmarks.

ccc :-?


jrnore
Associate II
Posted on May 17, 2011 at 12:19

I am currently benching the FFT and try to assemble the following file:

http://www.st.com/mcu/modules/Splatt_Forums/downloadtemp/FFTCM3.s

I am using Ride7, but I cannot assemble the file: it says it's a ''C source file'' and I cannot change it to ''Assembly source file''. Moreover, it does not seem to preprocess my file. How can I compile it?

Frank.

[edit]: I have manually assembled it with the ARM GCC compiler, then added the object file to the project. The FFT is working. Strange that the IDE did not want to assemble it...


disirio
Associate II
Posted on May 17, 2011 at 12:19

Quote:

On 16-10-2008 at 22:36, Anonymous wrote:

Did anyone check the DSP lib for STM32? Unless this lib is rather poorly written, I'd say the STM32 is not such a great performer. Also, there is close to no performance gain when code is executed at 72 MHz compared to 48 MHz. Not worth the extra mA.

I just ran some tests at 48 and 72 MHz; I can see a good improvement going from 48 to 72:

***************************************************************************

Kernel: ChibiOS/RT 0.7.2

Compiler: GCC 4.3.2 (YAGARTO 28.09.2008)

Options: -O2 -fomit-frame-pointer -mabi=apcs-gnu -falign-functions=16

Settings: SYSCLK=48, ACR=0x11 (1 wait state)

***************************************************************************

-------

--- Test Case 13 (Benchmark, context switch #1, optimal)

--- Score : 160572 msgs/S, 321144 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 14 (Benchmark, context switch #2, empty ready list)

--- Score : 134029 msgs/S, 268058 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 15 (Benchmark, context switch #3, 4 threads in ready list)

--- Score : 134029 msgs/S, 268058 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 16 (Benchmark, threads creation/termination, worst case)

--- Score : 105399 threads/S

--- Result: SUCCESS

-------

--- Test Case 17 (Benchmark, threads creation/termination, optimal)

--- Score : 137112 threads/S

--- Result: SUCCESS

-------

--- Test Case 18 (Benchmark, mass reschedulation, 5 threads)

--- Score : 42051 reschedulations/S, 252306 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 19 (Benchmark, I/O Queues throughput)

--- Score : 377572 bytes/S

--- Result: SUCCESS

-------

***************************************************************************

Kernel: ChibiOS/RT 0.7.2

Compiler: GCC 4.3.2 (YAGARTO 28.09.2008)

Options: -O2 -fomit-frame-pointer -mabi=apcs-gnu -falign-functions=16

Settings: SYSCLK=72, ACR=0x12 (2 wait states)

***************************************************************************

-------

--- Test Case 13 (Benchmark, context switch #1, optimal)

--- Score : 216994 msgs/S, 433988 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 14 (Benchmark, context switch #2, empty ready list)

--- Score : 178662 msgs/S, 357324 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 15 (Benchmark, context switch #3, 4 threads in ready list)

--- Score : 178663 msgs/S, 357326 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 16 (Benchmark, threads creation/termination, worst case)

--- Score : 141108 threads/S

--- Result: SUCCESS

-------

--- Test Case 17 (Benchmark, threads creation/termination, optimal)

--- Score : 187046 threads/S

--- Result: SUCCESS

-------

--- Test Case 18 (Benchmark, mass reschedulation, 5 threads)

--- Score : 55768 reschedulations/S, 334608 ctxswc/S

--- Result: SUCCESS

-------

--- Test Case 19 (Benchmark, I/O Queues throughput)

--- Score : 489476 bytes/S

--- Result: SUCCESS

-------

Note that the wait states are different between the 2 tests: 1 wait state at 48 MHz and 2 wait states at 72 MHz.

regards,

Giovanni

---

ChibiOS/RT

http://chibios.sourceforge.net
