How to obtain 1.25 MIPS

Roman K · ‎2018-02-16

Posted on February 16, 2018 at 14:52

Hello

In my previous experience with Atmel 8-bit MCUs, which deliver 1 MIPS/MHz, I got exactly one executed instruction per system clock tick.

Now I'm using an STM32F103. I noted from the datasheet that its performance is 1.25 DMIPS/MHz. So I wrote a small assembler program; in short:

Loop:
    LDR param0, [R6]      ; param0 is the receiving register, R6 holds an address in the peripheral bit-band region
    STR param0, [R7], #4  ; R7 holds an address in the SRAM bit-band region
    B Loop
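Assuming the two addresses in the loop above point into the Cortex-M3 bit-band alias regions, the alias address for a given bit can be worked out as in the sketch below. The helper name `bitband_alias` and the macro names are hypothetical and chosen for illustration only; the region bases are the standard Cortex-M3 ones.

```c
#include <assert.h>
#include <stdint.h>

/* Cortex-M3 bit-band regions and their alias regions. */
#define SRAM_BB_BASE       0x20000000u
#define SRAM_ALIAS_BASE    0x22000000u
#define PERIPH_BB_BASE     0x40000000u
#define PERIPH_ALIAS_BASE  0x42000000u

/* Hypothetical helper: address of the alias word that maps to one bit.
 * Each bit of the 1 MB bit-band region owns one 32-bit word in the alias
 * region, so a byte offset expands by 32 and a bit number by 4. */
uint32_t bitband_alias(uint32_t region_base, uint32_t alias_base,
                       uint32_t addr, uint32_t bit)
{
    assert(bit < 32u);
    assert(addr >= region_base && addr - region_base < 0x100000u);
    return alias_base + (addr - region_base) * 32u + bit * 4u;
}
```

For example, taking 0x40010808 as the GPIOA input data register of the 'F103 (an assumption about which port is sampled), bit 0 maps to alias address 0x42210100. In the SRAM alias region, stepping the address by 4, as the post-incremented STR does, moves on to the next bit of the underlying memory.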

There are no prescalers on AHB or on APB1/2. I loaded this small code into the embedded SRAM, set the flash latency to 0, disabled the flash prefetch buffer, and turned off all interrupts and DMA.

Then I measured how fast this code executes from SRAM. The result is that one instruction takes 4 system clock ticks (the branch takes 8), so the actual performance is 0.25 MIPS/MHz.

What did I do wrong? Or what did I misunderstand?
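As a rough cross-check of the numbers above, the following sketch adds up the reported cycle counts for one pass through the three-instruction loop and converts them into instructions per clock and samples per second. The function names are hypothetical, and the core clock passed in is an assumption, since the post does not state it.

```c
#include <assert.h>
#include <stdint.h>

/* Cycle counts as reported in the post: 4 per LDR/STR, 8 for the branch. */
enum { LDR_CYCLES = 4, STR_CYCLES = 4, BRANCH_CYCLES = 8, LOOP_INSTRUCTIONS = 3 };

/* Clock cycles spent in one iteration of the LDR/STR/B loop. */
uint32_t loop_cycles(void)
{
    return LDR_CYCLES + STR_CYCLES + BRANCH_CYCLES;
}

/* Instructions per clock cycle over the whole loop (numerically MIPS/MHz). */
double loop_mips_per_mhz(void)
{
    return (double)LOOP_INSTRUCTIONS / (double)loop_cycles();
}

/* Samples per second if every loop iteration stores one sample. */
uint32_t loop_sample_rate_hz(uint32_t core_clock_hz)
{
    assert(core_clock_hz > 0u);
    return core_clock_hz / loop_cycles();
}
```

With the reported counts one iteration costs 16 cycles, so the loop as a whole runs at 0.1875 instructions per clock; the 0.25 figure applies to the LDR and STR alone. At an assumed 72 MHz core clock that would be 4.5 million samples per second.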

henry.dick · ‎2018-02-16

Posted on February 17, 2018 at 04:25

'the SoC can not complete the task I wanted to... Very sad'

very sad at the continued comparison of apples to oranges.

Roman K · ‎2018-02-17

Posted on February 17, 2018 at 08:25

This is all brand new to me :) I have just started with ARM, and the STM32F is the cheapest option that I know of, so I bought an F103 to test with.

The task was a sort of digital analyzer: read IDR with minimum latency, break out of the loop with an interrupt, and push the data to the USART. I wanted to analyze signals of up to 12 MHz.

Jan Waclawek · ‎2018-02-17

Posted on February 17, 2018 at 11:15

Strange indeed.

Do you use some debugger? Try to disconnect it.

And try to run it from FLASH.

JW

Roman K · ‎2018-02-17

Posted on February 17, 2018 at 12:13

Yes, the debugger is connected, but there is no debug session, and according to its LED the debugger is not active. I'll try with it disconnected, but I don't really think anything will change.

In my opinion the program should run from SRAM faster, isn't it?

waclawek.jan · ‎2018-02-18

Posted on February 18, 2018 at 14:03

In my opinion the program should run from SRAM faster, isn't it?

It depends.

Look at the memory system, usually the very first figure in the RM (the figure in RM0008 is not very illustrative; look into RM0090 instead for a better picture. Yes, it's a different MCU, but the rough scheme is the same). The processor has three ports. Two of them, I to fetch instructions and D to read data, are used for addresses 0000'0000-1FFF'FFFF. The third, S (for system), is used for reads and writes (including instruction fetches) in the rest of the address space (except for the small area near the end of the address space used to access internal resources of the processor).

As the S-bus is heavily loaded and the many connected peripherals may struggle to meet tight timing on reads, its input is registered (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0337g/BABGAJAB.html - unfortunately ARM removed this level of information from the TRM for newer revisions of the Cortex-M3 and it is not included for the Cortex-M4, but I believe it still pertains). That is, the data it reads (including instruction fetches) arrive in the processor one cycle later, counting from the start of the read/fetch, than data/instructions read through the D and I ports (or maybe even later, depending on how exactly the bus arbiter in the bus matrix works; again, this information is not easy to get).

This does not yet take into account any delays due to busmatrix arbitration (i.e. due to conflicts with other masters in the matrix, such as the DMA controllers). That is not the case in your particular simplistic example, but it is something you have to bear in mind later.

Writing from processor is registered, i.e. there is an output register on the S-bus and writes go into it so after a STR the processor does not need to wait until the write actually happens. That is, if the write register is not disabled (which is not after reset, see SCB_ACTLR.DISDEFWBUF e.g. in PM0056), and if there are no pending data in the write register from previous writes. The latter may happen e.g. due to conflict on the target bus in matrix (again due to other masters, DMA etc.), or if the target imposes some waitstates - in this particular case, the target is the AHB/APB bridge as GPIOs are on APB, which has its own write register, but may impose waitstates due to resynchronization between the buses even if APB runs with no divisor - I know of no such intimate detail being published about this bridge.

In the STM32s, when fetching from FLASH, the user is bound to set waitstates with increasing system clock, as the FLASH is relatively slow. So, while running at slow system clock FLASH may run with zero waitstates and fetching from it benefits from no conflicts with other bus operations, neither internal in processor nor external as long as there are no other masters (DMA) active on that bus.
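To make the relation between system clock and FLASH waitstates concrete, here is a minimal sketch of the three latency ranges that RM0008 gives for the 'F1 (0 waitstates up to 24 MHz, 1 up to 48 MHz, 2 up to 72 MHz). The function name is hypothetical, and the sketch only covers clocks within the rated range.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: FLASH waitstates needed for a given SYSCLK on an
 * STM32F1, following the latency ranges listed in RM0008.
 * Clocks above the rated 72 MHz are rejected rather than guessed at. */
uint32_t f1_flash_wait_states(uint32_t sysclk_hz)
{
    assert(sysclk_hz > 0u && sysclk_hz <= 72000000u);
    if (sysclk_hz <= 24000000u) {
        return 0u;   /* zero waitstates: FLASH keeps up with the core */
    }
    if (sysclk_hz <= 48000000u) {
        return 1u;
    }
    return 2u;
}
```

So a fair comparison of FLASH against SRAM execution at zero waitstates would have to be done at 24 MHz or below.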

In 'F1, fetching from SRAM means reading through the S-bus, as SRAM is fixedly mapped at 0x2000'0000. That imposes the additional penalty related to the registered reads on S-bus mentioned above. But what's maybe even more important, fetching through S-bus conflicts internally in the processor with other accesses through that bus (here: with the writes to the GPIO), so an internal arbiter has to steer between the two, inevitably slowing down the resulting execution.

Adding these two impacts together, execution from SRAM may be slower while FLASH has low count of waitstates.

The Cortex-M3 processor fetches instructions in 32-bit chunks, the instructions being 16-bit and 32-bit wide. It has a prefetch pipeline (3 words deep in M3, IIRC) inside the processor, so if a 32-bit word is fetched and only one 16-bit instruction is consumed, and there is no branch, the next consecutive 16-bit instruction comes in with no fetch time consumed at the processor interfaces, and that time is then used for further fetching. Branches of course purge the prefetch and have to wait until the next instruction is fetched. So, instructions later in a linear run after the branch may start to execute faster once the pipeline starts to fill (maybe that is what you see on that waveform).

---

The 'F2/F4 are better off than 'F1 in this particular aspect in several details: their SRAMs can be remapped onto I/D-buses, there is a jumpcache (and a smaller datacache, together called ART) on the FLASH which has a wide data bus (128 bits), GPIOs are on AHB i.e. not behind an AHB/APB bridge, their DMA may avoid collisions by going into peripherals outside the matrix due to a dual-port AHB/APB bridge, they have several SRAMs so that sometimes collisions can be avoided also on the busmatrix level.

Further reading: ARM Cortex-M3 TRM (at arm.com), the ARMv7-M ARM (at arm.com), https://community.st.com/message/48391 and the Technical Update #1 mentioned there.

---

As you've seen, these things are neither simple nor trivial to explain, to understand, or to cope with. You are simply not supposed to time your processes with the processor. For timing-dependent processes, you are supposed to use hardware. There's plenty of it in these 32-bitters. For your particular task, you might want to consider using a timer-triggered DMA from GPIO to memory for the 'sampling' process itself. That avoids the processor in the timing-critical process, yet a lot of what has been mentioned above still has to be considered.
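A minimal sketch of what such a timer-triggered DMA setup boils down to, under stated assumptions: it only computes the value for the DMA channel configuration register (DMA_CCRx, bit layout as in RM0008) and the timer auto-reload value for a wanted sample rate. The `SK_` macro names and both function names are hypothetical. The sketch uses 32-bit transfers on both sides; whether narrower accesses are acceptable for a given peripheral should be checked in the reference manual. Which sample rates the DMA can actually sustain is not established here.

```c
#include <assert.h>
#include <stdint.h>

/* DMA_CCRx bit fields of the STM32F1 DMA controller (names hypothetical). */
#define SK_DMA_CCR_EN        (1u << 0)   /* channel enable, set last       */
#define SK_DMA_CCR_MINC      (1u << 7)   /* increment the memory address   */
#define SK_DMA_CCR_PSIZE_32  (2u << 8)   /* 32-bit peripheral accesses     */
#define SK_DMA_CCR_MSIZE_32  (2u << 10)  /* 32-bit memory accesses         */
#define SK_DMA_CCR_PL_VHIGH  (3u << 12)  /* very high channel priority     */

/* Configuration for peripheral-to-memory sampling of one fixed register.
 * DIR (bit 4) and PINC (bit 6) stay 0: read from the peripheral, always
 * from the same address. EN is left clear so the caller can set addresses
 * and the transfer count before enabling the channel. */
uint32_t sampling_dma_ccr(void)
{
    return SK_DMA_CCR_MINC | SK_DMA_CCR_PSIZE_32 |
           SK_DMA_CCR_MSIZE_32 | SK_DMA_CCR_PL_VHIGH;
}

/* Auto-reload value so that the timer updates sample_rate_hz times per
 * second, assuming a timer prescaler of 0. */
uint32_t timer_reload_for_rate(uint32_t timer_clock_hz, uint32_t sample_rate_hz)
{
    assert(sample_rate_hz > 0u && sample_rate_hz <= timer_clock_hz);
    return timer_clock_hz / sample_rate_hz - 1u;
}
```

On real hardware these values would be written to the DMA channel that the chosen timer's update request maps to, with the peripheral address set to the GPIO input data register and the timer's update DMA request enabled.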

---

I understand your frustration. Users who were used to exercising tight control in the 8-bit 'true microcontrollers' are often disappointed when migrating to 32-bitters. However, the performance comes at a cost. The 32-bitters are touted as very fast. In fact, the transistors/gates themselves are *not* that much faster. For example, it would be dead easy to run the AVR processor core at a hundred MHz or more - but the memories and peripherals and interconnecting buses and the IO wouldn't keep up with the pace. Thus, much if not most of the speed difference comes from having many more of those transistors, i.e. from parallelism, pipelining/buffering, separating parts of the system to run at different clocks (necessitating resynchronizers in between them), and similar concepts (leading at the extreme, in the processor cores, to out-of-order execution and superscalar architectures). These all interact and result in very complex timing at the clock level, which is hard to simplify, and given that the individual elements coming into play often come from different vendors, it is also hard to gather all the data needed for an exact simulation (read: clock-precise simulators are only in the hands of the final integrator, here ST, and their usage is cumbersome and thus costly).

The idea is, that much of what happens at the clock level is waived at. The vast majority of users are far from utilizing these chips (or even the 8-bitters) down to that level, and even when they inadvertently meet those restrictions, they usually don't try to understand them and find a way to cope with them, but simply move to yet another, faster and bigger model. The manufacturers/marketing support this notion; why would they make things harder for themselves? Who is interested in the nasty details anyway? And the synthetic benchmarks (Dhrystone, CoreMark) used to rank the mcus are just that, synthetic, dealing with raw computing power, benchmarking the memories/bus system only indirectly and completely avoiding any interaction with peripherals. For an mcu this may sound inadequate, but then, again, making a more adequate yet universal enough benchmark would be very complicated, tiresome, with hard to interpret results... so a costly thing (at the end of the day yes, you pay for it in the price of chips) with dubious results...

This all is about giving up control for speed, much as with going to high level programming language you give up control for convenience. Get used to it.

My 2 eurocents.

JW

Roman K · ‎2018-02-18

Posted on February 18, 2018 at 14:58

waclawek.jan, thank you for your very detailed response with so helpful and interesting information.

Adding these two impacts together, execution from SRAM may be slower while FLASH has low count of waitstates.

I see. Anyway, my point was to get the highest freq as possible, so I could not 'afford' a low count of flash waitstates. By the way, I was able to run my F103 at a maximum freq of 128 MHz :) I know this is illegal, but it was very stable at 128 MHz (it was communicating with my PC via USART while running at such a high freq). But as soon as I raise the freq to 144 MHz, the mcu starts to fall into exceptions at random intervals. But that's off-topic anyway.

The idea is, that much of what happens at the clock level is waived at.

This all is about giving up control for speed, much as with going to high level programming language you give up control for convenience. Get used to it.

I was afraid of that.

The 'F2/F4 are better off than 'F1

I already ordered an F405; I want to see what this SoC*k* can do :)

For your particular task, you might want to consider using a timer-triggered DMA from GPIO to memory for the 'sampling' process itself.

Hmm, I should try that. I haven't used DMA yet, and I do not know all the possibilities. Maybe your suggestion is a good reason for me to start.

henry.dick · ‎2018-02-18

Posted on February 18, 2018 at 19:17

'my point was to get the highest freq as possible, '

you have gone from wanting to know how to run the chip at 1.25 MIPS/MHz, to what is DMIPS, to why it differs from MIPS, to now wanting to run it as fast as possible.

maybe you can help yourself better if you tell others what you want to do, in one sentence.

'I know this is illegal'

it is not illegal. just not advisable.