stm32f3 assembly instruction execution timing accuracy

armindavatgaran · ‎2017-01-24

Posted on January 24, 2017 at 21:18

Hello

Suppose we have this code:

array[0] = GPIOA->IDR;

array[1] = GPIOA->IDR;

.

array[n] = GPIOA->IDR;

Except for the first line, every line is assembled into an LDR and an STRH instruction (checked with the Keil uVision IDE). According to the ARM Cortex-M4 TRM, these take 2 CPU cycles each:

.

LDR r5, [r0,#0x00]

STRH r5, [r1,#0xF92]

.

About STR, the ARM Cortex-M4 TRM says:

STR Rx,[Ry,#imm] is always one cycle. This is because the address generation is performed in the initial cycle, and the data store is performed at the same time as the next instruction is executing. If the store is to the write buffer, and the write buffer is full or not enabled, the next instruction is delayed until the store can complete. If the store is not to the write buffer, for example to the Code segment, and that transaction stalls, the impact on timing is only felt if another load or store operation is executed before completion.

And about LDR:

LDR [any] are pipelined when possible. This means that if the next instruction is an LDR or STR, and the destination of the first LDR is not used to compute the address for the next instruction, then one cycle is removed from the cost of the next instruction. So, an LDR might be followed by an STR, so that the STR writes out what the LDR loaded. More multiple LDRs can be pipelined together. Some optimized examples are:

LDR R0,[R1]; LDR R1,[R2] - normally three cycles total.

LDR R0,[R1,R2]; STR R0,[R3,#20] - normally three cycles total.

LDR R0,[R1,R2]; STR R1,[R3,R2] - normally three cycles total.

LDR R0,[R1,R5]; LDR R1,[R2]; LDR R2,[R3,#4] - normally four cycles total.

So we can assume that each array[i] = GPIOA->IDR should take 3 CPU cycles (or 2, because of pipelining?). Here are the results for 1000 such lines executed one right after another:

F3, prefetch disabled: 4495 CPU cycles

F3, prefetch enabled: 2503 CPU cycles
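
For reference, the arithmetic behind these totals can be sketched in a few lines of C. This is only an illustration: the function names are hypothetical, and the 3-cycle figure is the TRM's "normally three cycles total" estimate for an LDR followed by an STR, not a guaranteed cost on this part.

```c
#include <stdint.h>

/* Average number of CPU cycles spent on one LDR/STRH pair. */
double cycles_per_pair(uint32_t total_cycles, uint32_t pairs)
{
    return (double)total_cycles / (double)pairs;
}

/* Total deviation from an assumed fixed cost per pair.
   Positive: slower than expected. Negative: faster than expected. */
int32_t cycles_off_expected(uint32_t total_cycles, uint32_t pairs,
                            uint32_t expected_per_pair)
{
    return (int32_t)total_cycles - (int32_t)(pairs * expected_per_pair);
}
```

With the totals above and 1000 pairs, this gives 4.495 cycles per pair without prefetch (1495 cycles over the 3-cycle estimate) and 2.503 cycles per pair with prefetch (497 cycles under it).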

So, how can we interpret these results?

Thanks.

Tesla DeLorean · ‎2017-01-24

Posted on January 24, 2017 at 22:49

I think the prefetch reduces the bus contention.

You'll get bubbles in the pipeline if the subsequent instruction has dependencies on the register loaded by the prior instruction.

Assume reads/writes across the AHB/APB may be slower.

Would pairing speed things up?

LDR r5, [r0,#0x00]

LDR r6, [r0,#0x00]

STRH r5, [r1,#0xF92]

STRH r6, [r1,#0xF94]


armindavatgaran · ‎2017-01-24

Posted on January 25, 2017 at 05:19

Pairing may speed things up, but it isn't applicable to 1000 such repetitions, because of the limited number of registers and specific needs that don't allow such pairing. Here is the assembled gpio->idr code again, with more lines:

LDR , , , r5, [r0, ♯ 0x00]

STRH , ,r5, [r1, ♯ 0x00]

LDR , , , r5, [r0, ♯ 0x00]

STRH , ,r5, [r1, ♯ 0x01]

LDR , , , r5, [r0, ♯ 0x00]

STRH , ,r5, [r1, ♯ 0x02]

LDR , , , r5, [r0, ♯ 0x00]

STRH , ,r5, [r1, ♯ 0x03]

LDR , , , r5, [r0, ♯ 0x00]

STRH , ,r5, [r1, ♯ 0x04]

LDR , , , r5, [r0, ♯ 0x00]

STRH , ,r5, [r1, ♯ 0x05]

LDR , , , r5, [r0, ♯ 0x00]

STRH , ,r5, [r1, ♯ 0x06]

LDR , , , r5, [r0, ♯ 0x00]

STRH , ,r5, [r1, ♯ 0x07]

LDR , , , r5, [r0, ♯ 0x00]

STRH , ,r5, [r1, ♯ 0x08]

.

And I think, as mentioned in the TRM, each pair should take three clock cycles:

LDR R0,[R1,R2]; STR R0,[R3,#20] - normally three cycles total.

But experiments show that each pair takes ~2.5 CPU cycles with prefetch enabled and ~4.49 without prefetch. Given that the instruction pairs are all alike and simply repeat, what causes this instability and error?

I've done the same measurements on the STM32F4 too. The results are more stable there: the error is only 2 CPU cycles over 2000 pairs of LDR and STRH instructions. (Since both the STM32F3 and the STM32F4 are based on the Cortex-M4 core, what causes this difference?)

I repeat this segment of the TRM:

'If the store is to the write buffer, and the write buffer is full or not enabled, the next instruction is delayed until the store can complete.'

What is the relationship between the write buffer and prefetch? Are they the same thing? Do we have access to it, to set or reset it? What I found in the STM32F4 reference manual (RM0090) is:

'Prefetch is enabled by setting the PRFTEN bit in the FLASH_ACR register'
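
As a rough illustration of what that sentence means in code, here is a minimal C sketch that sets or clears the prefetch bit. It assumes the STM32F4 layout from RM0090, where PRFTEN is bit 8 of FLASH_ACR; the helper name flash_prefetch_set is hypothetical, and the register is passed in as a pointer so the same function can be tried on a plain variable. On the F3 the corresponding bit has a different name and position, so check that part's reference manual before reusing the mask.

```c
#include <stdint.h>

/* Assumption: STM32F4 (RM0090) layout, PRFTEN = bit 8 of FLASH_ACR.
   Other families place the prefetch bit elsewhere. */
#define FLASH_ACR_PRFTEN_MASK (1u << 8)

/* Hypothetical helper: enable (non-zero) or disable (zero) flash prefetch
   by read-modify-write of the given FLASH_ACR register. */
void flash_prefetch_set(volatile uint32_t *flash_acr, int enable)
{
    if (enable) {
        *flash_acr |= FLASH_ACR_PRFTEN_MASK;
    } else {
        *flash_acr &= ~FLASH_ACR_PRFTEN_MASK;
    }
}
```

On an STM32F4 target the call would look like flash_prefetch_set((volatile uint32_t *)0x40023C00u, 1), with 0x40023C00 being the FLASH_ACR address on that family. With the vendor headers the same thing is usually written directly against FLASH->ACR.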

Max · ‎2017-01-25

Posted on January 25, 2017 at 11:01

You may also need to look at the effect of fetching the instruction from FLASH.

Your repeated code is made of one 16-bit instruction and one 32-bit instruction.

The STM32F3 series Flash memory is organized in 64-bit words, and the number of wait states is 2 when running at maximum speed, so it takes 3 cycles to fetch 64 bits' worth of instruction code.

As the prefetch (64-bit) and the instructions (16-bit + 32-bit) are not synchronized, there may be points where the CPU must still wait for the next fetch before it can continue.

On top of that, there might be some tricks here, as the CPU could take advantage of the fetch wait to perform some of the conflicting operations on R5 in its pipeline. This depends on how the instructions line up with the fetch.

In the STM32F4 series, the Flash memory is organized in 128-bit words, and depending on the voltage range and operating frequency, the fetch could be shorter than the code execution time.

Do you know the number of wait states you were using for your tests on the F3 and on the F4?

armindavatgaran · ‎2017-01-29

Posted on January 29, 2017 at 21:14

For the STM32F3, the flash latency is 2, and for the STM32F4 it is 5.
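
Putting the numbers from this thread together, a crude fetch-only estimate can be sketched in C. It assumes that one flash line is delivered every (wait states + 1) cycles, that each LDR + STRH pair occupies 16 + 32 = 48 bits of code, and that nothing else (prefetch buffer, caching, alignment of instructions to flash lines) helps or hurts. The function name is hypothetical, and this is a back-of-the-envelope model, not a description of how the flash interface actually schedules fetches.

```c
#include <stdint.h>

/* Cycles, in hundredths, needed just to fetch one instruction pair from flash.
   Assumes one flash line of line_bits arrives every (wait_states + 1) cycles. */
uint32_t fetch_centicycles_per_pair(uint32_t pair_bits, uint32_t line_bits,
                                    uint32_t wait_states)
{
    return (pair_bits * (wait_states + 1u) * 100u) / line_bits;
}
```

Under these assumptions, the F3 (64-bit line, 2 wait states) and the F4 (128-bit line, 5 wait states) both come out at 225, that is 2.25 cycles per pair for fetching alone. So the model by itself does not explain the difference between the two parts, but it does suggest that instruction fetch, rather than the LDR/STRH execution timing alone, may be setting the pace.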