2017-01-24 12:18 PM
Hello
Suppose this code:
array[0] = GPIOA->IDR;
array[1] = GPIOA->IDR;
.
.
.
array[n] = GPIOA->IDR;
Except for the first line, all other lines are each assembled into an LDR and an STRH instruction (tested with the Keil uVision IDE) which, as mentioned in the ARM Cortex-M4 TRM, take 2 CPU cycles each:
.
.
.
LDR r5, [r0,#0x00]
STRH r5, [r1,#0xF92]
.
.
.
About STR, the ARM Cortex-M4 TRM mentions:
'STR Rx,[Ry,#imm] is always one cycle. This is because the address generation is performed in the initial cycle, and the data store is performed at the same time as the next instruction is executing. If the store is to the write buffer, and the write buffer is full or not enabled, the next instruction is delayed until the store can complete. If the store is not to the write buffer, for example to the Code segment, and that transaction stalls, the impact on timing is only felt if another load or store operation is executed before completion.'
And about LDR:
'LDR [any] are pipelined when possible. This means that if the next instruction is an LDR or STR, and the destination of the first LDR is not used to compute the address for the next instruction, then one cycle is removed from the cost of the next instruction. So, an LDR might be followed by an STR, so that the STR writes out what the LDR loaded. More multiple LDRs can be pipelined together. Some optimized examples are:
— LDR R0,[R1]; LDR R1,[R2] - normally three cycles total.
— LDR R0,[R1,R2]; STR R0,[R3,#20] - normally three cycles total.
— LDR R0,[R1,R2]; STR R1,[R3,R2] - normally three cycles total.
— LDR R0,[R1,R5]; LDR R1,[R2]; LDR R2,[R3,#4] - normally four cycles total.'
So we can assume that each array[i] = GPIOA->IDR should take 3 CPU cycles (or 2, because of pipelining?). Here is the result of executing 1000 lines of such code, one right after another:
F3, prefetch disabled: 4495 CPU cycles
F3, prefetch enabled: 2503 CPU cycles
So, how can we interpret these results?
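For reference, here is a minimal sketch of one way to take such a cycle count with the DWT cycle counter. The CMSIS register names from core_cm4.h are assumed, and this is only illustrative, not necessarily my exact test code:

#include "stm32f3xx.h"                 /* assumed CMSIS device header (GPIOA, DWT, CoreDebug) */

static volatile uint16_t array[1000];

uint32_t count_cycles(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the trace/DWT block */
    DWT->CYCCNT = 0;                                  /* reset the cycle counter    */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting             */

    uint32_t start = DWT->CYCCNT;
    array[0] = GPIOA->IDR;                            /* each line -> LDR + STRH    */
    array[1] = GPIOA->IDR;
    /* ... repeated straight-line up to ... */
    array[999] = GPIOA->IDR;
    uint32_t end = DWT->CYCCNT;

    return end - start;                               /* cycles for the 1000 reads  */
}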
Thanks.
2017-01-24 01:49 PM
I think the prefetch reduces the bus contention.
You'll get bubbles in the pipeline if the subsequent instruction has dependencies on the register loaded by the prior instruction.
Assume reads/writes across the AHB/APB may be slower.
Would pairing speed things up?
LDR r5, [r0,#0x00]
LDR r6, [r0,#0x00]
STRH r5, [r1,#0xF92]
STRH r6, [r1,#0xF94]
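In C that pairing would look roughly like this. This is only a sketch assuming a CMSIS device header; whether the compiler actually schedules it as LDR, LDR, STRH, STRH depends on the compiler and optimization level:

static inline void sample_pair(volatile uint16_t *dst)
{
    uint32_t a = GPIOA->IDR;      /* first sample, kept in its own register     */
    uint32_t b = GPIOA->IDR;      /* second sample, independent register        */
    dst[0] = (uint16_t)a;         /* both stores come after both loads, so no   */
    dst[1] = (uint16_t)b;         /* load is immediately consumed by its store  */
}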
2017-01-24 09:19 PM
Pairing may speed that up, but it isn't applicable to 1000 such repetitions because of the limited number of registers and specific needs that don't allow such pairing. I rewrite the assembly generated for gpio->IDR with more lines shown:
LDR  r5, [r0, #0x00]
STRH r5, [r1, #0x00]
LDR  r5, [r0, #0x00]
STRH r5, [r1, #0x01]
LDR  r5, [r0, #0x00]
STRH r5, [r1, #0x02]
LDR  r5, [r0, #0x00]
STRH r5, [r1, #0x03]
LDR  r5, [r0, #0x00]
STRH r5, [r1, #0x04]
LDR  r5, [r0, #0x00]
STRH r5, [r1, #0x05]
LDR  r5, [r0, #0x00]
STRH r5, [r1, #0x06]
LDR  r5, [r0, #0x00]
STRH r5, [r1, #0x07]
LDR  r5, [r0, #0x00]
STRH r5, [r1, #0x08]
.
.
.
And I think, as mentioned in the TRM, each pair must take three clock cycles:
— LDR R0,[R1,R2]; STR R0,[R3,#20] - normally three cycles total.
But experiments show that each pair takes ~2.5 CPU cycles with prefetch enabled and ~4.49 without prefetch. Despite the similarity and repetitive nature of the instruction pairs, what causes this instability and error?
I've done such measurements with an STM32F4 too; the results are more stable, and there is only a 2-CPU-cycle error over 2000 pairs of LDR and STRH instructions. (As both the STM32F3 and F4 are based on the Cortex-M4 core, what causes this difference?)
I repeat this segment of the TRM:
'If the store is to the write buffer, and the write buffer is full or not enabled, the next instruction is delayed until the store can complete.'
What is the relationship between the write buffer and prefetch? Are they the same thing? Do we have access to set/reset them? What I found in the STM32F4 reference manual (RM0090) is:
'Prefetch is enabled by setting the PRFTEN bit in the FLASH_ACR register.'
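For what it's worth, toggling it from code would look like this, if I understand the CMSIS bit definitions correctly. FLASH_ACR_PRFTEN is the F4 name; on the F3 the corresponding bit seems to be named PRFTBE (FLASH_ACR_PRFTBE), but that is my assumption from the headers:

FLASH->ACR |=  FLASH_ACR_PRFTEN;    /* STM32F4: enable instruction prefetch  */
FLASH->ACR &= ~FLASH_ACR_PRFTEN;    /* STM32F4: disable instruction prefetch */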
2017-01-25 03:01 AM
You may also need to look at the effect of fetching the instruction from FLASH.
Your repeated code is made of one 16-bit instruction and one 32-bit instruction.
The STM32F3 series Flash memory is organized in 64-bit words, and the number of wait states is 2 when running at maximum speed, therefore 3 cycles to get 64 bits' worth of instruction code.
As the prefetch (64-bit) and the instructions (16-bit + 32-bit) are not synchronized, there may be steps where the CPU must still wait for the next fetch to continue.
On top of that, there might be some tricks here, as the CPU could take advantage of the fetch wait to perform some of the conflicting operations on R5 in its pipeline. This depends on how the instructions are synchronized with the fetch.
In the STM32F4 series, the Flash memory is organized in 128-bit words, and depending on the voltage range and operating frequency, the fetch could be shorter than the code execution time.
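As a rough back-of-the-envelope check (my assumptions, not a measurement): the LDR here is a 16-bit encoding and the STRH with its large immediate is a 32-bit encoding, so each pair is 48 bits of code, and a 64-bit Flash line covers only 1 1/3 pairs. At 2 wait states a line costs about 3 cycles to read, i.e. roughly 2.25 fetch cycles per pair when no prefetch hides them, which is in the ballpark of the extra ~2 cycles per pair you measured with the prefetch disabled.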
Do you know the number of wait states you were using for your tests on the F3 and on the F4?
2017-01-29 01:14 PM
For the STM32F3 the flash latency is 2 wait states, and for the STM32F4 it is 5.