2011-12-26 11:50 PM
I'm debugging on an Olimex STM32-P103. I think I have the PLL set up at 8 times the 8 MHz crystal. If so, then instruction timing is pretty poor. Here's my loop:
    ldr r0, =portc
    ldr r0, [r0]
    mov r1, #0x80
    mov r2, #0
loop_io:
    str r1, [r0]
    str r2, [r0]
    b loop_io
portc: .word 0x4001100c
2011-12-26 11:52 PM
Even this website doesn't work for me - Chrome on Linux. I'm getting 129 ns in the loop, 43 ns per instruction. Why?
2011-12-26 11:54 PM
I meant 9 times, for a 72 MHz clock.
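As a rough check of these numbers, here is a minimal C sketch of the clock arithmetic. It assumes the PLL is fed directly from the 8 MHz crystal and that the core runs at the PLL output with no further prescaling. The helper names (pll_clock_hz, cycles_to_ps, ns_to_cycles) are made up for illustration and are not from any library.

```c
#include <assert.h>
#include <stdint.h>

/* Core clock from crystal frequency and PLL multiplier.
 * Assumption: no prescaler between the crystal, the PLL and the core. */
static inline uint32_t pll_clock_hz(uint32_t crystal_hz, uint32_t pll_mul)
{
    return crystal_hz * pll_mul;
}

/* Duration of `cycles` core clock cycles, in picoseconds (rounded down). */
static inline uint64_t cycles_to_ps(uint32_t cycles, uint32_t clock_hz)
{
    assert(clock_hz != 0);
    return ((uint64_t)cycles * 1000000000000ULL) / clock_hz;
}

/* Nearest whole number of core cycles in a measured time given in ns. */
static inline uint32_t ns_to_cycles(uint32_t ns, uint32_t clock_hz)
{
    return (uint32_t)(((uint64_t)ns * clock_hz + 500000000ULL) / 1000000000ULL);
}
```

With a multiplier of 9 this gives 72 MHz, or about 13.9 ns per cycle, so the measured 129 ns loop works out to roughly 9 cycles, about 3 cycles per instruction.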
2011-12-27 05:00 AM
Ignoring for a second the impact of flash reads (~42ns), a write through buffer to an APB location is 4 cycles.
So optimally your loop is on the order of 9 cycles. At 72 MHz one cycle is 13.89 ns, so 9 x 13.89 = 125.01 ns.
2011-12-28 07:23 PM
So here's another timing quiz. STM32F103 Olimex board, 72 MHz.
#ifdef TESTING_TIMING
    ldr r0, =portc
    ldr r0, [r0]
    ldr r1, [r0]
    bic r1, #0x80
    str r1, [r0]
#endif
    ldr r1, =tmuxr_system_time
    ldrd r0, r1, [r1]
    ldr r3, =tick_amount /* nS per tick */
    ldr r2, [r3]
    mov r3, #0
    adds r2, r2, r0
    adc r3, r3, r1
    ldr r0, =tmuxr_system_time /* the address of a 64 bit */
    strd r2, r3, [r0]
    // bl scope_off
#ifdef TESTING_TIMING
    ldr r0, =portc
    ldr r0, [r0]
    ldr r1, [r0]
    mov r2, #0x80
    orr r1, r2
    str r1, [r0]
#endif
The scope says 840 ns.
Here's the code generated by gcc.
    push {r0}
    ldr r0, [pc, #164]
    ldr r0, [r0, #0]
    ldr r1, [r0, #0]
    bic.w r1, r1, #128
    str r1, [r0, #0]      /* set scope low */
    ldr r1, [pc, #156]
    ldrd r0, r1, [r1]
    ldr r3, [pc, #152]
    ldr r2, [r3, #0]
    mov.w r3, #0
    adds r2, r2, r0
    adc.w r3, r3, r1
    ldr r0, [pc, #136]
    strd r2, r3, [r0]
    ldr r0, [pc, #124]
    ldr r0, [r0, #0]
    ldr r1, [r0, #0]
    mov.w r2, #128
    orr.w r1, r1, r2
    str r1, [r0, #0]      /* set scope high */
That's 15 instructions, for an average of 56 ns per instruction. There is one read of the portc register and 2 stores to it, so 3 APB operations.
Does this timing make sense?
2011-12-29 10:22 AM
As entertaining an exercise as that might be, the literal loads (via PC) are likely to be quite expensive (they are apt to expose the speed of the flash array, whereas prefetched instruction fetches get masked), and the indirect load followed by a read-modify-write of the GPIO register also seems rather inefficient. I'd tend to use GPIOx_BSRR and GPIOx_BRR, and cache constants in registers. Still, 60 cycles does seem rather high.
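To illustrate the BSRR/BRR suggestion, here is a minimal C sketch, not a drop-in driver. The struct and function names (gpio_regs_t, scope_high, scope_low) are hypothetical. The register offsets follow the STM32F1 GPIO layout, and the pin mask 0x80 is taken from the code in the question. The port is passed in as a pointer so the compiler can keep the base address in a register rather than reloading it from a literal pool each time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register layout of one STM32F1 GPIO port (offsets in comments). */
typedef struct {
    volatile uint32_t CRL;   /* 0x00 */
    volatile uint32_t CRH;   /* 0x04 */
    volatile uint32_t IDR;   /* 0x08 */
    volatile uint32_t ODR;   /* 0x0C: for port C this is 0x4001100c */
    volatile uint32_t BSRR;  /* 0x10: writing 1 to bits 0..15 sets that pin */
    volatile uint32_t BRR;   /* 0x14: writing 1 to bits 0..15 clears that pin */
    volatile uint32_t LCKR;  /* 0x18 */
} gpio_regs_t;

_Static_assert(offsetof(gpio_regs_t, BSRR) == 0x10, "BSRR offset");
_Static_assert(offsetof(gpio_regs_t, BRR) == 0x14, "BRR offset");

#define GPIOC_BASE 0x40011000u   /* port C base address on the STM32F103 */
#define SCOPE_PIN  (1u << 7)     /* 0x80, the bit toggled in the question */

/* Drive the masked pins high with a single store, no read-modify-write. */
static inline void scope_high(gpio_regs_t *port, uint32_t mask)
{
    assert((mask & ~0xFFFFu) == 0);  /* only bits 0..15 are pin-set bits */
    port->BSRR = mask;
}

/* Drive the masked pins low with a single store. */
static inline void scope_low(gpio_regs_t *port, uint32_t mask)
{
    assert((mask & ~0xFFFFu) == 0);
    port->BRR = mask;
}
```

On the target, the port pointer would be (gpio_regs_t *)GPIOC_BASE, so each scope marker costs one store instead of a load, a mask operation and a store.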
Consider also using the DWT cycle counter (DWT_CYCCNT) to benchmark instruction cycle counts.
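A minimal C sketch of the DWT_CYCCNT approach follows, assuming a Cortex-M3 with the DWT unit implemented. The register addresses and bit positions are the architectural ones (DEMCR at 0xE000EDFC, DWT at 0xE0001000). The struct and function names (dwt_regs_t, cyccnt_start, cyccnt_elapsed) are hypothetical, and the registers are passed as pointers only to keep the helpers easy to exercise off-target.

```c
#include <assert.h>
#include <stdint.h>

/* First two registers of the DWT unit. */
typedef struct {
    volatile uint32_t CTRL;    /* 0x00: bit 0 is CYCCNTENA */
    volatile uint32_t CYCCNT;  /* 0x04: free-running core cycle counter */
} dwt_regs_t;

#define DWT_BASE       0xE0001000u
#define DEMCR_ADDR     0xE000EDFCu   /* Debug Exception and Monitor Control */
#define DEMCR_TRCENA   (1u << 24)    /* must be set before the DWT is usable */
#define DWT_CYCCNTENA  (1u << 0)

/* Enable trace, zero the counter, then start it counting. */
static inline void cyccnt_start(volatile uint32_t *demcr, dwt_regs_t *dwt)
{
    assert(demcr && dwt);
    *demcr |= DEMCR_TRCENA;
    dwt->CYCCNT = 0;
    dwt->CTRL |= DWT_CYCCNTENA;
}

/* Cycles between two CYCCNT readings. Unsigned subtraction stays
 * correct across a single 32-bit wraparound of the counter. */
static inline uint32_t cyccnt_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

On the target this would be called as cyccnt_start((volatile uint32_t *)DEMCR_ADDR, (dwt_regs_t *)DWT_BASE), with CYCCNT read before and after the code under test. The difference includes the cost of the reads themselves, so measuring an empty section first gives a baseline to subtract.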