
Reproducing load/store timings claimed by ARM on STM32F4

Question asked by Henk on Jan 26, 2016
Latest reply on Jan 27, 2016 by waclawek.jan

I'm programming an STM32F407VG in assembly and measuring the performance of some code by reading DWT_CYCCNT, but I can't reproduce the cycle counts that ARM documents for the Cortex-M4.
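A minimal C sketch of this kind of DWT_CYCCNT measurement, for reference. The helper names (cyccnt_enable, cycles_elapsed) and the idea of passing the registers in as pointers are assumptions made for illustration, not the code used in this question. The addresses and bit positions are the architectural ones for ARMv7-M.

```c
#include <stdint.h>

/* Architecturally defined debug registers on Cortex-M3/M4 (ARMv7-M). */
#define DEMCR_ADDR          0xE000EDFCu  /* Debug Exception and Monitor Control */
#define DWT_CTRL_ADDR       0xE0001000u  /* DWT control register */
#define DWT_CYCCNT_ADDR     0xE0001004u  /* DWT cycle counter */

#define DEMCR_TRCENA        (1u << 24)   /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)    /* starts the cycle counter */

/* Hypothetical helper: enable and reset the cycle counter.
 * Registers are passed as pointers so the logic can be checked off-target. */
static void cyccnt_enable(volatile uint32_t *demcr,
                          volatile uint32_t *dwt_ctrl,
                          volatile uint32_t *cyccnt)
{
    *demcr    |= DEMCR_TRCENA;        /* DWT must be enabled first */
    *cyccnt    = 0;                   /* start counting from zero */
    *dwt_ctrl |= DWT_CTRL_CYCCNTENA;  /* counter now increments every core cycle */
}

/* Hypothetical helper: cycles between two CYCCNT samples, minus the
 * cost of the sampling itself (measured with an empty code section).
 * Unsigned subtraction stays correct across one 32-bit wraparound. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end, uint32_t overhead)
{
    return (end - start) - overhead;
}
```

On the target, the pointers would be the addresses above cast to volatile uint32_t *. The overhead value has to be measured, since reading DWT_CYCCNT is itself a load and adds cycles to every sample.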

I find that

- after disabling the instruction cache (FLASH_ACR_ICEN)
- after disabling the data cache (FLASH_ACR_DCEN)
- after disabling the flash prefetch buffer (FLASH_ACR_PRFTEN)
- after clocking down to 24 MHz to have zero wait states for reading from flash (and confirming that this is the case by reading RCC_CFGR)
- after ensuring that my 32-bit instructions are word-aligned
- after ensuring that I am loading an aligned word
- after ensuring that there is no dependency with neighbouring instructions
- after ensuring that the assembler/linker are not introducing extra instructions by checking the objdump of the binary

a single "LDR Rx, [Ry,#imm]" instruction always takes 3 cycles, or n+2 when pipelining multiple loads. ARM claims that it can be done in 2 cycles or n+1, respectively.
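The flash-related items in the list above all live in one register, FLASH_ACR. A minimal C sketch of that configuration follows, assuming the STM32F407 bit layout (LATENCY in bits 2:0, PRFTEN bit 8, ICEN bit 9, DCEN bit 10). The macro and function names are hypothetical and chosen not to clash with the vendor headers.

```c
#include <stdint.h>

/* FLASH_ACR fields, assuming the STM32F405/407 layout. */
#define ACR_LATENCY_MASK  0x7u        /* wait states, bits 2:0 */
#define ACR_PRFTEN        (1u << 8)   /* prefetch enable */
#define ACR_ICEN          (1u << 9)   /* instruction cache enable */
#define ACR_DCEN          (1u << 10)  /* data cache enable */

/* Hypothetical helper: return the ACR value with prefetch and both
 * caches disabled and zero wait states selected. Other bits (such as
 * the cache reset bits) are left untouched. */
static uint32_t acr_without_acceleration(uint32_t acr)
{
    acr &= ~(ACR_PRFTEN | ACR_ICEN | ACR_DCEN);
    acr &= ~ACR_LATENCY_MASK;
    return acr;
}

/* Hypothetical helper: non-zero if the given ACR value has no
 * prefetch, no caches, and zero wait states. */
static int acr_is_plain(uint32_t acr)
{
    return (acr & (ACR_PRFTEN | ACR_ICEN | ACR_DCEN | ACR_LATENCY_MASK)) == 0u;
}
```

With the vendor headers this would be used as FLASH->ACR = acr_without_acceleration(FLASH->ACR), followed by reading the register back and checking it with acr_is_plain. Zero wait states is only valid if the core clock is low enough for the supply voltage, which is the reason for clocking down to 24 MHz.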

Where does this additional cycle come from and is it possible to get rid of it?