Reproducing load/store timings claimed by ARM on STM32F4

Question asked by Henk on Jan 26, 2016
I'm programming an STM32F407VG in assembly and measuring the performance of some code by reading DWT_CYCCNT, but I can't reproduce the cycle counts that ARM documents for the Cortex-M4.

I find that

- after disabling the instruction cache (FLASH_ACR_ICE)
- after disabling the data cache (FLASH_ACR_DCE)
- after disabling the flash prefetch buffer (FLASH_ACR_PRFTEN)
- after clocking down to 24 MHz to have zero wait states for reading from flash (and confirming that this is the case by reading RCC_CFGR)
- after ensuring that my 32-bit instructions are word-aligned
- after ensuring that I am loading an aligned word
- after ensuring that there is no dependency with neighbouring instructions
- after ensuring that the assembler/linker are not introducing extra instructions by checking the objdump of the binary

a single "LDR Rx, [Ry,#imm]" instruction always takes 3 cycles, and n back-to-back loads take n+2 cycles. ARM claims 2 cycles for a single load and n+1 cycles for n pipelined loads.
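
To make the arithmetic concrete, here is a minimal sketch in C (not the assembly actually used) of how two DWT_CYCCNT readings might be turned into a cycle count and compared with the claimed and observed figures. The function names are hypothetical, and the `overhead` parameter assumes a calibration run with an empty measured sequence, which is not described in the post.

```c
#include <stdint.h>

/* Cycles for n back-to-back LDRs according to ARM's figures:
 * 2 for a single load, n + 1 when pipelined. */
static inline uint32_t ldr_cycles_claimed(uint32_t n)
{
    return n + 1u;
}

/* Cycles actually measured on the STM32F407: 3 for a single load,
 * n + 2 when pipelined. */
static inline uint32_t ldr_cycles_observed(uint32_t n)
{
    return n + 2u;
}

/* Net cycles between two DWT_CYCCNT readings.
 * Unsigned subtraction gives the right answer even if the 32-bit
 * counter wrapped once between the two reads.
 * `overhead` is the delta measured around an empty sequence
 * (hypothetical calibration step). */
static inline uint32_t cyccnt_delta(uint32_t start, uint32_t end,
                                    uint32_t overhead)
{
    return (end - start) - overhead;
}
```

With these definitions the gap between observed and claimed is one cycle for every n, so the extra cycle looks like a fixed cost per load sequence rather than a cost per load.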

Where does this additional cycle come from and is it possible to get rid of it?
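
For reference, the flash-related items in the list above can be sketched as follows. This is a minimal sketch in C rather than the assembly actually used. The bit positions are assumptions taken from the STM32F405/407 reference manual (LATENCY in bits 2:0, PRFTEN in bit 8, ICE in bit 9, DCE in bit 10), and `FLASH_ACR_LATENCY_MASK` and both helper functions are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed FLASH_ACR bit layout for the STM32F405/407. */
#define FLASH_ACR_LATENCY_MASK 0x7u        /* wait states, bits 2:0 */
#define FLASH_ACR_PRFTEN       (1u << 8)   /* prefetch enable       */
#define FLASH_ACR_ICE          (1u << 9)   /* instruction cache     */
#define FLASH_ACR_DCE          (1u << 10)  /* data cache            */

#define FLASH_ACR_TIMING_BITS \
    (FLASH_ACR_LATENCY_MASK | FLASH_ACR_PRFTEN | FLASH_ACR_ICE | FLASH_ACR_DCE)

/* Return the FLASH_ACR value with zero wait states and with the
 * prefetch buffer and both caches disabled; other bits are kept. */
static inline uint32_t flash_acr_for_timing(uint32_t acr)
{
    return acr & ~FLASH_ACR_TIMING_BITS;
}

/* Check a value read back from FLASH_ACR: true only if latency is
 * zero and prefetch and both caches are off. */
static inline bool flash_acr_is_bare(uint32_t acr)
{
    return (acr & FLASH_ACR_TIMING_BITS) == 0u;
}
```

On the target, the result of `flash_acr_for_timing` would be written back to FLASH_ACR and the register read again to confirm with `flash_acr_is_bare`. The functions take plain values so they can be checked on a host.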