Reproducing load/store timings claimed by ARM on STM32F4

henkdevriesst · 2016-01-26

Posted on January 26, 2016 at 17:44

Hi,

I'm programming an STM32F407VG in assembly and measuring the performance of some code by reading DWT_CYCCNT, but I can't reproduce the load/store cycle counts that ARM documents for the Cortex-M4 at

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/CHDIJAFG.html

I find that

- after disabling the instruction cache (FLASH_ACR_ICE)

- after disabling the data cache (FLASH_ACR_DCE)

- after disabling the flash prefetch buffer (FLASH_ACR_PRFTEN)

- after clocking down to 24 MHz to have zero wait states for reading from flash (and confirming that this is the case by reading RCC_CFGR)

- after ensuring that my 32-bit instructions are word-aligned

- after ensuring that I am loading an aligned word

- after ensuring that there is no dependency with neighbouring instructions

- after ensuring that the assembler/linker are not introducing extra instructions by checking the objdump of the binary

a single "LDR Rx, [Ry, #imm]" instruction always takes 3 cycles, or n+2 when pipelining n loads. ARM claims that it can be done in 2 cycles, or n+1, respectively.

Where does this additional cycle come from and is it possible to get rid of it?
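
For reference, a measurement of this kind can be sketched in C roughly as follows. This is only a minimal sketch under stated assumptions, not the code actually used for the numbers above: the register addresses are the standard ARMv7-M debug addresses for DEMCR, DWT_CTRL and DWT_CYCCNT, while dwt_regs_t, dwt_enable and net_cycles are hypothetical names introduced here for illustration. The registers are passed in as pointers so that the logic can also be checked off-target.

```c
#include <stdint.h>

/* Standard ARMv7-M debug register addresses (Cortex-M3/M4). */
#define DEMCR_ADDR       ((volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL_ADDR    ((volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT_ADDR  ((volatile uint32_t *)(uintptr_t)0xE0001004u)

#define DEMCR_TRCENA        (1u << 24) /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)  /* enables the cycle counter */

/* Hypothetical helper type: the three registers involved in cycle counting. */
typedef struct {
    volatile uint32_t *demcr;
    volatile uint32_t *ctrl;
    volatile uint32_t *cyccnt;
} dwt_regs_t;

/* Turn on trace support, reset the counter, then start counting. */
static void dwt_enable(dwt_regs_t r)
{
    *r.demcr |= DEMCR_TRCENA;
    *r.cyccnt = 0u;
    *r.ctrl  |= DWT_CTRL_CYCCNTENA;
}

/* Cycles spent between two CYCCNT reads, minus the cost of the reads
 * themselves. Unsigned subtraction stays correct across one counter wrap. */
static uint32_t net_cycles(uint32_t start, uint32_t end, uint32_t overhead)
{
    return (end - start) - overhead;
}
```

On the target, one would fill dwt_regs_t with DEMCR_ADDR, DWT_CTRL_ADDR and DWT_CYCCNT_ADDR, call dwt_enable once, read the counter immediately before and after the instructions under test, and obtain the overhead value from two back-to-back reads with nothing in between. Note that the reads of DWT_CYCCNT are themselves loads, so they are subject to the same timing effects being measured.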

Thanks!

#cycle-count-cortex-m4-arm-load

waclawek.jan · 2016-01-27

Posted on January 27, 2016 at 16:44

> My goal is to understand exactly what is happening, so it would be awesome if someone could explain...

I know for sure that ST support is willing to start up their incredibly expensive simulators, capable of cycle-precise simulation of the whole chip, provided you present a good enough incentive...

In the meantime, some more food for thought (measured cycle counts in the comments):

// load-nop-load, instead of the 3 consecutive loads

TEST_SRAM1_LNL, // 11

TEST_SRAM2_LNL, // 13

TEST_CCRAM_LNL, // 12

// one more nop inserted

TEST_SRAM1_LNNL, // 12

TEST_SRAM2_LNNL, // 14

TEST_CCRAM_LNNL, // 13

JW