2016-01-26 08:44 AM
Hi,
I'm programming an STM32F407VG in assembly, measuring performance of some code by reading DWT_CYCCNT, and I can't reproduce the cycle counts as listed at http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/CHDIJAFG.html
for the Cortex M4. I find that

- after disabling the instruction cache (FLASH_ACR_ICE)
- after disabling the data cache (FLASH_ACR_DCE)
- after disabling the flash prefetch buffer (FLASH_ACR_PRFTEN)
- after clocking down to 24 MHz to have zero wait states for reading from flash (and confirming that this is the case by reading RCC_CFGR)
- after ensuring that my 32-bit instructions are word-aligned
- after ensuring that I am loading an aligned word
- after ensuring that there is no dependency with neighbouring instructions
- after ensuring that the assembler/linker are not introducing extra instructions by checking the objdump of the binary

a single "LDR Rx, [Ry, #imm]" instruction always takes 3 cycles, or n+2 when pipelining multiple loads. ARM claims that it can be done in 2 cycles or n+1, respectively.

Where does this additional cycle come from and is it possible to get rid of it?

Thanks!

#cycle-count-cortex-m4-arm-load

2016-01-27 07:44 AM
> My goal is to understand exactly what is happening, so it would be awesome if someone could explain...
I know for sure that ST support is willing to start up their incredibly expensive simulators capable of cycle-precise simulation of the whole chip, provided you present a good enough incentive...

In the meantime, some more food for thought:

// load-nop-load, instead of the 3 consecutive loads
TEST_SRAM1_LNL, // 11
TEST_SRAM2_LNL, // 13
TEST_CCRAM_LNL, // 12
// one more nop inserted
TEST_SRAM1_LNNL, // 12
TEST_SRAM2_LNNL, // 14
TEST_CCRAM_LNNL, // 13

JW
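
For anyone trying to reproduce the measurement, here is a rough C sketch of the setup described in the question: switch off the flash prefetch and caches, enable the DWT cycle counter, and compare the documented n+1 figure against the observed n+2. It is not the original poster's code (that was assembly). The helper names are made up for illustration, and the registers are passed in as pointers (an assumption made so the logic can also be checked on a host machine). The bit positions follow the STM32F4 reference manual and the ARMv7-M debug registers.

```c
#include <stdint.h>

/* FLASH_ACR bits on the STM32F4; names follow the question's spelling. */
#define FLASH_ACR_LATENCY_MASK 0x7u        /* bits 2:0, flash wait states */
#define FLASH_ACR_PRFTEN       (1u << 8)   /* prefetch buffer enable */
#define FLASH_ACR_ICE          (1u << 9)   /* instruction cache enable */
#define FLASH_ACR_DCE          (1u << 10)  /* data cache enable */

/* ARMv7-M core debug bits. */
#define DEMCR_TRCENA           (1u << 24)  /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA     (1u << 0)   /* starts the cycle counter */

/* Hypothetical helper: clear prefetch, I-cache and D-cache enables. */
static void flash_disable_acceleration(volatile uint32_t *flash_acr)
{
    *flash_acr &= ~(FLASH_ACR_PRFTEN | FLASH_ACR_ICE | FLASH_ACR_DCE);
}

/* Hypothetical helper: extract the configured wait states from an ACR value. */
static uint32_t flash_wait_states(uint32_t flash_acr_value)
{
    return flash_acr_value & FLASH_ACR_LATENCY_MASK;
}

/* Hypothetical helper: enable tracing, zero the counter, start counting. */
static void cyccnt_enable(volatile uint32_t *demcr,
                          volatile uint32_t *dwt_ctrl,
                          volatile uint32_t *dwt_cyccnt)
{
    *demcr |= DEMCR_TRCENA;
    *dwt_cyccnt = 0u;
    *dwt_ctrl |= DWT_CTRL_CYCCNTENA;
}

/* Unsigned subtraction stays correct across one wrap of the 32-bit counter. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* n back-to-back LDRs: n + 1 per the ARM timing table, n + 2 as observed. */
static uint32_t ldr_cycles_documented(uint32_t n) { return n + 1u; }
static uint32_t ldr_cycles_observed(uint32_t n)   { return n + 2u; }
```

On the target, the pointers would be the memory-mapped registers: FLASH_ACR at 0x40023C00, DEMCR at 0xE000EDFC, DWT_CTRL at 0xE0001000 and DWT_CYCCNT at 0xE0001004. Reading DWT_CYCCNT before and after the code under test and passing both values to cycles_elapsed() gives the raw count. The overhead of the reads themselves still has to be subtracted, and this sketch does not try to model it.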