Reproducing load/store timings claimed by ARM on STM32F4

henkdevriesst · ‎2016-01-26

Posted on January 26, 2016 at 17:44

Hi,

I'm programming an STM32F407VG in assembly and measuring the performance of some code by reading DWT_CYCCNT, but I can't reproduce the cycle counts that ARM lists at

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/CHDIJAFG.html

for the Cortex M4.

I find that

- after disabling the instruction cache (FLASH_ACR_ICE)

- after disabling the data cache (FLASH_ACR_DCE)

- after disabling the flash prefetch buffer (FLASH_ACR_PRFTEN)

- after clocking down to 24 MHz to have zero wait states for reading from flash (and confirming that this is the case by reading RCC_CFGR)

- after ensuring that my 32-bit instructions are word-aligned

- after ensuring that I am loading an aligned word

- after ensuring that there is no dependency with neighbouring instructions

- after ensuring that the assembler/linker are not introducing extra instructions by checking the objdump of the binary

a single "LDR Rx, [Ry,#imm]" instruction always takes 3 cycles, or n+2 when pipelining multiple loads. ARM claims that it can be done in 2 cycles or n+1, respectively.

Where does this additional cycle come from and is it possible to get rid of it?

Thanks!
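
The flash-related items in the list above can be sketched in C as bit manipulation on FLASH_ACR. This is a minimal sketch, assuming the STM32F40x layout of that register (LATENCY in bits 2:0, PRFTEN in bit 8, ICEN in bit 9, DCEN in bit 10); the macro and function names are hypothetical, not ST's header names, and the functions operate on a plain value so they can be checked off-target. Verify the bit positions against the reference manual before relying on them.

```c
#include <stdint.h>

/* Assumed FLASH_ACR layout on STM32F40x; check the reference manual. */
#define ACR_LATENCY_MASK 0x7u        /* bits 2:0, programmed wait states */
#define ACR_PRFTEN       (1u << 8)   /* prefetch buffer enable */
#define ACR_ICEN         (1u << 9)   /* instruction cache enable */
#define ACR_DCEN         (1u << 10)  /* data cache enable */

/* Return acr with prefetch and both caches disabled and zero wait
 * states requested; all other bits are left unchanged. */
uint32_t acr_plain_zero_ws(uint32_t acr)
{
    return acr & ~(ACR_LATENCY_MASK | ACR_PRFTEN | ACR_ICEN | ACR_DCEN);
}

/* Return the number of wait states currently programmed in acr. */
uint32_t acr_wait_states(uint32_t acr)
{
    return acr & ACR_LATENCY_MASK;
}
```

On target, the value would be read from and written back to the FLASH_ACR register, and reading it back through acr_wait_states() is one way to confirm the latency setting directly.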

#cycle-count-cortex-m4-arm-load

waclawek.jan · ‎2016-01-26

Posted on January 26, 2016 at 19:21

Show us more context, preferably a minimal but easily reproducible example.

JW

re.wolff9 · ‎2016-01-26

Posted on January 26, 2016 at 21:58

It COULD be that what STM calls "zero waitstate" is actually ONE waitstate.

(I've seen a hint to that effect somewhere, but I don't remember what or where).

Update: maybe (Clive knows what he's talking about) here: https://my.st.com/public/STe2ecommunities/mcu/Lists/cortex_mx_stm32/Flat.aspx?RootFolder=%2fpublic%2fSTe2ecommunities%2fmcu%2fLists%2fcortex_mx_stm32%2fcache&FolderCTID=0x01200200770978C69A1141439FE559EB459D7580009C4E14902C3CDE46A77F0FFD06506F5B&currentviews=29

gregstm · ‎2016-01-26

Posted on January 27, 2016 at 07:36

Try adding ".w" to ldr (and str). It helped with my routines. 32-bit instructions work better with the pipeline.

henkdevriesst · ‎2016-01-27

Posted on January 27, 2016 at 09:31

Take the following example:

=== somefunction.s ===

.syntax unified
.cpu cortex-m4
.align 4
.global somefunction
.type somefunction,%function
somefunction:
eor.w r0, r1
eor.w r0, r1
eor.w r0, r1
//ldr.w r2, [sp, #0]
//ldr.w r2, [sp, #0]
eor.w r0, r1
eor.w r0, r1
eor.w r0, r1
eor.w r0, r1
eor.w r0, r1
bx lr
.size somefunction, .-somefunction
.end

=== main.c ===

//includes...
extern unsigned int somefunction(void);

int main(void) {
    //(...)
    unsigned int oldcount = DWT_CYCCNT;
    unsigned int res = somefunction();
    unsigned int cyclecount = DWT_CYCCNT - oldcount; //yes, this goes wrong when the cycle counter overflows, but that is irrelevant here
    sprintf(output, "cyc: %u", cyclecount);
    send_USART_str(output); //defined elsewhere, irrelevant
    //(...)
}

This takes 14 cycles. After adding one ldr.w, it always takes 17 cycles, regardless of the position of the instruction (both absolute and relative to the ICode fetches). I'm using only 32-bit instructions here (except for the bx lr, but that is constant) to experiment and not mess up alignment. I find it hard to believe that ARM would post timings that depend on the availability of a cache, as the Cortex-M4 by itself does not have any caches.
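
The comparison above boils down to two subtractions, sketched here in C. The function names are hypothetical and not part of the original post; the inputs are raw DWT_CYCCNT readings or the totals derived from them. Because the arithmetic is unsigned 32-bit, the difference stays correct across a single wrap of the counter.

```c
#include <stdint.h>

/* Elapsed cycles between two DWT_CYCCNT readings. Unsigned subtraction
 * is modulo 2^32, so one counter wrap between the reads is harmless. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Cycles attributable to the instructions that were added to the
 * baseline routine: total with the extra code minus the baseline total. */
uint32_t added_cost(uint32_t baseline, uint32_t with_extra)
{
    return with_extra - baseline;
}
```

With the totals reported in this post, added_cost(14, 17) gives the 3 cycles attributed to the single ldr.w.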

AvaTar · ‎2016-01-27

Posted on January 27, 2016 at 10:35

Try executing at low clock frequency, e.g. 16 MHz.

Actual Flash speed is vendor dependent, and had been 25 MHz for ST lately. Such caches / instruction prefetch buffers are a good way to hide Flash slowness.

Competitors are better here - this waitstate limit is 40 MHz for TI TM4C, 75 MHz for Spansion FM3/FM4, and 100 MHz for Renesas RX600. At least those are the numbers I heard last time ...

henkdevriesst · ‎2016-01-27

Posted on January 27, 2016 at 11:24

I said that I clocked it down to 24 MHz. That should be sufficiently low then, right?

GHITH.Abdelhamid · ‎2016-01-27

Posted on January 27, 2016 at 12:10

Hello,

It would be interesting to know the value of the SP register.

BR,

AvaTar · ‎2016-01-27

Posted on January 27, 2016 at 12:19

> I said that I clocked it down to 24 MHz. That should be sufficiently low then, right?

Yes, if you actually configured zero wait-states in FLASH_ACR.

waclawek.jan · ‎2016-01-27

Posted on January 27, 2016 at 15:44

Don't forget that these chips are SoCs, and what ARM promises is at the processor's interface. Whatever is beyond that (and load/store is a prime example of "beyond the processor", as it is sure to access the "rest of the chip") is up to the chipmaker (okay, the first thing which slows you down is the AHB bus, which is still ARM, but a different "product" of theirs). The delays/waitstates imposed by various components then add up in a rather complex manner.

Attached is a crude benchmark highlighting differences between accessing various memory areas. This was run on an STM32F407 with clocks in reset state (i.e. APBx=AHB=16MHz HSI, "plain" FLASH interface). While the timing reset/read in C is a "compiler-unsafe" method, I checked manually and it appears to follow the same pattern in all cases (and you can check yourself, binary/disasm attached). I am not going to attempt to explain or discuss the results.

Welcome to the joys of 32-bit computing! ;)

JW

TEST_NOP,               //  8 // nop instead of load
TEST_SRAM1,             //  9 // load from SRAM1
TEST_SRAM2,             // 10 // load from SRAM2
TEST_SRAM1_alt_address, // 11 // load from SRAM1 mapped to 0
TEST_CCRAM,             // 10 // load from CCRAM
TEST_FLASH,             //  9 // load from FLASH
TEST_SRAM1_3,           // 11 // 3 consecutive loads from SRAM1
TEST_SRAM2_3,           // 12 // 3 consecutive loads from SRAM2
TEST_CCRAM_3,           // 12 // 3 consecutive loads from CCRAM

The tests are roughly:

*DWT_CYCCNT = 0;
__asm(
    "movw r1, #0x0000 \n\t" // low half of address
    "movt r1, #0x2000 \n\t" // high half of address
    "ldr r0, [r1] \n\t" // perform the memory load itself
    "nop \n\t" // some two nop-s
    "nop \n\t"
);
timA[i] = *DWT_CYCCNT;
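
The figures in the list above can be compared with a little arithmetic, sketched here in C. The function names are hypothetical, and the second helper assumes that each three-load test differs from its single-load counterpart only by two additional loads, which the post does not state. The results are differences between the posted numbers and say nothing about their cause.

```c
#include <assert.h>
#include <stdint.h>

/* Extra cycles of a load test relative to the TEST_NOP baseline. */
uint32_t extra_over_nop(uint32_t test_cycles, uint32_t nop_cycles)
{
    return test_cycles - nop_cycles;
}

/* Average cost of each additional back-to-back load, given the total
 * for one load and the total for n loads (n must be at least 2). */
uint32_t per_additional_load(uint32_t one_load, uint32_t n_loads, uint32_t n)
{
    assert(n > 1);
    return (n_loads - one_load) / (n - 1);
}
```

For example, extra_over_nop(9, 8) gives 1 extra cycle for the SRAM1 load, extra_over_nop(10, 8) gives 2 for SRAM2 and CCRAM, and per_additional_load(9, 11, 3) gives 1 cycle for each further SRAM1 load under the stated assumption.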

________________

Attachments :

cl.zip : https://st--c.eu10.content.force.com/sfc/dist/version/download/?oid=00Db0000000YtG6&ids=0680X000006I0wn&d=%2Fa%2F0X0000000bh7%2FbV8E3CBJY7re4WQ3bsxJQ1HhMRDxKzPax8_9JNpqBXk&asPdf=false