prefetch queue, branch cache question

alandras · ‎2007-09-25

Posted on September 25, 2007 at 23:39

alandras · ‎2011-05-17

Posted on May 17, 2011 at 09:47

Hi,

I was trying to optimize the interlocks of a tight loop and ended up with this basic experiment:

With PFQBC disabled:

1:  subs r0,#4
    bpl 1b

gives 12 cycles per iteration, and

1:  nop
    subs r0,#4
    bpl 1b

gives 13 cycles.

With PFQBC enabled:

1:  subs r0,#4
    bpl 1b

gives 5 cycles per iteration, but surprisingly,

1:  nop
    subs r0,#4
    bpl 1b

gives 10 cycles.

Each additional nop adds 1 cycle, so I guess my cycle counting is OK.
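To make the comparison easier to see, here is a rough sketch in Python that tabulates the four measurements above and subtracts the cycles that straight-line execution would account for. The one-cycle-per-instruction figure is an assumption used only for illustration, and the names `MEASURED`, `loop_overhead` and `overhead_table` are made up for this sketch; nothing here models the actual STR9 pipeline.

```python
# Measured cycles per iteration, copied from the experiment above.
# Key: (PFQBC enabled?, instructions in the loop including the branch)
MEASURED = {
    (False, 2): 12,  # subs + bpl
    (False, 3): 13,  # nop + subs + bpl
    (True, 2): 5,    # subs + bpl
    (True, 3): 10,   # nop + subs + bpl
}


def loop_overhead(cycles, instructions, cycles_per_instruction=1):
    """Cycles left over after charging each instruction an assumed fixed cost."""
    return cycles - instructions * cycles_per_instruction


def overhead_table(measured=MEASURED):
    """Map each measurement key to its unexplained per-iteration overhead."""
    return {key: loop_overhead(cycles, key[1]) for key, cycles in measured.items()}


if __name__ == "__main__":
    for (enabled, count), extra in overhead_table().items():
        state = "enabled" if enabled else "disabled"
        print(f"PFQBC {state}, {count} instructions: {extra} cycles of overhead")
```

Under that assumption the overhead is 10 cycles for both loops with PFQBC disabled, but 3 cycles for the two-instruction loop and 7 cycles for the three-instruction loop with PFQBC enabled. The 4 extra cycles in the last case are what this question is about.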

Can somebody explain to me what is happening?

Thanks a lot,

Andras

alandras · ‎2011-05-17

Posted on May 17, 2011 at 09:47

Could this be the issue?

http://www.embeddedrelated.com/groups/lpc2000/show/20486.php

Quote:

The STR9 is faster when it can execute at 96MHz in a straight line as the flash can feed the CPU on each cycle. However, when a branch is taken you need the ARM9 core to refill its pipeline, but to do that it needs instructions. The *first* instruction is provided by the branch cache but the PFQ is empty--it burst fills at the full 96MHz rate but the core is left waiting whilst it does so. We did some extensive benchmarking to characterize this because, as I said, we were astounded by the reports coming back from potential customers of the STR9 which asked us why it was so slow on their applications. The information was fed back to ST who acknowledged that there is an issue and that it will be fixed in the next rev, but all that will happen is that the BTC is increased in size IIRC.

--

Paul Curtis, Rowley Associates Ltd

http://www.rowley.co.uk

Notice the *first*?

This could explain the 5 cycles instead of the theoretical 4, but I still don't understand the 10 cycles for the nop, dec, branch loop!

Is this fixed in the FA devices?

Thanks for any input,

Andras