2007-09-25 02:39 PM
prefetch queue, branch cache question
2011-05-17 12:47 AM
Hi,
I was trying to optimize the interlocks of a tight loop and ended up with this basic experiment.

If PFQBC is disabled:

    1:  subs r0,#4
        bpl 1b

gives 12 cycles per iteration, and

    1:  nop
        subs r0,#4
        bpl 1b

gives 13 cycles.

If PFQBC is enabled:

    1:  subs r0,#4
        bpl 1b

gives 5 cycles per iteration, but surprise,

    1:  nop
        subs r0,#4
        bpl 1b

gives 10 cycles.

Additional nops increment the cycle count by 1 each, so I guess my cycle counting is OK. Can somebody explain to me what is happening?

Thanks a lot,
Andras

2011-05-17 12:47 AM
Could this be the issue?
http://www.embeddedrelated.com/groups/lpc2000/show/20486.php

Quote:
The STR9 is faster when it can execute at 96MHz in a straight line as the flash can feed the CPU on each cycle. However, when a branch is taken you need the ARM9 core to refill its pipeline, but to do that it needs instructions. The *first* instruction is provided by the branch cache but the PFQ is empty--it burst fills at the full 96MHz rate but the core is left waiting whilst it does so.

We did some extensive benchmarking to characterize this because, as I said, we were astounded by the reports coming back from potential customers of the STR9 which asked us why it was so slow on their applications. The information was fed back to ST who acknowledged that there is an issue and that it will be fixed in the next rev, but all that will happen is that the BTC is increased in size IIRC.

-- Paul Curtis, Rowley Associates Ltd

Noticed the *first*? This could explain the 5 cycles instead of the theoretical 4, but I still don't get the 10 cycles for nop, dec, branch! Is this fixed in the FA devices?

Thanks for any input,
Andras
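
For reference, here are the four measurements from the first post side by side, with the cost of the first added nop worked out for each PFQBC setting. This is only a quick tabulation sketch in Python: the numbers are the ones reported above, while the dictionary keys and function names are made up for illustration and say nothing about how the STR9 works internally.

```python
# Cycles per loop iteration as reported in the first post of this thread.
# Keys are illustrative labels, not official terminology.
measured = {
    "PFQBC disabled": {"subs+bpl": 12, "nop+subs+bpl": 13},
    "PFQBC enabled": {"subs+bpl": 5, "nop+subs+bpl": 10},
}


def nop_cost(counts):
    """Extra cycles per iteration caused by adding the first nop."""
    return counts["nop+subs+bpl"] - counts["subs+bpl"]


def summarize(table):
    """Map each PFQBC setting to the cost of the first nop."""
    return {mode: nop_cost(counts) for mode, counts in table.items()}


if __name__ == "__main__":
    for mode, extra in summarize(measured).items():
        print(f"{mode}: first nop costs {extra} cycle(s) per iteration")
```

With PFQBC disabled the first nop costs the expected 1 cycle; with PFQBC enabled it costs 5, which is the jump the question is about.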