Clock speed, instruction timing not squaring up

BRP
Associate II

Hi all,

I'm experimenting on an STM32F723disco board, trying out some audio DSP. I seem to be getting abysmal processing speed.

So as a first sanity check I did a simple nop timing test:

  • Turn on a GPIO pin, do 50 nops, turn the GPIO pin off. (By 50 nops, I mean 50 lines saying asm("nop") )
  • Turn the GPIO pin on again, do 100 nops, turn the GPIO pin off.
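For anyone wanting to reproduce the test, here is a minimal sketch of the two steps above. It is not the original poster's code: the original used 50 (and 100) literal asm("nop") lines, while this version builds the same straight-line sequence with macros. The function name, the `bsrr` parameter and the pin mask are assumptions for illustration; on the target you would pass the address of the GPIO port's bit set/reset register (for example `&GPIOA->BSRR`) and the mask of whichever pin you are probing.

```c
#include <stdint.h>

/* Straight-line NOPs built by macro expansion: no loop, so no counter
   or branch instructions end up between the two GPIO writes. */
#define NOP()     __asm__ volatile ("nop")
#define NOP5()    do { NOP(); NOP(); NOP(); NOP(); NOP(); } while (0)
#define NOP10()   do { NOP5(); NOP5(); } while (0)
#define NOP50()   do { NOP10(); NOP10(); NOP10(); NOP10(); NOP10(); } while (0)
#define NOP100()  do { NOP50(); NOP50(); } while (0)

/* bsrr:     address of a GPIO bit set/reset register (assumed, see above).
   pin_mask: one bit in the low 16 bits selecting the pin (assumed).
   On STM32, writing a bit in BSRR[15:0] sets the pin and writing the
   same bit in BSRR[31:16] clears it. */
static void nop_timing_test(volatile uint32_t *bsrr, uint32_t pin_mask)
{
    *bsrr = pin_mask;          /* pin high        */
    NOP50();                   /* first pulse     */
    *bsrr = pin_mask << 16;    /* pin low         */

    *bsrr = pin_mask;          /* pin high again  */
    NOP100();                  /* second pulse    */
    *bsrr = pin_mask << 16;    /* pin low         */
}
```

Both pulses carry the same GPIO write overhead, so subtracting the first pulse length from the second leaves only the time of the 50 extra nops.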

This gives me a pair of pulses whose difference in length should correspond to the 50 extra nops the second time around. Result: it seems a nop takes around 9.6 ns, which is about two clock cycles at 216 MHz.

Curiously, ART wasn't enabled, but setting the ARTEN bit had no impact anyway.

I must be doing something terribly wrong.

Cheers,

B.

Accepted Solutions
BRP
Associate II

Looks like I found it. The default settings when you start a project have ART, Prefetch, DCache and ICache disabled. I haven't tested all combinations, but it seems ICache was the one that made the difference.

Without it, a nop averages 1.6 cycles. With it, 0.55 cycles. Presumably I'd get precisely 0.5 running out of ITCM but I've yet to work up the courage to try that.
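A minimal sketch of switching those four features on, assuming a project built on ST's HAL and CMSIS. `SCB_EnableICache()` and `SCB_EnableDCache()` are the CMSIS-Core functions for the Cortex-M7, and the two `__HAL_FLASH_...` macros come from the STM32F7 HAL. The function name `enable_accelerators`, the `STM32F723xx` guard and the off-target stand-ins are assumptions added here so the fragment also compiles on a PC; they are not part of the original post. Only ICache was confirmed in the thread to make the difference, the rest is enabled for completeness.

```c
#include <stdint.h>

#if defined(STM32F723xx)
#include "stm32f7xx_hal.h"   /* brings in CMSIS core_cm7.h and the HAL FLASH macros */
#else
/* Off-target stand-ins (hypothetical): they only record which feature
   was requested, so the sketch can be compiled and checked on a PC. */
static unsigned accel_log;
#define SCB_EnableICache()                    (accel_log |= 1u)
#define SCB_EnableDCache()                    (accel_log |= 2u)
#define __HAL_FLASH_ART_ENABLE()              (accel_log |= 4u)
#define __HAL_FLASH_PREFETCH_BUFFER_ENABLE()  (accel_log |= 8u)
#endif

/* Intended to be called once, early in main(), after clock setup. */
static void enable_accelerators(void)
{
    __HAL_FLASH_ART_ENABLE();               /* sets ARTEN in FLASH_ACR  */
    __HAL_FLASH_PREFETCH_BUFFER_ENABLE();   /* sets PRFTEN in FLASH_ACR */
    SCB_EnableICache();                     /* instruction cache        */
    SCB_EnableDCache();                     /* data cache; DMA buffers then need
                                               cache maintenance or a non-cacheable region */
}
```

Re-running the nop test before and after this call is a quick way to see which setting matters on a given project.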


5 REPLIES 5
tjaekel
Lead

Sorry, what is your question?

When you do 100 NOPs vs. 50 NOPs - you should see a difference in the pulse period.

But if you do NOPs with ICache enabled - there might be a difference in the duration.

A NOP is one instruction per code fetch, so I assume it can take 1 clock cycle. But it is tough to tell, especially with ICache enabled, since the ARM core has a prefetch unit and an instruction pipeline.

It is possible that what you see is "a NOP takes two clock cycles". So which question should be answered?

The ART sounds to me like an "additional prefetching", e.g. when reading instructions from external memory, for internal flash etc.

But when it was fetched already and the ICache is on: the NOPs should come from there (so, the ART does not matter so much, just for the first few instructions).

 

 

With a 216 MHz clock I'm expecting the "100 nop" pulse to be 231 ns longer than the "50 nop" pulse. In reality, the length difference is 480 ns. So 50 nops take 480 ns. That is what I am trying to understand.
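The arithmetic behind those two figures can be written out as a small sketch. The function names are made up for illustration; the inputs are the numbers quoted in this post (216 MHz, 50 extra nops, 480 ns measured).

```c
#include <assert.h>

/* Pulse-length difference (in ns) expected if each nop took `cpi` cycles. */
static double expected_delta_ns(int extra_nops, double f_hz, double cpi)
{
    assert(extra_nops > 0 && f_hz > 0.0);
    return extra_nops * cpi * 1e9 / f_hz;
}

/* Cycles per nop implied by a measured pulse-length difference (in ns). */
static double cycles_per_nop(double delta_ns, int extra_nops, double f_hz)
{
    assert(extra_nops > 0 && f_hz > 0.0);
    return delta_ns * 1e-9 * f_hz / extra_nops;
}
```

With the numbers above, `expected_delta_ns(50, 216e6, 1.0)` comes to about 231.5 ns, while `cycles_per_nop(480.0, 50, 216e6)` comes to about 2.07, i.e. roughly two cycles per nop instead of one.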

I've just not been able to do *any* test that confirms 216 MIPS operation, not even this one consisting of nops only.

(edit: if you've read this before this edit, corrected numbers I'd mistyped)

Great.
This makes sense to me: without ICache enabled - you cannot assume 1 or 2 clock cycles for a NOP. The speed is set by the "Flash_Latency".

BTW: if you use a loop, e.g. a FOR running 50 vs. 100 times (instead of 100 NOPs coded as a straight sequence), you have overhead for the loop check: each NOP now appears longer because of the additional instruction(s) per loop iteration.

Hi,

I think it's odd that ST doesn't turn on all the bells and whistles by default but I'm happy I worked it out.

The other thing I hadn't seen coming (but in retrospect might have) is that nops get dual-issued, hence the half cycle per instruction with ICache on.

The nops are in one continuous string, no loops. And the differential measurement (e.g. 50 vs. 100 nops) was done to ensure that the pipeline contains nothing but nops either way. So I'm fairly certain these timings are actual net values.

And now back to the real work - see if I can get a FIR filter running at one MAC per clock cycle...

Cheers!

B.