STM32F407 DSP - execution speed, clock and cycle definition

Mario Simunic
Associate III

Hi,

I need to calculate DSP speed for the STM32F407, but I can't find ARM documentation on the DSP extension of the Cortex-M4, nor any documentation of the DSP extension implementation in the STM32F4.

I know that the Cortex-M4 has a 3-stage pipeline + branch speculation.

So, if we disregard pipeline, worst case time it takes for one instruction to complete is 4 periods of master clock. Right? On average, due to the pipeline, this time approaches one period of the master clock. Right?

But is the same true for DSP instructions? I think it's not, but I'm not sure.

ARM site only says, quote: "Single cycle dual 16-bit MAC". But what is one cycle in this situation? Is it one cycle of master clock, one DSP instruction cycle or something else?

How is this "Cycle" defined?

Basically, I want to know how many 16-bit MAC instructions per second the STM32F407 can execute at a given master clock frequency.

By the way, on which bus is DSP extension connected in STM32F407?

Thank you very much.

1 ACCEPTED SOLUTION

> I need to calculate DSP speed

> Would suggest you use the DWT CYCCNT register to benchmark the throughput of code sequences in core cycles, ie 168 MHz ticks.

+1

Execution timing is a very, very complex issue with many diverse inputs.

> DSP extension

> By the way, on which bus is DSP extension connected in STM32F407?

There is no DSP "extension". There are instructions which are said to be DSP-oriented.
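To make "DSP-oriented instruction" concrete: the "single cycle dual 16-bit MAC" from the ARM quote corresponds to instructions such as SMLAD, which treats each 32-bit operand as two signed 16-bit halves, multiplies them pairwise and adds both products to an accumulator. Below is a rough portable C model of that arithmetic, as a sketch only. The function name `smlad_model` is made up for illustration, the model ignores the Q (overflow) flag the real instruction can set, and it relies on the usual two's-complement narrowing behaviour of mainstream compilers.

```c
#include <stdint.h>

/* Hypothetical helper: models the arithmetic result of SMLAD.
 * result = acc + lo16(x) * lo16(y) + hi16(x) * hi16(y)
 * The accumulation wraps modulo 2^32; the Q flag is not modelled. */
static int32_t smlad_model(uint32_t x, uint32_t y, int32_t acc)
{
    /* Reinterpret each 16-bit half as a signed value. */
    int16_t x_lo = (int16_t)(x & 0xFFFFu);
    int16_t y_lo = (int16_t)(y & 0xFFFFu);
    int16_t x_hi = (int16_t)(x >> 16);
    int16_t y_hi = (int16_t)(y >> 16);

    /* A 16x16 signed product always fits in 32 bits. */
    int32_t p_lo = (int32_t)x_lo * (int32_t)y_lo;
    int32_t p_hi = (int32_t)x_hi * (int32_t)y_hi;

    /* Add in unsigned arithmetic so wrap-around is well defined. */
    return (int32_t)((uint32_t)acc + (uint32_t)p_lo + (uint32_t)p_hi);
}
```

On the target, a compiler intrinsic or a DSP library would normally emit the real instruction; the model is only meant to show what one such instruction computes, i.e. two 16-bit MACs at once.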

> any documentation

Cortex-M4 Technical Reference Manual and ARMv7-M Architecture Reference Manual. Won't give you links; they are available at ARM's webpage, which is an utter mess; search yourself.

> branch speculation

No, CM4 has no speculation. There are some things in the promo materials which are simply not true, and then there are optional features which the implementers (chipmakers) may or may not implement.

> what is one cycle in this situation?

In CM4, processor clock = system clock = AHB clock = HCLK. This may be different in different ARMs, but here it holds.
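Since a "cycle" here is one HCLK period, the theoretical ceiling asked about in the question is simple arithmetic. The sketch below assumes a dual 16-bit MAC instruction that issues back to back every `cycles_per_insn` cycles; the function name and parameters are illustrative, not from any library. Real loops also need loads, loop overhead and possibly flash wait states, so treat the result as an upper bound, not a prediction.

```c
#include <stdint.h>

/* Hypothetical helper: upper bound on 16-bit MACs per second, assuming
 * one dual-MAC instruction (2 MACs) completes every cycles_per_insn
 * HCLK cycles and nothing else ever stalls the core. */
static uint64_t peak_mac16_per_second(uint32_t hclk_hz, uint32_t cycles_per_insn)
{
    const uint32_t macs_per_insn = 2u; /* "dual 16-bit MAC" */

    if (cycles_per_insn == 0u) {
        return 0u; /* avoid division by zero on bad input */
    }
    return ((uint64_t)hclk_hz * macs_per_insn) / cycles_per_insn;
}
```

With HCLK = 168 MHz and a 1-cycle instruction this gives 336 million 16-bit MACs per second as the ideal ceiling; measured throughput on real code will be lower.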

> So, if we disregard pipeline, worst case time it takes for one instruction to complete is 4 periods of master clock.

No. That is only if the pipeline is empty (i.e. after a jump), only if the instruction takes 1 cycle to complete (most instructions), only if it fetches from a memory with 1-cycle speed (e.g. RAM - FLASH is said to be accelerated but that's... complicated, and the 0WS is simply a marketing lie), only if the instruction does not wait for external input (i.e. a load), only if the instruction does not wait for output (i.e. if it's a store and the output buffer is full or switched off...) etc. etc. etc.

JW

5 REPLIES
Mario Simunic
Associate III

Also, I would appreciate a link to some literature or documentation regarding my question.

Would suggest you use the DWT CYCCNT register to benchmark the throughput of code sequences in core cycles, ie 168 MHz ticks.
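A minimal sketch of that approach, assuming bare register access rather than the CMSIS headers. The addresses and bit positions are the architectural ARMv7-M ones for DEMCR, DWT_CTRL and DWT_CYCCNT; the helper names (`cyccnt_start`, `cyccnt_read`, `cyccnt_elapsed`, `cycles_to_ns`) are invented for this example.

```c
#include <stdint.h>

/* Core debug / DWT registers (ARMv7-M architectural addresses). */
#define DEMCR      (*(volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)(uintptr_t)0xE0001004u)

#define DEMCR_TRCENA       (1u << 24) /* enables the DWT block */
#define DWT_CTRL_CYCCNTENA (1u << 0)  /* enables the cycle counter */

/* Reset and start the free-running cycle counter (target only). */
static inline void cyccnt_start(void)
{
    DEMCR |= DEMCR_TRCENA;
    DWT_CYCCNT = 0u;
    DWT_CTRL |= DWT_CTRL_CYCCNTENA;
}

/* Read the current cycle count (target only). */
static inline uint32_t cyccnt_read(void)
{
    return DWT_CYCCNT;
}

/* Elapsed cycles between two readings; unsigned subtraction stays
 * correct across a single 32-bit counter wrap. */
static inline uint32_t cyccnt_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to nanoseconds for a given core clock in Hz. */
static inline uint64_t cycles_to_ns(uint32_t cycles, uint32_t hclk_hz)
{
    return ((uint64_t)cycles * 1000000000ull) / hclk_hz;
}
```

On the target one would call `cyccnt_start()` once, take `cyccnt_read()` before and after the code sequence of interest, and pass both readings to `cyccnt_elapsed()`. The two reads themselves cost a few cycles, so it helps to time an empty sequence first and subtract that, and to repeat the measurement since flash, caches and other bus masters can change the result.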

TDK
Guru

Instruction times are here. DSP instruction times are at the bottom:

https://developer.arm.com/documentation/ddi0439/b/Programmers-Model/Instruction-set-summary/Cortex-M4-instructions

Note that these are the ideal execution times, assuming perfect (0WS) memories/peripherals attached, with no other system interaction (e.g. no other bus masters, such as debug or DMA, competing for memories/peripherals).

JW