Cortex M7 CPI

Leopold N. · ‎2019-10-10

Hello,

I would like to roughly estimate the time the processor needs to execute some pieces of code.

I know there is the DWT cycle counter, but the values it shows vary from one run to the next for the same instruction. Also, there is no information from ARM about the CPI (I think it's because the pipeline is quite different from the Cortex-M3 and M4).

I just want to understand how the core works in general so I can optimize my code to execute faster in some critical routines.

Does anyone have some information? I didn't find very much on the internet.

Greetings

waclawek.jan · ‎2019-10-11

They indeed removed this information from the Instruction set summary chapter in the Cortex-M7 TRM. ARM is a British company, and this is in line with traditions such as Rolls-Royce not publishing the power output of their (car) engines, stating it only as "adequate". You can look it up in the Cortex-M3 TRM; it certainly did not change that much.

But the raw cycle count - which is 1 for the vast majority of instructions - is almost completely irrelevant in light of the following:

- the processor is superscalar (in a somewhat asymmetric way)
- there are pipeline-level optimizations like branch prediction and (probably) a NOP-skipper
- the caches are only superficially described, and in general the impact of caches on programs in a real-world setting is next to impossible to model precisely
- there are latencies on the AXI and AHB bus matrices which are poorly or not at all described
- there are poorly or not at all described latencies in the system peripherals - bridges between bus matrices, memories and the peripherals themselves

Cycle counting is a next to impossible task on the 'M3/M4; virtually every feature in the 'M7 aimed at speeding things up (except the raw clock speed) introduces more and more complex factors. Face it: the 32-bitters are *not* and never have been micro*controllers* with regard to *control*. So just lean back and enjoy the speed and the glossy ads.

JW

Leopold N. · ‎2019-10-11

I used the STM32F103 with a Cortex-M3 before, and cycle counting was quite okay with it. So I'm a bit disappointed that I now can't calculate anything at all :(

Well, then I will just do some trial and error and see what works best.

Thank you

Tesla DeLorean · ‎2019-10-11

Your option more generally is to test larger blocks of code over multiple iterations and use that to determine a throughput number.
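A minimal sketch of such a measurement, under some assumptions: the names `read_cycles_fn`, `work_fn` and `min_cycles_per_iter` are made up for illustration, the counter is passed in as a callback (on target it would typically return the DWT cycle counter, `DWT->CYCCNT` in CMSIS), and keeping the smallest sample is just one way to filter out runs disturbed by interrupts or cache misses. The `fake_*` helpers exist only so the harness can be dry-run off target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback types. On target, read_cycles would return the
 * free-running cycle counter; here it is left pluggable. */
typedef uint32_t (*read_cycles_fn)(void);
typedef void (*work_fn)(void);

/* Run work() `iterations` times per sample, take `samples` samples, and
 * return the smallest cycles-per-iteration value seen (rounded down). */
uint32_t min_cycles_per_iter(read_cycles_fn read_cycles, work_fn work,
                             uint32_t iterations, uint32_t samples)
{
    assert(iterations > 0 && samples > 0);
    uint32_t best = UINT32_MAX;
    for (uint32_t s = 0; s < samples; s++) {
        uint32_t start = read_cycles();
        for (uint32_t i = 0; i < iterations; i++) {
            work();
        }
        /* Unsigned subtraction stays correct across one counter wrap. */
        uint32_t elapsed = read_cycles() - start;
        uint32_t per_iter = elapsed / iterations;
        if (per_iter < best) {
            best = per_iter;
        }
    }
    return best;
}

/* Host-side stand-ins (assumptions, not real hardware access). */
static uint32_t fake_cycles;
static uint32_t fake_read_cycles(void) { return fake_cycles; }
static void fake_work(void) { fake_cycles += 7; } /* pretend 7 cycles */
```

The result includes loop and call overhead, so it is only useful for comparing variants of the same block against each other, not as an absolute instruction timing.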

You want to avoid situations where the next instruction depends on the result of the current one, for example, and look at the alignment of loop or branch targets.
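To illustrate the dependency point, here is a rough sketch with two hypothetical functions that compute the same sum. In the first, every addition waits on the previous one; in the second, two accumulators form independent chains that a dual-issue core may be able to overlap. Whether this actually helps depends on what the compiler generates and on memory behaviour, so it has to be measured rather than assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Every addition depends on the previous one: a single serial chain. */
uint32_t sum_serial(const uint32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += v[i];
    }
    return acc;
}

/* Two accumulators: neighbouring additions no longer depend on each other. */
uint32_t sum_two_chains(const uint32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    uint32_t a = 0, b = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a += v[i];
        b += v[i + 1];
    }
    if (i < n) {
        a += v[i]; /* odd element left over */
    }
    return a + b;
}
```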

Leopold N. · ‎2019-10-11

How do I align loops or branch targets?

I ask because the assembler/linker converts my code and places it. Should I go through the code and align every line by hand?

waclawek.jan · ‎2019-10-11

In gcc, there is a group of switches, -falign-xxxxx, which supposedly do this.

https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-falign-functions

I have absolutely no experience with these.
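For reference, the group includes -falign-functions, -falign-loops, -falign-jumps and -falign-labels; how much effect they have on a Cortex-M7 build is something to check in the listing file. A per-function alternative in GCC and Clang is the `aligned` attribute, sketched below. The function `hot_checksum`, the helper `is_aligned_to` and the 32-byte value are assumptions for illustration, not a documented Cortex-M7 requirement, and the attribute only affects the function entry, not loops inside it.

```c
#include <assert.h>
#include <stdint.h>

/* Ask the compiler to place this one function on a 32-byte boundary.
 * 32 is an assumed value chosen for the example. */
__attribute__((aligned(32), noinline))
uint32_t hot_checksum(const uint8_t *buf, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; i++) {
        sum += buf[i];
    }
    return sum;
}

/* Returns nonzero if addr is a multiple of boundary (a power of two). */
int is_aligned_to(uintptr_t addr, uintptr_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (addr & (boundary - 1)) == 0;
}
```

To check the placement at run time, clear bit 0 of the function address first, because Thumb function pointers have it set: `is_aligned_to((uintptr_t)&hot_checksum & ~(uintptr_t)1, 32)`. The map file shows the same information without running anything.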

JW

Tesla DeLorean · ‎2019-10-11

Well, you have to tell the assembler what alignment to use on a segment, and then identify where your algorithms loop heavily. Use ALIGN in front of those loops.

Not sure you'd need to do it line-by-line, but you'd really want some grasp of where your code is going to spend most of its time, and which parts are critical. A thoughtful approach would be to write scripts that go through the map and listing files to identify things; tools that output cross-references or static analysis are particularly helpful.

Dynamically, people use profiling and trace tools.
