2023-02-03 05:59 PM
Hello,
I am looking for a list of cycle counts for each instruction in the Cortex-M33 processor. For the Cortex-M4 processor, the ARM Technical Reference Manual includes a table of cycle counts for each instruction. The M7 ARM TRM doesn't seem to have this table and neither does the M33.
Does anyone know where to find a table of cycle counts? Is it perhaps implementation-specific, so that it would be different for an STM32U5 compared to some other M33 implementation? Google has been no help at all. It seems like this should be easy to find.
Thanks,
TG
2023-02-04 04:34 AM
Cortex-M33 is an implementation of ARM v8-M.
There are no cycle counts in the v8-M Architecture Reference Manual either, but does it really matter? The vast majority of instructions are single-cycle, yet the total cycle count is always higher, because it is dominated by complex system-level effects such as memory wait states and bus arbitration.
If you are after some time-critical piece of code, you may be better off measuring it experimentally.
Don't get me wrong, I consider this insufficient documentation and bad practice, but it is what it is. ARM and ST surely have a cycle-accurate simulator, and may be willing to fire it up for you if you represent a significant incentive for them, as expressed in $M++ of purchases.
These are not micro*controllers* anymore, but SoCs, slapped together quite haphazardly from components, glued together by buses and some other logic. This is the price of complexity and performance.
JW
2023-02-04 01:35 PM
Are the majority of instructions single-cycle...? I don't know, since there's no document listing the cycle counts! (Sarcasm... yes, I agree they mostly are and for most cases it doesn't matter.)
I had already looked through the v8-M documents and all other documents I could find from ARM - they are very generic and just define the architectural behaviour and not the implementation-specific points like whether or not your MCU even has an FPU.
This is for time-critical code that has to perform some heavy math on real-time ADC buffers using both the FPU and the integer unit. It started as C code but has been further optimized in assembly. The performance is OK but doesn't leave much time for the CPU to do other things, and I'm at the limit of where I can get without knowing more about the kind of gems found in the M4 manual: e.g. FPU ops taking an extra cycle if their result is consumed by the next instruction, or the addend of VFMA not being needed until a cycle later, or VDIV/VSQRT taking 14 cycles, but if you can find 14 integer instructions to run after the VSQRT then they will all complete before the VSQRT finishes, making the VSQRT nearly free, etc.
Obviously I have avoided using any VDIV/VSQRT, but if I could replace a bunch of math with a single VSQRT (e.g. compute cos as sqrt(1 - sin^2)) and then follow it with N cycles of integer instructions (which I have to execute at some point anyway, but which would require significant code rearrangement), then I might be able to get a big improvement. But is N still 14 on the M33? Probably it is, otherwise there would be marketing about "accurate single-precision square roots in 5 cycles", but who really knows...
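One way to estimate N on a given part by measurement, as a rough sketch: time a VSQRT followed by k independent integer instructions and then one consumer of the result, for k = 0, 1, 2, ..., and look for the point where the total starts to grow. The helper below only does that last analysis step. The names `hidden_filler_count`, `totals` and `slack` are made up for illustration, and the model assumes each filler instruction costs one cycle and nothing else stalls, which may not hold on a real system. The measurement kernels themselves would have to be hand-written assembly and are not shown.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from any vendor library).
 *
 * totals[k] is assumed to hold the measured cycle count for:
 *   VSQRT, then k independent integer instructions, then one use of the result.
 *
 * While the filler instructions fit inside the VSQRT latency the total should
 * stay flat; once they no longer fit it should start rising.  The function
 * returns the largest k (scanning upward from 0) whose total is still within
 * `slack` cycles of the k = 0 measurement.
 */
size_t hidden_filler_count(const uint32_t *totals, size_t count, uint32_t slack)
{
    if (count == 0) {
        return 0;
    }

    size_t k = 0;
    while (k + 1 < count && totals[k + 1] <= totals[0] + slack) {
        ++k;
    }
    return k;
}
```

A `slack` of 0 demands an exactly flat curve; a value of 1 or 2 tolerates a little measurement jitter at the cost of a slightly optimistic answer.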
I've been profiling using the DWT cycle counter and it has helped tremendously to get where we are but it has come to the point where it isn't "minor tweaks" that will save cycles but perhaps larger algorithm changes informed by the CPU instruction stream behaviours. Trying to navigate that blindly without any sort of instruction reference is tough!
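For anyone following along, a minimal sketch of the kind of DWT-based timing described above. The register addresses are the architectural DEMCR and DWT ones; the helper names (`cyccnt_init`, `cyccnt_read`, `cycles_between`, `min_cycles`) and the host stand-in branch are assumptions for illustration. A CMSIS project would normally use the `DWT->CYCCNT` definitions from its device header rather than raw addresses.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_PROFILE) && (__ARM_ARCH_PROFILE == 'M')
/* Architectural debug registers (Armv7-M / Armv8-M Mainline). */
#define DEMCR_REG      (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL_REG   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT_REG (*(volatile uint32_t *)0xE0001004u)

static inline void cyccnt_init(void)
{
    DEMCR_REG |= (1u << 24);   /* TRCENA: enable the DWT block */
    DWT_CYCCNT_REG = 0u;       /* reset the counter */
    DWT_CTRL_REG |= 1u;        /* CYCCNTENA: start counting */
}

static inline uint32_t cyccnt_read(void)
{
    return DWT_CYCCNT_REG;
}
#else
/* Host stand-in (assumption) so the arithmetic can be exercised off-target. */
static uint32_t fake_cycles;
static inline void cyccnt_init(void) { fake_cycles = 0u; }
static inline uint32_t cyccnt_read(void) { return fake_cycles; }
#endif

/* Unsigned subtraction stays correct across a single 32-bit counter wrap. */
static inline uint32_t cycles_between(uint32_t start, uint32_t end)
{
    return end - start;
}

/*
 * Run fn() `runs` times and keep the smallest cycle count seen.  Taking the
 * minimum filters out interrupts and cache or flash effects that only ever
 * add cycles.  The result still includes the call overhead and the cost of
 * reading the counter.
 */
uint32_t min_cycles(void (*fn)(void), unsigned runs)
{
    uint32_t best = UINT32_MAX;
    for (unsigned i = 0; i < runs; ++i) {
        uint32_t t0 = cyccnt_read();
        fn();
        uint32_t t1 = cyccnt_read();
        uint32_t d = cycles_between(t0, t1);
        if (d < best) {
            best = d;
        }
    }
    return best;
}
```

Timing an empty function the same way and subtracting that baseline gives a rough figure for the code under test alone.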
TG
2023-02-04 02:01 PM
Well, the pipelining is designed to hide a lot of the actual execution time; the focus is really on starting a new instruction every cycle, i.e. throughput.
Can't say I've found good documentation, or tools that decompose the stream.
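To illustrate the throughput point, here is a small sketch of the usual way to keep a pipeline busy when a result cannot be consumed on the very next cycle (the M4 behaviour mentioned earlier in the thread): split one dependent chain into several independent ones. Whether the M33 FPU has the same penalty is exactly what is undocumented, so any gain would have to be confirmed by measurement. The function names are hypothetical, and the second version changes the floating-point summation order, so results can differ in the last bits.

```c
#include <stddef.h>

/* One accumulator: each multiply-accumulate depends on the previous result. */
float dot_chained(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

/*
 * Four independent accumulators: consecutive multiply-accumulates no longer
 * depend on each other, so a result latency of a few cycles can be overlapped
 * with the next operations instead of stalling.
 */
float dot_interleaved(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) {   /* leftover elements when n is not a multiple of 4 */
        acc0 += a[i] * b[i];
    }
    return (acc0 + acc1) + (acc2 + acc3);
}
```

A compiler may or may not schedule the C version as intended; in hand-written assembly the same idea means interleaving instructions from independent chains.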