Instruction Timings not matching ARM Cortex-M4 Manual

Timon1 · ‎2024-08-13

I am working with an STM32F3, a ChipWhisperer and the SimpleSerial library, and I am trying to trace the instructions being executed. While doing so, I noticed that the number of CPU cycles does not match what I expected from the ARM Cortex-M4 instruction timings. For example, I took some base code, added instructions one at a time, and measured the number of cycles after each addition. My measurements are as follows:

instruction | cycles
------------------------
ldrb r9,%[z] | 102
add r4,#194 | 104
mul r8,r12 | 105
eor r5,#198 | 107
strb r11,%[z] | 108

"cycles" is the cumulative number of cycles measured with all instructions up to and including that row added. The add seems to take two cycles because, if the add is not there, the ldrb gets pipelined with an instruction that comes after the instructions under test.
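To make the cumulative readings easier to compare with the per-instruction timings in the manual, the increments between consecutive rows can be computed. The following is a minimal sketch in C; the function name `cycle_deltas` is made up for illustration, and the first entry is left at 0 because the baseline count of the base code is not given in this post.

```c
#include <stddef.h>

/* Convert cumulative cycle counts into per-row increments.
 * delta[0] is set to 0: the baseline count (base code without any
 * added instruction) is not stated in the post, so the cost of the
 * first instruction cannot be derived from the table alone. */
void cycle_deltas(const unsigned *cumulative, size_t n, unsigned *delta)
{
    if (n == 0) {
        return;
    }
    delta[0] = 0; /* unknown without the baseline measurement */
    for (size_t i = 1; i < n; ++i) {
        delta[i] = cumulative[i] - cumulative[i - 1];
    }
}
```

Fed with 102, 104, 105, 107, 108, this gives increments of 2, 1, 2, 1 for add, mul, eor and strb. Note that an increment only says how much the total grew when a row was added; because of pipelining, the extra cycle does not necessarily belong to the instruction in that row.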

What baffles me is the eor instruction. By my measurements, it takes two cycles to complete. Contrary to that, the ARM Cortex-M4 manual explicitly states that an eor instruction only takes one cycle.

Am I overlooking something or does the STM32F3 include something not mentioned in the ARM Cortex-M4 manual that can mess with the cycles?

The attachment contains the base code with all the instructions added between lines 59 and 258. The instructions above are between lines 124 and 128.

Thanks in advance

AScha.3 · ‎2024-08-13

What optimization level did you set? Try -O2, then check again.

If you feel a post has answered your question, please click "Accept as Solution".

Timon1 · ‎2024-08-14

The optimization is set to -Os. According to the docs, it is not a real subset of -O2 but it seems to be close enough in my case. Additionally, I manually verified the instructions in the hex file uploaded to the STM32 and they seemed correct.
I can try it with -O2 and -O0 tomorrow but I don't think it's the compiler.

AScha.3 · ‎2024-08-14

Code is fastest with -Ofast, though I found not much difference from -O2, which is just a mix of speed and code-size optimization.

Try it to see if it's better.

STOne-32 · ‎2024-08-14

Dear @Timon1,

Thank you for the interesting post. Here is how I interpret your results:

instruction | cycles
------------------------
ldrb r9,%[z] | 102 => 2 cycles
add r4,#194 | 104 => 1 cycle
mul r8,r12 | 105 => 2 cycles
eor r5,#198 | 107 => 1 cycle
strb r11,%[z] | 108 => ? cycles

Let me know if I missed something. I would then recommend running at 24 MHz from Flash (set 0 wait states), or using CCM RAM for code, which is available on some F3 series parts. Have a look at this application note:

https://www.st.com/resource/en/application_note/an4296-use-stm32f3stm32g4-ccm-sram-with-iar-embedded-workbench-keil-mdkarm-stmicroelectronics-stm32cubeide-and-other-gnubased-toolchains-stmicroelectronics.pdf

Basically, the idea is to keep the whole system at 0 wait states and deterministic, and to use the full features of the M4 core:

I-bus and D-bus on a perfect memory

S-bus on a perfect RAM, different from the first one, to avoid bottlenecks on the memory side.

Hope it helps,

STOne-32

Tesla DeLorean · ‎2024-08-14

The F1, F0 and F3 all expose the relative slowness of the FLASH, on the order of 35-42 ns, regardless of core frequency.

Later models used "ART" to cache the front end of the FLASH array, using the flash-line width to preload, and then using an expedited prefetch data path to service within the current cycle, i.e. faster than 0-wait.

The CCM memory can execute code via the TCM, so effectively 1-cycle, 0-wait for all transactions. Good for stack if you don't want to DMA into it.
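As a rough illustration of why a fixed FLASH access time turns into wait states at higher clocks, here is a small sketch in C. It uses a deliberately simplified model (wait states = core cycles needed to cover one access, minus one). The function name `flash_wait_states` is hypothetical, and the 40 ns used below is an assumed value inside the 35-42 ns range mentioned above, not a datasheet figure.

```c
/* Simplified model, not taken from any reference manual:
 * a flash access needs ceil(access_ns * freq_mhz / 1000) core cycles,
 * and every cycle beyond the first one counts as a wait state.
 *
 * access_ns: assumed flash access time in nanoseconds
 * freq_mhz:  core clock in MHz */
unsigned flash_wait_states(unsigned access_ns, unsigned freq_mhz)
{
    /* Integer ceiling of (access_ns * freq_mhz) / 1000. */
    unsigned cycles = (access_ns * freq_mhz + 999u) / 1000u;
    return cycles > 0u ? cycles - 1u : 0u;
}
```

With the assumed 40 ns access time, this model yields 0 wait states at 24 MHz, 1 at 48 MHz and 2 at 72 MHz, which is consistent with the advice earlier in the thread to run at 24 MHz with 0 wait states. The authoritative limits are the ones in the reference manual for the specific part.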

gregstm · ‎2024-08-14

Not sure if these things will help, but here are things you could try (for your timing test):

- make sure all your instructions are 32 bits wide (e.g. use ldrb.w, eor.w, add.w; not sure if you can force the "mul" instruction to be 32 bits wide, but you could comment it out during your timing tests)

- make sure all these 32-bit instructions are aligned to a 4-byte boundary (e.g. use "ALIGN 4" at the start)

- refer to the "ARM v7-M Architecture Reference Manual" and the "Arm Cortex-M4 Processor Technical Reference Manual" for more information
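The alignment point in the list above can be checked mechanically against the addresses in a listing file. This is a minimal sketch in C; the helper name `is_word_aligned` is made up for illustration, and any addresses passed to it are whatever the listing file reports.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if addr lies on a 4-byte boundary,
 * i.e. its two least significant bits are zero. */
bool is_word_aligned(uint32_t addr)
{
    return (addr & 0x3u) == 0u;
}
```

For example, an instruction listed at a hypothetical address 0x08000124 passes the check, while one at 0x08000126 does not and would straddle two 32-bit words if it is a 32-bit instruction.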

Timon1 · ‎2024-08-21

I tried -O2 and got the same results.

Timon1 · ‎2024-08-21

According to the addresses in the listing file I get from the compiler, the instructions are written to the flash (addresses in the 0x8000000 range). I also checked FLASH_ACR, and the three least significant bits are 0. According to the reference manual, this means that the chip is running with 0 wait states.

When executing the program from the CCM SRAM as described in section 4.1 of the application note, the writes to the GPIOs don't seem to work anymore, which is problematic since I measure the cycles using the ChipWhisperer and artificial triggers. I'm not sure if this is because of different addresses or if it is actually impossible.
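The check described above amounts to masking the three least significant bits of the register value. A minimal sketch in C, with `FLASH_ACR_LATENCY_MASK` and `flash_acr_latency` being illustrative names rather than identifiers from the vendor headers:

```c
#include <stdint.h>

/* The latency (wait state) setting sits in the three least significant
 * bits of FLASH_ACR, as described in the post. The macro name is an
 * assumption made for this sketch. */
#define FLASH_ACR_LATENCY_MASK 0x7u

/* Extract the configured number of wait states from a FLASH_ACR value. */
unsigned flash_acr_latency(uint32_t acr)
{
    return (unsigned)(acr & FLASH_ACR_LATENCY_MASK);
}
```

On the target, the argument would be the value read from the register (for instance `FLASH->ACR`, assuming the CMSIS device header is in use); a result of 0 corresponds to the 0 wait states reported here. The other bits of the register are ignored by this helper.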

Timon1 · ‎2024-08-21

- All these instructions already have the .w suffix in the listing file generated by the compiler

- The listing file shows they are already aligned

- I have looked into both manuals but have not found anything that would explain it. I may have overlooked something, but I have looked into all the relevant-sounding sections