Instruction Timings not matching ARM Cortex-M4 Manual
‎2024-08-13 11:03 PM
I am working with an STM32F3, a ChipWhisperer and the SimpleSerial library, and I am trying to follow the instructions as they are executed. While doing so, I noticed that the number of CPU cycles does not match what I expected from the ARM Cortex-M4 instruction timings. For example, I took a base code, added instructions one at a time, and measured the number of cycles each time. My measurements are as follows:
instruction | cycles
------------------------
ldrb r9,%[z] | 102
add r4,#194 | 104
mul r8,r12 | 105
eor r5,#198 | 107
strb r11,%[z] | 108
"cycles" is the total number of cycles measured with all instructions up to and including that row added. The add seems to take two cycles because, if the add is not there, the ldrb gets pipelined with an instruction that comes after the relevant instructions.
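To make the reading of the table explicit, here is a minimal Python sketch that turns the cumulative totals into the extra cycles seen each time one more instruction is added. It assumes the totals are cumulative as described above; the function name `incremental_cost` is only illustrative. The first row gets no value because the cycle count of the bare base code is not listed.

```python
# Cumulative cycle counts from the table above, in the order the
# instructions were added to the base code.
measurements = [
    ("ldrb r9,%[z]", 102),
    ("add r4,#194", 104),
    ("mul r8,r12", 105),
    ("eor r5,#198", 107),
    ("strb r11,%[z]", 108),
]


def incremental_cost(rows):
    """Return (instruction, extra cycles) for every row after the first."""
    return [
        (instr, total - prev_total)
        for (_, prev_total), (instr, total) in zip(rows, rows[1:])
    ]


deltas = incremental_cost(measurements)
for instr, extra in deltas:
    print(f"{instr:<16} +{extra}")
```

Read this way, adding the add and the eor each costs two extra cycles, while adding the mul and the strb each costs one.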
What baffles me is the eor instruction. By my measurements, it takes two cycles to complete. Contrary to that, the ARM Cortex-M4 manual explicitly states that an eor instruction only takes one cycle.
Am I overlooking something or does the STM32F3 include something not mentioned in the ARM Cortex-M4 manual that can mess with the cycles?
The attachment contains the base code with all the instructions added between lines 59 and 258. The instructions above are between lines 124 and 128.
Thanks in advance
Labels: Documentation, STM32F3 Series
‎2024-08-13 11:22 PM
Which optimization level did you set? Try -O2, then check again.
‎2024-08-14 12:10 PM
The optimization is set to -Os. According to the docs, it is not a real subset of -O2 but it seems to be close enough in my case. Additionally, I manually verified the instructions in the hex file uploaded to the STM32 and they seemed correct.
I can try it with -O2 and -O0 tomorrow but I don't think it's the compiler.
‎2024-08-14 12:42 PM
-Ofast gives the fastest code. Between it and -O2 I found not much difference; -O2 is just a mix of speed and code-size optimization.
Try... to see if it's better.
‎2024-08-14 02:15 PM
Dear @Timon1,
Thank you for the interesting post. Here is how I interpret your results:
instruction | cycles
------------------------
ldrb r9,%[z] | 102 => 2 cycles
add r4,#194 | 104 => 1 cycle
mul r8,r12 | 105 => 2 cycles
eor r5,#198 | 107 => 1 cycle
strb r11,%[z] | 108 => ? cycles
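The per-instruction figures in this table and the two-cycle eor from the original question appear to come from the same step sizes, assigned to different instructions. A small Python sketch of both readings, assuming the totals are cumulative (the short mnemonics and variable names are only for illustration):

```python
names = ["ldrb", "add", "mul", "eor", "strb"]
totals = [102, 104, 105, 107, 108]

# Step between consecutive totals: how much the count grew per added row.
steps = [after - before for before, after in zip(totals, totals[1:])]

# Reading A: each step is the cost of the instruction that was just added.
cost_of_added = dict(zip(names[1:], steps))

# Reading B (the table above): each step is charged to the instruction
# on the previous row, so the last row has no value yet.
cost_of_previous = dict(zip(names[:-1], steps))

print("steps:", steps)
print("A:", cost_of_added)
print("B:", cost_of_previous)
```

Reading B gives eor one cycle and leaves strb open, which matches the table; reading A gives eor two cycles, which matches the question. The measurements alone do not say which attribution is right.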
Let me know if I missed something. I would then recommend running at 24 MHz from Flash (set to 0 wait states), or using the CCM RAM for code, which is available on some F3 series parts; look at this application note.
Basically, keep the whole system at 0 wait states and deterministic, and use the full features of the M4 core:
I-bus and D-bus on a perfect memory
S-bus on a perfect RAM, different from the first one, to avoid bottlenecks on the memory side.
Hope it helps ,
STOne-32
‎2024-08-14 03:23 PM
The F1, F0 and F3 all expose the relative slowness of the FLASH, on the order of 35-42 ns, regardless of core frequency.
Later models used "ART" to cache the front end of the FLASH array, using the flash-line width to preload, and then using an expedited prefetch data path to service within the current cycle, i.e. faster than 0-wait.
The CCM memory can execute code via the TCM, so effectively 1-cycle, 0-wait for all transactions. Good for stack if you don't want to DMA into it.
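As a rough illustration of why the 35-42 ns flash access time mentioned above matters, this Python sketch estimates how many wait states are needed for one access to fit into whole core clock periods. It is a simplified model, not the vendor's rule: the function name `wait_states` is hypothetical, 35 ns is just the lower end of the quoted range, and the actual settings should be taken from the reference manual.

```python
def wait_states(access_time_ns, core_mhz):
    """Wait states needed so one flash access fits in whole core cycles."""
    # access_time_ns * core_mhz / 1000 is the access time expressed in
    # core clock periods; round it up with integer ceiling division.
    cycles = -(-access_time_ns * core_mhz // 1000)
    # The first cycle is the access itself; every further one is a wait.
    return max(cycles - 1, 0)


for mhz in (24, 48, 72):
    print(f"{mhz} MHz -> {wait_states(35, mhz)} wait state(s)")
```

With 35 ns this gives 0, 1 and 2 wait states at 24, 48 and 72 MHz. With 42 ns the same model already gives 1 wait state at 24 MHz, so the result is sensitive to the access time assumed.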
Up vote any posts that you find helpful, it shows what's working..
‎2024-08-14 06:32 PM
Not sure if these will help, but here are some things you could try for your timing test:
- Make sure all your instructions are 32 bits wide (e.g. use ldrb.w, eor.w, add.w; I am not sure whether you can force the "mul" instruction to be 32 bits wide, but you could comment it out during your timing tests).
- Make sure all these 32-bit instructions are aligned to a 4-byte boundary (e.g. use "ALIGN 4" at the start).
- Refer to the "ARM v7-M Architecture Reference Manual" and the "Arm Cortex-M4 Processor Technical Reference Manual" for more information.
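The alignment suggestion above can be checked directly against the addresses in a listing file. A minimal Python sketch, where the addresses and the dictionary layout are made up for illustration and are not taken from the attached code:

```python
def is_aligned(address, boundary=4):
    """True if the address is a multiple of the boundary (in bytes)."""
    return address % boundary == 0


# Hypothetical addresses as they might appear in a listing file.
listing = {
    0x08000124: "ldrb.w",
    0x08000128: "add.w",
    0x0800012E: "eor.w",
}

misaligned = [hex(addr) for addr in listing if not is_aligned(addr)]
print("not 4-byte aligned:", misaligned)
```

In this made-up listing only the last entry falls on a 2-byte boundary, which is the situation that can arise when a 16-bit instruction precedes a 32-bit one.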
‎2024-08-21 05:55 AM
I tried -O2 and got the same results.
‎2024-08-21 08:41 AM
According to the addresses in the listing file I get from the compiler, the instructions are written to the flash (addresses in the 0x8000000 range). I also checked FLASH_ACR, and its three least significant bits are 0. According to the reference manual, this means that the chip is running with 0 wait states.
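For reference, this is roughly how a FLASH_ACR value read back from the chip could be decoded in Python. The latency field in the three least significant bits is as described above; the prefetch-enable bit at position 4 is my assumption about the F3 register layout and should be verified against the reference manual. The example values are hypothetical, not read from the board.

```python
LATENCY_MASK = 0x7      # bits [2:0]: number of wait states
PRFTBE_BIT = 1 << 4     # assumed position of the prefetch buffer enable bit


def decode_flash_acr(value):
    """Split a raw FLASH_ACR value into the fields relevant for timing."""
    return {
        "wait_states": value & LATENCY_MASK,
        "prefetch_enabled": bool(value & PRFTBE_BIT),
    }


# Hypothetical register values, for illustration only.
for raw in (0x30, 0x32, 0x00):
    print(hex(raw), decode_flash_acr(raw))
```

Decoding the whole register rather than only the low three bits also shows whether the prefetch buffer is on, which may be worth ruling out when fetch timing looks irregular.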
When executing the program from the CCM SRAM as described in section 4.1 of the document, the writes to the GPIOs don't seem to work anymore, which is a problem since I measure the cycles using the ChipWhisperer and artificial triggers. I'm not sure whether this is caused by different addresses or whether it is actually impossible.
‎2024-08-21 08:46 AM
- All these instructions already have the .w suffix in the listing file generated by the compiler.
- The listing file shows they are already aligned.
- I have looked into both manuals but have not found anything that would explain it. I may have overlooked something, but I have gone through all the relevant-sounding sections.