STM32N657: how to measure execution time with cycle accuracy

acapola
Associate III

The best I have achieved so far, using the DWT as well as the PMU, is an accuracy of 'a few cycles'. My question is: how do I measure execution time such that I get a 1-cycle difference when I insert a nop in the measured code (assuming interrupts and caches are disabled)?

Both functions below return 1!

.global test_pmu0
.type test_pmu0, %function
.align 2
test_pmu0:
	isb	sy
	dsb	sy
	ldr	r3, .test_pmu0.pmu_base
	ldr	r0, [r3, #0x7C] //read PMU->CCNTR
	ldr	r1, [r3, #0x7C] //read PMU->CCNTR
	sub r0,r1,r0
	bx lr
.align 2
.test_pmu0.pmu_base:
	.word	0xE0003000 //PMU base address
.global test_pmu1
.type test_pmu1, %function
.align 2
test_pmu1:
	isb	sy
	dsb	sy
	ldr	r3, .test_pmu1.pmu_base
	ldr	r0, [r3, #0x7C] //read PMU->CCNTR
	nop
	ldr	r1, [r3, #0x7C] //read PMU->CCNTR
	sub r0,r1,r0
	bx lr
.align 2
.test_pmu1.pmu_base:
	.word	0xE0003000 //PMU base address

 

3 REPLIES
mbarg.1
Senior III

I always keep an oscilloscope on my bench, and toggling a pin is the best marker for interval measurement.

You can also use a counter in interval-measurement mode, but that is not a usual tool for software developers.

Lastly, you can trigger an interrupt and count ticks in between: less accurate, but more of a software approach.
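
The pin-toggle marker mentioned first in this reply could look roughly like the sketch below. It assumes the usual STM32 GPIO BSRR layout (writing bit n sets pin n, writing bit n+16 resets it). The function names `marker_high` and `marker_low` are hypothetical, and the register address and pin number are left to the caller because they depend on the board.

```c
#include <stdint.h>

/* Drive the marker pin high. `bsrr` is assumed to point at the BSRR
   register of the GPIO port that carries the marker pin. */
void marker_high(volatile uint32_t *bsrr, unsigned pin)
{
    *bsrr = 1u << pin;          /* lower half-word: set bits */
}

/* Drive the marker pin low again. */
void marker_low(volatile uint32_t *bsrr, unsigned pin)
{
    *bsrr = 1u << (pin + 16u);  /* upper half-word: reset bits */
}
```

Call `marker_high` just before the code under test and `marker_low` just after it, then read the pulse width on the oscilloscope. The two register writes add their own latency, so it is the difference between two pulse widths that is meaningful, not the absolute width.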

Andrew Neil
Super User

@acapola wrote:

how I measure execution time such that I get 1 cycle difference when I insert a nop in the measured code ?


You're working on a false assumption there - Inserting a NOP does not necessarily cause a 1-cycle difference:

[Screenshot attachment: AndrewNeil_0-1769703991796.png]

https://www.st.com/resource/en/programming_manual/pm0273-stm32-cortexm55-mcus-programming-manual-stmicroelectronics.pdf#page=392

 

TDK
Super User

You can't. The Cortex-M55 has a complex pipeline. Instructions do not complete serially or in isolation; nearby instructions affect how fast the others go. This is in contrast to something like the Cortex-M4, where individual instructions do have documented cycle counts.

If the goal is to optimize a piece of code, profile a chunk that takes some nontrivial amount of time. Then change it and re-profile. Compare the delta. That's the proper approach to optimizing code on platforms like this.
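
A minimal sketch of that profile-and-compare step, in C. It assumes you have already collected several elapsed-cycle readings for each variant of the chunk, each reading being the difference of two cycle-counter reads taken around the whole chunk. The helper names `elapsed_cycles`, `min_cycles` and `cycle_delta` are hypothetical, not part of any ST or Arm library.

```c
#include <stddef.h>
#include <stdint.h>

/* Elapsed count between two reads of a free-running 32-bit cycle counter.
   Unsigned subtraction stays correct across a single counter wrap. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Smallest of n samples (UINT32_MAX if n == 0). Taking the minimum
   discards runs that were disturbed by stalls or stray interrupts. */
uint32_t min_cycles(const uint32_t *samples, size_t n)
{
    uint32_t best = UINT32_MAX;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < best) {
            best = samples[i];
        }
    }
    return best;
}

/* Signed change between the baseline chunk and the modified chunk.
   A negative result means the modified version was faster. */
int64_t cycle_delta(const uint32_t *baseline, size_t n_base,
                    const uint32_t *modified, size_t n_mod)
{
    return (int64_t)min_cycles(modified, n_mod)
         - (int64_t)min_cycles(baseline, n_base);
}
```

The fixed cost of reading the counter appears in both variants and cancels in the delta, which is why comparing two measurements of a long chunk is more robust than trying to time a single instruction.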

Arm-Cortex-M55-Processor-Datasheet.pdf

https://documentation-service.arm.com/static/61952957f45f0b1fbf3a89e4

 
