LDR and STR asm instruction different execution time?

PGogo · ‎2018-11-19

Hello. I am using an STM32F4279ZI MCU with FreeRTOS. I want to compare the efficiency of bare-metal and OS-based implementations of a program. While running some tests I noticed that the same function is much slower when called from the bare-metal application than when called from FreeRTOS, even though the assembler code is the same (I turned off all compiler optimization).

After more detailed tests I noticed that LDR and STR asm instructions take 1 CPU cycle longer in a function called from main() than in the same function called from a FreeRTOS task.

I measure execution time in CPU cycles with the DWT->CYCCNT register.

Do you have any idea what causes this difference?

Here is my example code. I deleted all unnecessary code:

void foo()
{
	uint32_t j = 0;
	uint32_t i;
	DWT->CYCCNT = 0;
	i = 0;
	for (; i < 10000; i++)
	{
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
	}
	j = DWT->CYCCNT;
	printf("%lu\r\n", (unsigned long)j);
}
 
void task(void* param)
{
	foo();
	for (;;)
	{
	};
}
 
int main(void)
{
	foo();
 
	xTaskCreate(task, "task", 100, NULL, 1, NULL);
	vTaskStartScheduler();
}

waclawek.jan · ‎2018-11-21

> Is any documentation about it?

I know of none. I attempted some "benchmarking" back then too; it indicated that SRAM2 in the 'F407, as compared to SRAM1, has a penalty of 1 cycle for the first consecutive access (this might be related to the way its bus arbiter locks/moves around arbitration; this is an intimate detail of the bus matrix implementation/setup which is very unlikely to be shared by ST). I guess the same may apply to SRAM3 in the 'F42x/43x.

~~https://community.st.com/s/question/0D50X00009hnJN6SAM/how-many-cookies-to-feed-st~~

https://community.st.com/s/question/0D50X00009XkedySAB/reproducing-loadstore-timings-claimed-by-arm-on-stm32f4

These chips are very, very complex and there might be other mechanisms involved that impact timing.

JW

Tesla DeLorean · ‎2018-11-19

Alignment of code, branch targets, data or stack?

Code already in ART cache?

PGogo · ‎2018-11-19

Thank you for the reply. I'm going to check all your suggestions.

Sorry for the basic question: how can I set code alignment?

Tesla DeLorean · ‎2018-11-19

I don't do inline assembler; I use .s files and Keil. There you can define the natural alignment of the section, and use the ALIGN directive to assert that.

waclawek.jan · ‎2018-11-19

> I have noticed that LDR and STR asm instructions execute 1 cpu cycle longer in function called from main() function than that one called is FreeRTOS task.

How does the above code demonstrate this?

JW

PGogo · ‎2018-11-20

It isn't in this example. To check this I used:

	DWT->CYCCNT = 0;
	asm("ldr     r3, [r3, #4]");
	printf("ldr %d\r\n", DWT->CYCCNT);
	DWT->CYCCNT = 0;
	asm("str     r3, [r7, #0]");
	printf("str %d\r\n", DWT->CYCCNT);

at the beginning of the foo function.

waclawek.jan · ‎2018-11-20

This test proves nothing, as you don't have control over the content of the registers you use for addressing (r3 in the first instance and r7 in the second). The time needed to access memory depends on that memory, on the bus it sits on, and with writes also on the state of the processor's write buffer. And the execution itself depends on the state of the instruction pipeline, all the way from FLASH/RAM through the processor's interfaces and the processor's pipeline to the execution unit of the processor.

This might also quite well have ended in a crash; C does not expect you to change registers and memory content behind its back.
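One way to keep the measured address under control is to let the compiler do the load through a pointer you pass in, and to take two snapshots of the counter instead of zeroing it. This is only a sketch: `timed_load` is a hypothetical helper, and `cyccnt` is assumed to point at an already enabled free-running cycle counter (on this MCU that would be `&DWT->CYCCNT`). With optimization disabled the compiler also spills locals to the stack between the two reads, so the result is useful for comparing two addresses, not as an absolute LDR timing.

```c
#include <stdint.h>

/*
 * Time a single 32-bit load from a known address.
 * The returned value includes the overhead of reading the counter,
 * so compare results between addresses rather than treating them
 * as exact instruction timings.
 */
uint32_t timed_load(volatile uint32_t *cyccnt,
                    const volatile uint32_t *addr,
                    uint32_t *value_out)
{
    uint32_t start = *cyccnt;   /* snapshot instead of zeroing the counter */
    uint32_t value = *addr;     /* the access being measured */
    uint32_t end   = *cyccnt;

    *value_out = value;
    return end - start;         /* unsigned subtraction survives wrap-around */
}
```

Calling it once with the address of a variable on the main stack and once with the address of a variable on a task stack would show whether the two memories behave differently.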

JW

PGogo · ‎2018-11-20

OK. Thank you for the explanation. However, why does the foo function in my first post take 220524 CPU cycles when called from main, while in the task it takes 200704 CPU cycles?

On the other hand, when I measure the time of the 'i++' operation, it takes 10 CPU cycles vs 8 CPU cycles (faster in the RTOS).
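For reference, the two totals differ by 19820 cycles over 10000 iterations, which is roughly 2 cycles per iteration and matches the 10 vs 8 cycles seen for 'i++'. A small sketch of that arithmetic (`extra_cycles_per_iteration` is just an illustrative name):

```c
#include <stdint.h>

/* Average extra cycles per loop iteration between two measured totals. */
double extra_cycles_per_iteration(uint32_t slow_total,
                                  uint32_t fast_total,
                                  uint32_t iterations)
{
    return (double)(slow_total - fast_total) / (double)iterations;
}
```

With the numbers above, `extra_cycles_per_iteration(220524u, 200704u, 10000u)` gives about 1.98.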

waclawek.jan · ‎2018-11-20

Telling the reason would require closer examination, but one explanation might be that the global stack is allocated in a slower memory than the tasks' stacks.

JW

PGogo · ‎2018-11-21

Is there any documentation about this? When I look at the variables' addresses when foo() is called from main(), they are:

2002ffd8

2002ffdc

2002ffd4

and the variables' addresses when foo() is called from the task:

20000370

20000374

2000036c

so you might be right.
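The two sets of addresses can be checked against the memory map. A small sketch, assuming the SRAM region boundaries given for the STM32F42x/43x in the reference manual (RM0090); `sram_region` is a hypothetical helper, not an ST or FreeRTOS function:

```c
#include <stdint.h>

/* Region boundaries are an assumption taken from the STM32F42x/43x memory map. */
const char *sram_region(uint32_t addr)
{
    if (addr >= 0x10000000u && addr <= 0x1000FFFFu) return "CCM";    /* 64 KB  */
    if (addr >= 0x20000000u && addr <= 0x2001BFFFu) return "SRAM1";  /* 112 KB */
    if (addr >= 0x2001C000u && addr <= 0x2001FFFFu) return "SRAM2";  /* 16 KB  */
    if (addr >= 0x20020000u && addr <= 0x2002FFFFu) return "SRAM3";  /* 64 KB  */
    return "other";
}
```

With these assumed boundaries, 0x2002ffd8 falls in SRAM3 and 0x20000370 falls in SRAM1.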

Is there any way to allocate the global stack in the same memory region that FreeRTOS allocates from?