LDR and STR asm instruction different execution time?

PGogo
Associate II

Hello. I am using an STM32F4279ZI MCU with FreeRTOS. I want to compare the efficiency of bare-metal and OS-based program implementations. While running some tests I noticed that the same function is much slower when called from the bare-metal application than when called from FreeRTOS, even though the assembly code is the same (I turned off all compiler optimization).

After more detailed tests I noticed that the LDR and STR instructions take 1 CPU cycle longer in the function called from main() than in the one called from a FreeRTOS task.

I measure execution time in CPU cycles with the DWT->CYCCNT register.

Do you know, or do you have any idea, what the reason for this difference is?

Here is my example code. I deleted all unnecessary code:

void foo()
{
	uint32_t j = 0;
	uint32_t i;
	DWT->CYCCNT = 0;
	i = 0;
	for (; i < 10000; i++)
	{
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
		asm("NOP");
	}
	j = DWT->CYCCNT;
	printf("%lu\r\n", (unsigned long)j);
}
 
void task(void* param)
{
	foo();
	for (;;)
	{
	};
}
 
int main(void)
{
	foo();
 
	xTaskCreate(task, "task", 100, NULL, 1, NULL);
	vTaskStartScheduler();
}
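
For anyone reproducing this: the listing above leaves out the cycle counter setup. Below is a minimal sketch of what that setup usually looks like on a Cortex-M4, not the original poster's code. The helper names cyccnt_enable and cycles_elapsed are made up for illustration, and the registers are passed as pointers only so the logic can be tried off-target. On the target, the CMSIS equivalents would be CoreDebug->DEMCR, DWT->CTRL and DWT->CYCCNT.

```c
#include <stdint.h>

/* Bit positions as defined for ARMv7-M (Cortex-M4). */
#define DEMCR_TRCENA        (UINT32_C(1) << 24) /* DEMCR: enable the DWT unit  */
#define DWT_CTRL_CYCCNTENA  (UINT32_C(1) << 0)  /* DWT_CTRL: run cycle counter */

/* Hypothetical helper: turn on the trace block, clear the counter and
 * start it. Register pointers are parameters (an assumption made here
 * so the function can also be exercised with plain variables). */
static void cyccnt_enable(volatile uint32_t *demcr,
                          volatile uint32_t *dwt_ctrl,
                          volatile uint32_t *dwt_cyccnt)
{
    *demcr |= DEMCR_TRCENA;          /* DWT is not clocked without this */
    *dwt_cyccnt = 0;                 /* start counting from zero        */
    *dwt_ctrl |= DWT_CTRL_CYCCNTENA; /* enable CYCCNT                   */
}

/* Hypothetical helper: elapsed cycles between two CYCCNT samples.
 * Unsigned subtraction stays correct across one 32-bit wraparound. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

Sampling CYCCNT before and after the loop and subtracting, instead of writing 0 to the counter, also avoids counting the store to the counter itself.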

13 REPLIES

> Is any documentation about it?

I know of none. I attempted some "benchmarking" back then too; it indicated that SRAM2 in the 'F407, compared to SRAM1, has a penalty of 1 cycle for the first consecutive access (this might be related to how its bus arbiter locks/moves arbitration around; this is an intimate detail of the bus matrix implementation/setup which is very unlikely to be shared by ST). I guess the same may apply to SRAM3 in the 'F42x/43x.

https://community.st.com/s/question/0D50X00009hnJN6SAM/how-many-cookies-to-feed-st

 https://community.st.com/s/question/0D50X00009XkedySAB/reproducing-loadstore-timings-claimed-by-arm-on-stm32f4

These chips are very, very complex, and other mechanisms impacting timing might be involved.

JW

PGogo
Associate II

OK. I will try to place the global stack in the same memory area where FreeRTOS places its own. If the results are equal in both implementations, you are right. If not, I will try to contact ST or ARM.

Is that link correct? I don't see any connection with the topic.

PGogo
Associate II

@Community member Thank you for your help. The problem is the location of the global stack. When I changed the stack address from 0x20030000 to 0x20010000 (from SRAM3 to SRAM1), the execution times became equal. The 'Multi-AHB bus matrix' section of the STM32F427xx STM32F429xx datasheet shows that SRAM1 is additionally connected to the core via D-BUS, while SRAM2 and SRAM3 are connected only via S-BUS.
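
To make the two addresses above concrete, here is a small sketch that maps an address to an SRAM bank. It assumes the STM32F42x/43x layout (SRAM1 112 KB at 0x20000000, SRAM2 16 KB at 0x2001C000, SRAM3 64 KB at 0x20020000); please check the reference manual for your exact part. The names sram_bank_of and stack_bank are hypothetical. Note that an initial stack pointer of 0x20030000 is one past the end of SRAM3, so with a full-descending stack the first word pushed lands at 0x2002FFFC, inside SRAM3.

```c
#include <stdint.h>

/* Assumed STM32F42x/43x SRAM layout (not taken from this thread). */
#define SRAM1_START UINT32_C(0x20000000) /* 112 KB */
#define SRAM2_START UINT32_C(0x2001C000) /*  16 KB */
#define SRAM3_START UINT32_C(0x20020000) /*  64 KB */
#define SRAM3_LIMIT UINT32_C(0x20030000) /* first address past SRAM3 */

/* Hypothetical helper: returns 1, 2 or 3 for SRAM1..SRAM3,
 * or 0 if the address is outside these three banks. */
static int sram_bank_of(uint32_t addr)
{
    if (addr < SRAM1_START || addr >= SRAM3_LIMIT) return 0;
    if (addr < SRAM2_START) return 1;
    if (addr < SRAM3_START) return 2;
    return 3;
}

/* Hypothetical helper: bank holding the first word pushed on a
 * full-descending stack that starts at initial_sp. */
static int stack_bank(uint32_t initial_sp)
{
    return sram_bank_of(initial_sp - 4u);
}
```

On the target, passing the address of a local variable (cast through uintptr_t) to sram_bank_of from both main() and the task would show directly which bank each stack uses.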

Sorry for the bad link.

https://community.st.com/s/question/0D50X00009XkedySAB/reproducing-loadstore-timings-claimed-by-arm-on-stm32f4

> SRAM1 is additionally connected to the core via D-BUS, while SRAM2 and SRAM3 are connected only via S-BUS.

That does not explain why (and under which circumstances) SRAM2/SRAM3 are *slower*.

SRAM1 is accessed through the D/I ports when mapped to the 0x00000000 area.

JW