2018-11-19 03:15 PM
Hello. I am using an STM32F4279ZI MCU with FreeRTOS. I want to compare the efficiency of bare-metal and OS-based implementations. While running some tests I noticed that the same function is much slower when called from the bare-metal application than when called from a FreeRTOS task, even though the assembler code is the same (I turned off all compiler optimization).
After more detailed tests I noticed that the LDR and STR instructions take 1 CPU cycle longer in the function called from main() than in the one called from a FreeRTOS task.
I measure execution time in CPU cycles with the DWT->CYCCNT register.
Do you know, or do you have any idea, what the reason for this difference is?
Here is my example code, with everything unnecessary removed:
void foo()
{
    uint32_t j = 0;
    uint32_t i;

    DWT->CYCCNT = 0;                    /* restart the cycle counter */
    i = 0;
    for (; i < 10000; i++)
    {
        asm("NOP");
        asm("NOP");
        asm("NOP");
        asm("NOP");
        asm("NOP");
        asm("NOP");
        asm("NOP");
        asm("NOP");
        asm("NOP");
    }
    j = DWT->CYCCNT;                    /* cycles spent in the loop */
    printf("%lu\r\n", (unsigned long)j);
}

void task(void* param)
{
    foo();                              /* same function, run on the task stack */
    for (;;)
    {
    }
}

int main(void)
{
    foo();                              /* run on the main stack */
    xTaskCreate(task, "task", 100, NULL, 1, NULL);
    vTaskStartScheduler();
}
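A note on the measurement itself: foo() zeroes DWT->CYCCNT before the loop. A common alternative, sketched here, is to leave the counter free-running, take a snapshot before and after, and subtract. Unsigned 32-bit subtraction then gives the right result even if the counter wraps once in between. The helper names cycles_elapsed and cycles_net and the overhead parameter are hypothetical and not part of the original post. On the target, both snapshots would be reads of DWT->CYCCNT, and the counter is assumed to have been enabled beforehand (TRCENA in CoreDebug->DEMCR, CYCCNTENA in DWT->CTRL).

```c
#include <stdint.h>

/* Elapsed cycles between two snapshots of a free-running 32-bit counter
 * such as DWT->CYCCNT. Unsigned subtraction is modulo 2^32, so a single
 * wrap between the two reads is handled correctly. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Same as above, minus a previously measured cost of the measurement
 * itself (hypothetical "overhead" value), clamped at zero. */
static uint32_t cycles_net(uint32_t start, uint32_t end, uint32_t overhead)
{
    uint32_t raw = cycles_elapsed(start, end);
    return (raw > overhead) ? (raw - overhead) : 0u;
}
```

The overhead could be estimated by taking two snapshots back to back with nothing in between, in the same context (main or task) as the code being measured.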
2018-11-21 04:18 AM
> Is there any documentation about it?
I know of none. I attempted some "benchmarking" back then too; it indicated that SRAM2 in the 'F407, compared to SRAM1, has a penalty of 1 cycle for the first access in a consecutive run (this might be related to how its bus arbiter locks/moves around arbitration; this is an intimate detail of the bus matrix implementation/setup which is very unlikely to be shared by ST). I guess the same may apply to SRAM3 in the 'F42x/43x.
https://community.st.com/s/question/0D50X00009hnJN6SAM/how-many-cookies-to-feed-st
https://community.st.com/s/question/0D50X00009XkedySAB/reproducing-loadstore-timings-claimed-by-arm-on-stm32f4
These chips are very, very complex, and other mechanisms affecting timing might be involved.
JW
2018-11-19 03:51 PM
Alignment of code, branch targets, data or stack?
Code already in ART cache?
2018-11-19 04:21 PM
Thank you for the reply. I'm going to check all your suggestions.
Sorry for the question: how can I set the code alignment?
2018-11-19 04:31 PM
I don't do inline assembler; I use .s files with Keil, where you can define the natural alignment of the section and use the ALIGN directive to assert it.
2018-11-19 10:08 PM
> I noticed that the LDR and STR instructions take 1 CPU cycle longer in the function called from main() than in the one called from a FreeRTOS task.
How does the above code demonstrate this?
JW
2018-11-20 01:24 AM
It doesn't; that is not shown in this example. To check it I used:
DWT->CYCCNT = 0;
asm("ldr r3, [r3, #4]");
printf("ldr %d\r\n", DWT->CYCCNT);
DWT->CYCCNT = 0;
asm("str r3, [r7, #0]");
printf("str %d\r\n", DWT->CYCCNT);
at the beginning of the foo function.
2018-11-20 03:47 AM
This test proves nothing, as you have no control over the contents of the registers used for addressing (r3 in the first instance and r7 in the second). The time needed to access memory depends on that memory, on the bus it sits on, and, for writes, also on the state of the processor's write buffer. And the execution itself depends on the state of the instruction pipeline, all the way from FLASH/RAM through the processor's interfaces and the processor's pipeline to the processor's execution unit.
This could also quite well have ended in a crash; C does not expect you to change register and memory contents behind its back.
JW
2018-11-20 04:21 AM
OK, thank you for the explanation. However, why does the foo function from my first post take 220524 CPU cycles when called from main, but 200704 CPU cycles when called from the task?
Also, when I measure the time of the 'i++' operation, it takes 10 CPU cycles vs 8 CPU cycles (faster in the RTOS).
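The two totals can be cross-checked against the per-operation figure with a little arithmetic. The sketch below divides the difference by the 10000 loop iterations of foo(); the function name is hypothetical, and integer hundredths of a cycle are used to avoid floating point.

```c
#include <stdint.h>

/* Extra cycles per loop iteration, in hundredths of a cycle.
 * slow/fast are the two CYCCNT totals, iterations is the loop count. */
static uint32_t extra_centicycles_per_iteration(uint32_t slow,
                                                uint32_t fast,
                                                uint32_t iterations)
{
    return ((slow - fast) * 100u) / iterations;
}
```

With the numbers above, (220524 - 200704) / 10000 comes to about 1.98 cycles per iteration, which agrees with the 2-cycle difference measured for 'i++' (10 vs 8). That would fit a fixed per-iteration penalty on accesses to the loop variable, though the totals alone do not prove where the cycles go.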
2018-11-20 03:13 PM
To tell the reason would require closer examination, but one explanation might be that the global stack is allocated in slower memory than the tasks' stacks.
JW
2018-11-21 01:39 AM
Is there any documentation about it? When I look at the variables' addresses in the 'normal' call (from main()), they are like:
2002ffd8
2002ffdc
2002ffd4
and the variables' addresses when foo() is called from the task:
20000370
20000374
2000036c
so you might be right.
Is there any way to allocate the global stack in the same memory region that FreeRTOS allocates from?
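To see which RAM block each of the addresses above falls into, they can be compared against the SRAM map. The sketch below assumes the STM32F42x/43x layout as I recall it from the reference manual (SRAM1 at 0x20000000, SRAM2 at 0x2001C000, SRAM3 at 0x20020000, CCM at 0x10000000); check the ranges against the manual for the exact part. The enum and function names are hypothetical.

```c
#include <stdint.h>

typedef enum {
    REGION_CCM,
    REGION_SRAM1,
    REGION_SRAM2,
    REGION_SRAM3,
    REGION_OTHER
} ram_region_t;

/* Map an address to a RAM block, assuming the STM32F42x/43x layout:
 * CCM 64 KB, SRAM1 112 KB, SRAM2 16 KB, SRAM3 64 KB. */
static ram_region_t classify_ram_address(uint32_t addr)
{
    if (addr >= 0x10000000u && addr <= 0x1000FFFFu) return REGION_CCM;
    if (addr >= 0x20000000u && addr <= 0x2001BFFFu) return REGION_SRAM1;
    if (addr >= 0x2001C000u && addr <= 0x2001FFFFu) return REGION_SRAM2;
    if (addr >= 0x20020000u && addr <= 0x2002FFFFu) return REGION_SRAM3;
    return REGION_OTHER;
}
```

Under that assumed map, 0x2002ffd8 (called from main) lands in SRAM3 near the top of RAM, while 0x20000370 (called from the task) lands in SRAM1, so the two runs really do use different RAM blocks for their stacks.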