2024-02-01 12:58 AM
I have essentially identical code running on two MCUs, and I found that the two MCUs take different amounts of time to execute memset() of the same length.
I measure the run time by toggling a GPIO pin before and after memset(). buffer is a global variable.
HAL_GPIO_WritePin(TEST_GPIO_Port, TEST_Pin, GPIO_PIN_RESET);
memset(buffer, 0, 256);
HAL_GPIO_WritePin(TEST_GPIO_Port, TEST_Pin, GPIO_PIN_SET);
The faster device takes 6 us, the slower one 36 us. (Each device shows the same value on every pass through the loop.)
I can confirm that the clock frequency and settings of the two MCUs are identical, and that no interrupts occur while memset() is running. The MCU is an STM32F746VGK. The library is libc_nano.a.
Any ideas? Is it related to memory alignment? Thanks.
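One way to test the alignment question is to look at how far the address of buffer is from a word boundary and from a 32-byte boundary (32 bytes being the Cortex-M7 cache line size). The sketch below is only an illustration: misalignment() is a hypothetical helper, not part of the HAL or of libc_nano.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a HAL or libc function): returns how many bytes
 * addr lies past the previous multiple of align. 0 means addr is aligned.
 * align must be a non-zero power of two. */
static size_t misalignment(const void *addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size_t)((uintptr_t)addr & (uintptr_t)(align - 1));
}
```

Calling misalignment(buffer, 4) and misalignment(buffer, 32) on both devices and comparing the results would show whether the two builds place buffer differently.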
2024-02-01 01:11 AM
Hello;
@fkst wrote:
I now have two copies of basically identical code running in both MCUs. It was found that the two MCUs took different times to execute memset() of the same length.
Do you mean with the same MCU part number (STM32F746) you get two different timings with two devices?
2024-02-01 01:13 AM - edited 2024-02-01 01:14 AM
Which optimizer setting do you use? And are the clock, wait states, and optimizer setting the same for both CPUs? (These change the "speed" of the CPU.)
Try -O2 as a good standard setting.
2024-02-01 01:22 AM
yes
2024-02-01 01:27 AM
Both devices are built with no optimization (-O0) and use the same clock. My code does not work at -O2 yet (it has problems at run time).
In addition, the running speed currently only differs in memset(), and other functions such as memcpy() have the same execution time.
2024-02-01 02:08 AM
Is it the same binary running on both devices, or different applications? Do both F7s have the same board design?
2024-02-01 02:46 AM
What is the address of buffer in each case?
Check MPU and cache settings.
Failure at different optimization levels suggests other latent coding issues.
Watch how local/auto variables are initialized. Clear them explicitly, as zero-initialization is not the default behaviour for them.
Watch cache coherency if using DMA. On F7 use DTCMRAM for DMA where possible.
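To make the address question concrete, here is a rough sketch that classifies a RAM address by region. The boundaries are my reading of the STM32F746 memory map (64 KB DTCM at 0x20000000, 240 KB SRAM1 at 0x20010000, 16 KB SRAM2 at 0x2004C000) and should be checked against the reference manual and the linker script. classify_ram_address() and the enum names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uint32_t) == 4, "addresses are treated as 32-bit values");

/* Hypothetical region names for illustration only. */
typedef enum {
    REGION_DTCM,   /* tightly coupled memory, not affected by the D-cache */
    REGION_SRAM1,  /* behind the AXI bus, cacheable when the D-cache is on */
    REGION_SRAM2,
    REGION_OTHER
} ram_region_t;

/* Assumed STM32F746 RAM layout; verify against the reference manual. */
static ram_region_t classify_ram_address(uint32_t addr)
{
    if (addr >= 0x20000000u && addr < 0x20010000u) return REGION_DTCM;
    if (addr >= 0x20010000u && addr < 0x2004C000u) return REGION_SRAM1;
    if (addr >= 0x2004C000u && addr < 0x20050000u) return REGION_SRAM2;
    return REGION_OTHER;
}
```

If buffer lands in DTCM in one build and in SRAM1 in the other, the MPU and cache settings become the first thing to compare.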
2024-02-01 02:52 AM
I see two hypotheses for the moment:
If different binaries (applications) but same board:
If different boards but the same binary:
2024-02-01 03:27 AM
@SofLit @Tesla DeLorean @AScha.3
I have now changed my test method: I run the memset() that comes with the official library and a memset() I found online on the same MCU and compare them. The latter executes in much less time than the former. This test rules out any difference between the two MCUs.
The function is below:
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

void * memset(void * base_ptr, int x, size_t length) {
    const uint32_t int_size = sizeof(uint32_t);
    static_assert(sizeof(uint32_t) == 4, "only supports 32 bit size");
    // start of the memory to set (pointers are 32 bits wide on this target)
    uint32_t ptr = (uint32_t) base_ptr;
    // get end of memory to set
    uint32_t end = ptr + length;
    // get location of first word-aligned address at/after the start, but not
    // after the end
    uint32_t mid1 = (ptr + int_size - 1) / int_size * int_size;
    if (mid1 > end) {
        mid1 = end;
    }
    // get location of last word-aligned address at/before the end, but not
    // before mid1; without this clamp, mid3 - mid1 below wraps around for
    // short unaligned ranges and the word loop runs past the end
    uint32_t mid3 = end / int_size * int_size;
    if (mid3 < mid1) {
        mid3 = mid1;
    }
    // get end location of optimized (4 words per iteration) section
    uint32_t mid2 = mid1 + (mid3 - mid1) / (4 * int_size) * (4 * int_size);
    // create a word-sized integer with the fill byte in every byte lane
    uint32_t value = 0;
    for (uint16_t i = 0; i < int_size; ++i) {
        value <<= 8;
        value |= (uint8_t) x;
    }
    __ASM volatile (
        // store leading bytes up to the first word boundary
        "b Compare1%=\n"
        "Store1%=:\n"
        "strb %[value], [%[ptr]], #1\n"
        "Compare1%=:\n"
        "cmp %[ptr], %[mid1]\n"
        "bcc Store1%=\n"
        // store words, four per iteration
        "b Compare2%=\n"
        "Store2%=:\n"
        "str %[value], [%[ptr]], #4\n"
        "str %[value], [%[ptr]], #4\n"
        "str %[value], [%[ptr]], #4\n"
        "str %[value], [%[ptr]], #4\n"
        "Compare2%=:\n"
        "cmp %[ptr], %[mid2]\n"
        "bcc Store2%=\n"
        // store remaining words one at a time
        "b Compare3%=\n"
        "Store3%=:\n"
        "str %[value], [%[ptr]], #4\n"
        "Compare3%=:\n"
        "cmp %[ptr], %[mid3]\n"
        "bcc Store3%=\n"
        // store trailing bytes
        "b Compare4%=\n"
        "Store4%=:\n"
        "strb %[value], [%[ptr]], #1\n"
        "Compare4%=:\n"
        "cmp %[ptr], %[end]\n"
        "bcc Store4%=\n"
        : [ptr] "+r"(ptr)   // read-write: the post-indexed stores advance ptr
        : [value] "r"(value),
          [mid1] "r"(mid1),
          [mid2] "r"(mid2),
          [mid3] "r"(mid3),
          [end] "r"(end)
        : "cc", "memory"    // cmp changes the flags; str/strb write memory
    );
    return base_ptr;
}
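To see how this function splits a range, the boundary arithmetic can be pulled out and run on a PC. The sketch below is only that arithmetic, with hypothetical names (memset_bounds_t, memset_bounds), plus one clamp on mid3 that I added as an assumption so that short unaligned ranges do not make mid3 - mid1 wrap around.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uint32_t) == 4, "only supports 32 bit size");

/* Hypothetical struct holding the four boundaries used by the loops. */
typedef struct {
    uint32_t mid1;  /* end of leading byte stores */
    uint32_t mid2;  /* end of the 4-words-per-iteration loop */
    uint32_t mid3;  /* end of the single-word loop */
    uint32_t end;   /* end of trailing byte stores */
} memset_bounds_t;

static memset_bounds_t memset_bounds(uint32_t ptr, uint32_t length)
{
    const uint32_t w = 4;  /* word size in bytes */
    memset_bounds_t b;
    b.end = ptr + length;
    b.mid1 = (ptr + w - 1) / w * w;      /* round start up to a word */
    if (b.mid1 > b.end) {
        b.mid1 = b.end;
    }
    b.mid3 = b.end / w * w;              /* round end down to a word */
    if (b.mid3 < b.mid1) {
        b.mid3 = b.mid1;                 /* added clamp (assumption) */
    }
    b.mid2 = b.mid1 + (b.mid3 - b.mid1) / (4 * w) * (4 * w);
    return b;
}
```

For a 256-byte buffer starting at the hypothetical address 0x20000001 this gives 3 leading bytes, 240 bytes in the unrolled loop, 12 bytes in the single-word loop and 1 trailing byte. For a word-aligned buffer all 256 bytes go through the unrolled loop.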
2024-02-01 03:29 AM
I would use the cycle counter, not rely on HAL GPIO settings.
Also, better turn off ISR execution.
/* CPU cycle count activation for debugging - STM32F767 */
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->LAR = 0xC5ACCE55;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
DWT->CTRL |= DWT_CTRL_PCSAMPLENA_Msk;
...
/* check function time with cycle counts */
__disable_irq();
u32CycFuncStart = DWT->CYCCNT;
FunctionUnderTest();
u32CycFuncStop = DWT->CYCCNT;
__enable_irq();
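To turn the two counter readings into a time, something like the sketch below could be used. cycles_elapsed() and cycles_to_ns() are hypothetical helpers, and the core clock is a parameter because the thread does not state it. The unsigned subtraction stays correct across a single wrap of the 32-bit counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: cycles between two DWT->CYCCNT readings.
 * Modulo-2^32 subtraction gives the right result even if the counter
 * wrapped once between start and stop. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t stop)
{
    return stop - start;
}

/* Hypothetical helper: convert a cycle count to nanoseconds for a given
 * core clock in Hz. 64-bit math avoids overflow in the multiplication. */
static uint64_t cycles_to_ns(uint32_t cycles, uint32_t core_hz)
{
    assert(core_hz != 0);
    return ((uint64_t)cycles * 1000000000u) / core_hz;
}
```

As an example, assuming a 216 MHz core clock, 1296 cycles corresponds to 6000 ns, which is the 6 us figure from the first post.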