The execution time of memset() is different in different MCUs
‎2024-02-01 12:58 AM
I have two nearly identical copies of the same code running on two MCUs. The two MCUs take different amounts of time to execute memset() with the same length.
I measure the run time by toggling a GPIO pin before and after memset(). buffer is a global variable.
HAL_GPIO_WritePin(TEST_GPIO_Port, TEST_Pin, GPIO_PIN_RESET);
memset(buffer, 0, 256);
HAL_GPIO_WritePin(TEST_GPIO_Port, TEST_Pin, GPIO_PIN_SET);
The faster MCU takes 6 us and the slower one takes 36 us. (The values are the same on every pass through the loop.)
I can confirm that the clock frequency and settings of the two MCUs are identical, and that no interrupts occur while memset() is running. The MCU is an STM32F746VGK. The library is libc_nano.a.
Any ideas? Is it related to memory alignment? Thanks.
Labels: STM32Cube MCU Packages
‎2024-02-01 01:11 AM
Hello;
@fkst wrote:
I now have two copies of basically identical code running in both MCUs. It was found that the two MCUs took different times to execute memset() of the same length.
Do you mean that two devices with the same MCU part number (STM32F746) give two different timings?
‎2024-02-01 01:13 AM - edited ‎2024-02-01 01:14 AM
Which optimizer setting do you use? Are the clock, wait states, and optimizer setting the same for both CPUs? (All of these change the "speed" of the CPU.)
Try -O2 as a good standard setting.
‎2024-02-01 01:22 AM
yes
‎2024-02-01 01:27 AM
Both devices are built with optimization None (-O0) and use the same clock. My code does not yet run correctly at -O2.
In addition, only memset() currently differs in speed; other functions such as memcpy() have the same execution time.
‎2024-02-01 02:08 AM
Is the same binary running on both devices, or are they different applications? Do both F7 devices have the same board design?
‎2024-02-01 02:46 AM
What is the address of buffer in each case?
Check MPU and cache settings.
Failure at different optimization levels suggests other latent coding issues.
Watch how local/auto variables are initialized. Clear them explicitly, as zero-initialization is not the default behaviour.
Watch cache coherency if using DMA. On F7 use DTCMRAM for DMA where possible.
Up vote any posts that you find helpful, it shows what's working..
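To act on the "address of buffer" question, it can help to map the address onto the RAM regions, since DTCM is tightly coupled to the core and bypasses the data cache, while SRAM1/SRAM2 sit behind the bus matrix and depend on the cache and MPU settings. The sketch below is only an illustration: the region boundaries are assumed from the STM32F74x memory map (64 KB DTCM, 240 KB SRAM1, 16 KB SRAM2) and should be checked against the reference manual, and the names classify_ram_address, ram_region_name and word_misalignment are hypothetical helpers, not HAL functions.

```c
#include <stdint.h>

/* Assumed STM32F74x RAM layout; verify against the reference manual. */
typedef enum {
    REGION_ITCM,   /* 16 KB ITCM RAM at 0x00000000 */
    REGION_DTCM,   /* 64 KB DTCM RAM at 0x20000000 (not cached) */
    REGION_SRAM1,  /* 240 KB SRAM1 at 0x20010000 */
    REGION_SRAM2,  /* 16 KB SRAM2 at 0x2004C000 */
    REGION_OTHER   /* flash, peripherals, external memory, ... */
} ram_region_t;

/* Hypothetical helper: classify an address by the assumed map above. */
ram_region_t classify_ram_address(uint32_t addr)
{
    if (addr < 0x00004000u) return REGION_ITCM;
    if (addr >= 0x20000000u && addr < 0x20010000u) return REGION_DTCM;
    if (addr >= 0x20010000u && addr < 0x2004C000u) return REGION_SRAM1;
    if (addr >= 0x2004C000u && addr < 0x20050000u) return REGION_SRAM2;
    return REGION_OTHER;
}

/* Hypothetical helper: printable name for a region. */
const char *ram_region_name(ram_region_t region)
{
    switch (region) {
    case REGION_ITCM:  return "ITCM";
    case REGION_DTCM:  return "DTCM";
    case REGION_SRAM1: return "SRAM1";
    case REGION_SRAM2: return "SRAM2";
    default:           return "other";
    }
}

/* Number of bytes by which the address misses 4-byte alignment (0..3). */
uint32_t word_misalignment(uint32_t addr)
{
    return addr & 3u;
}
```

On the target, one would pass (uint32_t)(uintptr_t)buffer from each build, or simply look the symbol up in the two .map files and compare region and alignment by hand.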
‎2024-02-01 02:52 AM
I see two hypotheses for the moment:
If different binaries (applications) but same board:
- Different linker files: check where the code executes from (see AN4667)
- Cache enabled/disabled or ART enabled/disabled in the two applications
- The system clock configuration is not the same
- Different code optimization: ruled out in the previous comment
If different boards but the same binary:
- Issue with the clock source. Try to output the system clock on one of the MCOx pins to check the system frequency on both devices. Is it the same?
‎2024-02-01 03:27 AM
@SofLit @Tesla DeLorean @AScha.3
I have now changed the test method: I run the memset() that comes with the official library and a memset() found online on the same MCU for comparison. The latter executes much faster than the former. This test rules out differences between the two MCUs.
The function is below:
#include <assert.h>  // static_assert
#include <stddef.h>  // size_t
#include <stdint.h>  // uint32_t, uint16_t, uint8_t
// __ASM is provided by the CMSIS compiler header (pulled in by the device header)
void * memset(void * base_ptr, int x, size_t length) {
    const uint32_t int_size = sizeof(uint32_t);
    static_assert(sizeof(uint32_t) == 4, "only supports 32 bit size");
    // find first word-aligned address
    uint32_t ptr = (uint32_t) base_ptr;
    // get end of memory to set
    uint32_t end = ptr + length;
    // get location of first word-aligned address at/after the start, but not
    // after the end
    uint32_t mid1 = (ptr + int_size - 1) / int_size * int_size;
    if (mid1 > end) {
        mid1 = end;
    }
    // get location of last word-aligned address at/before the end
    uint32_t mid3 = end / int_size * int_size;
    // get end location of optimized section
    uint32_t mid2 = mid1 + (mid3 - mid1) / (4 * int_size) * (4 * int_size);
    // create a word-sized integer
    uint32_t value = 0;
    for (uint16_t i = 0; i < int_size; ++i) {
        value <<= 8;
        value |= (uint8_t) x;
    }
    __ASM volatile (
        // store bytes up to the first word-aligned address
        "b Compare1%=\n"
        "Store1%=:\n"
        "strb %[value], [%[ptr]], #1\n"
        "Compare1%=:\n"
        "cmp %[ptr], %[mid1]\n"
        "bcc Store1%=\n"
        // store words optimized (four words per iteration)
        "b Compare2%=\n"
        "Store2%=:\n"
        "str %[value], [%[ptr]], #4\n"
        "str %[value], [%[ptr]], #4\n"
        "str %[value], [%[ptr]], #4\n"
        "str %[value], [%[ptr]], #4\n"
        "Compare2%=:\n"
        "cmp %[ptr], %[mid2]\n"
        "bcc Store2%=\n"
        // store remaining words
        "b Compare3%=\n"
        "Store3%=:\n"
        "str %[value], [%[ptr]], #4\n"
        "Compare3%=:\n"
        "cmp %[ptr], %[mid3]\n"
        "bcc Store3%=\n"
        // store trailing bytes
        "b Compare4%=\n"
        "Store4%=:\n"
        "strb %[value], [%[ptr]], #1\n"
        "Compare4%=:\n"
        "cmp %[ptr], %[end]\n"
        "bcc Store4%=\n"
        // ptr is advanced by the post-indexed stores, so it must be a
        // read-write operand rather than a plain input
        : [ptr] "+r"(ptr)
        : [value] "r"(value),
          [mid1] "r"(mid1),
          [mid2] "r"(mid2),
          [mid3] "r"(mid3),
          [end] "r"(end)
        // cmp changes the flags and the stores write memory
        : "cc", "memory"
    );
    return base_ptr;
}
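The routine above splits the range into a byte-wise head (up to the first word-aligned address), a word-wise body, and a byte-wise tail. The same split can be sketched in portable C, which makes the boundary arithmetic easy to check on a PC against the library memset(). This is only an illustration under assumptions: memset_words is a hypothetical name, and the sketch makes no claim about speed on the target. In particular, at -O0 the memcpy() used for the word store is likely to stay a real function call, so the body loop would not be fast there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable counterpart of the head/body/tail split. */
void *memset_words(void *base_ptr, int x, size_t length)
{
    assert(base_ptr != NULL || length == 0);

    unsigned char *p = base_ptr;
    unsigned char *end = p + length;
    const unsigned char byte = (unsigned char)x;

    /* Replicate the fill byte into all four bytes of a word. */
    const uint32_t value = 0x01010101u * byte;

    /* Head: single bytes until the pointer is 4-byte aligned. */
    while (p < end && ((uintptr_t)p % sizeof(uint32_t)) != 0) {
        *p++ = byte;
    }

    /* Body: one aligned word per iteration. memcpy avoids aliasing
     * problems; optimizing compilers usually reduce it to one store. */
    while ((size_t)(end - p) >= sizeof(uint32_t)) {
        memcpy(p, &value, sizeof value);
        p += sizeof(uint32_t);
    }

    /* Tail: remaining bytes after the last whole word. */
    while (p < end) {
        *p++ = byte;
    }
    return base_ptr;
}
```

Comparing its result with memset() for every start offset from 0 to 7 and a range of lengths covers the unaligned-start, unaligned-end and shorter-than-one-word cases.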
‎2024-02-01 03:29 AM
I would use the cycle counter rather than rely on HAL GPIO calls.
Also, it is better to disable interrupts during the measurement.
/* CPU cycle count activation for debugging - STM32F767 */
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->LAR = 0xC5ACCE55;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
DWT->CTRL |= DWT_CTRL_PCSAMPLENA_Msk;
...
/* check function time with cycle counts */
__disable_irq();
u32CycFuncStart = DWT->CYCCNT;
FunctionUnderTest();
u32CycFuncStop = DWT->CYCCNT;
__enable_irq();
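The two CYCCNT readings still have to be turned into a time. A minimal sketch follows, assuming a 216 MHz core clock (the maximum for the STM32F746; the thread does not state the actual frequency) and using the hypothetical helper names elapsed_cycles and cycles_to_us. Unsigned subtraction keeps the result correct across a single wrap of the 32-bit counter.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles between two CYCCNT samples; unsigned arithmetic makes this
 * correct even if the counter wrapped once between the samples. */
uint32_t elapsed_cycles(uint32_t start, uint32_t stop)
{
    return stop - start;
}

/* Convert a cycle count to microseconds for a given core clock in Hz.
 * core_hz is an assumption supplied by the caller, e.g. 216000000. */
double cycles_to_us(uint32_t cycles, uint32_t core_hz)
{
    assert(core_hz != 0u);
    return (double)cycles * 1e6 / (double)core_hz;
}
```

Under that assumed clock, the 6 us and 36 us readings from the first post would correspond to roughly 1296 and 7776 cycles for 256 bytes, that is, about 5 versus about 30 cycles per byte.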