The execution time of memset() is different in different MCUs

fkst
Associate II

I have two copies of essentially identical code running on two MCUs. The two MCUs take different times to execute a memset() of the same length.

I measure the run time by toggling an I/O pin before and after memset(). buffer is a global variable.

 

 

 

HAL_GPIO_WritePin(TEST_GPIO_Port, TEST_Pin, GPIO_PIN_RESET);
memset(buffer, 0, 256);
HAL_GPIO_WritePin(TEST_GPIO_Port, TEST_Pin, GPIO_PIN_SET);

 

 

[Image: fkst_0-1706777099863.png]

The lower one is 6 µs, the higher one is 36 µs. (The loop produces the same value on every pass.)

I can confirm that the clock frequency and settings of the two MCUs are the same, and that no interrupts occur while memset() is running. The MCU is an STM32F746VGK. The library is libc_nano.a.

Any ideas? Is it related to memory alignment? Thanks.

 

10 REPLIES
SofLit
ST Employee

Hello;


@fkst wrote:

I now have two copies of basically identical code running in both MCUs. It was found that the two MCUs took different times to execute memset() of the same length.


Do you mean with the same MCU part number (STM32F746) you get two different timings with two devices?

To give better visibility on the answered topics, please click on "Accept as Solution" on the reply which solved your issue or answered your question.
AScha.3
Principal III

What optimizer setting do you use? And are the clock, wait states, and optimizer setting the same for both CPUs? (These change the "speed" of the CPU.)

Try -O2 as a good standard setting.

If you feel a post has answered your question, please click "Accept as Solution".

yes

Optimization on both devices is None (-O0), and both use the same clock. My code does not work at -O2 yet (it has problems at run time).

In addition, only memset() currently differs in speed; other functions such as memcpy() have the same execution time.

Is the same binary running on both devices, or are they different applications? Do the two F7 boards have the same design?


What is the address of buffer in each case?

Check MPU and cache settings.

Failure at different optimization levels suggests other latent coding issues.

Watch how local/auto variables are initialized. Clear them explicitly, as they are not zeroed by default.

Watch cache coherency if using DMA. On F7 use DTCMRAM for DMA where possible.
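To make the address question concrete, here is a minimal sketch that sorts a RAM address into the STM32F74x regions. The ranges are my reading of the memory map in RM0385, so verify them for your exact part; the names classify_ram_address and ram_region_t are made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper names; the address ranges are assumptions taken
   from the STM32F74x memory map and should be checked against RM0385. */
typedef enum {
    REGION_DTCM,   /* 64 KB, tightly coupled, bypasses the data cache */
    REGION_SRAM1,  /* 240 KB, on the AXI bus, cacheable if D-cache is on */
    REGION_SRAM2,  /* 16 KB, on the AXI bus, cacheable if D-cache is on */
    REGION_OTHER   /* flash, peripherals, external memory, ... */
} ram_region_t;

ram_region_t classify_ram_address(uint32_t addr)
{
    if (addr >= 0x20000000u && addr <= 0x2000FFFFu) return REGION_DTCM;
    if (addr >= 0x20010000u && addr <= 0x2004BFFFu) return REGION_SRAM1;
    if (addr >= 0x2004C000u && addr <= 0x2004FFFFu) return REGION_SRAM2;
    return REGION_OTHER;
}

const char *ram_region_name(ram_region_t region)
{
    switch (region) {
    case REGION_DTCM:  return "DTCM";
    case REGION_SRAM1: return "SRAM1";
    case REGION_SRAM2: return "SRAM2";
    default:           return "other";
    }
}
```

On the target you could call classify_ram_address((uint32_t)(uintptr_t)buffer) in each build, or simply look up buffer in the two .map files. If one build places it in DTCM and the other in SRAM1 with the data cache off, a timing difference would not be surprising.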

Tips, buy me a coffee, or three.. PayPal Venmo Up vote any posts that you find helpful, it shows what's working..
SofLit
ST Employee

I see two hypotheses for the moment:

If different binaries (applications) but same board: 

  • Different linker files: check where you're executing from (see AN4667)
  • Cache enabled/disabled or ART enabled/disabled in the two applications
  • You don't have the same system clock config
  • Different code optimization: ruled out in the previous comment

If different boards but the same binary: 

  • Issue with the clock source. Try to output the system clock on one of the MCOx pins to check the system frequency on both devices. Is it the same?
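For the cache hypothesis, one way to compare the two applications is to dump SCB->CCR in each (address 0xE000ED14 in the debugger) and decode the enable bits. The sketch below only covers the two Cortex-M7 cache bits, not ART, and the helper names are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cache enable bits of the Cortex-M7 SCB->CCR register. */
#define CCR_DC_BIT (1u << 16)  /* data cache enable */
#define CCR_IC_BIT (1u << 17)  /* instruction cache enable */

typedef struct {
    bool dcache_on;
    bool icache_on;
} cache_state_t;

/* Decode a raw CCR value that was read on the target. */
cache_state_t decode_ccr(uint32_t ccr)
{
    cache_state_t state;
    state.dcache_on = (ccr & CCR_DC_BIT) != 0;
    state.icache_on = (ccr & CCR_IC_BIT) != 0;
    return state;
}

/* True when two CCR dumps agree on both cache enable bits. */
bool same_cache_config(uint32_t ccr_a, uint32_t ccr_b)
{
    const uint32_t mask = CCR_DC_BIT | CCR_IC_BIT;
    return (ccr_a & mask) == (ccr_b & mask);
}
```

If same_cache_config() returns false for the two dumps, the cache setup differs between the builds and is worth aligning before looking further.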
fkst
Associate II

@SofLit @Tesla DeLorean @AScha.3 

I have now changed the test method: I run the memset() that comes with the official library and a memset() found online on the same MCU for comparison. The latter executes much faster than the former. This test rules out differences between the two MCUs.

The function is below:

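#include <assert.h>  // static_assert
#include <stddef.h>  // size_t
#include <stdint.h>  // uint32_t, uint16_t, uint8_t
// __ASM comes from the CMSIS compiler header pulled in by the device header

void * memset(void * base_ptr, int x, size_t length) {
  const uint32_t int_size = sizeof(uint32_t);
  static_assert(sizeof(uint32_t) == 4, "only supports 32 bit size");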
  // find first word-aligned address
  uint32_t ptr = (uint32_t) base_ptr;
  // get end of memory to set
  uint32_t end = ptr + length;
  // get location of first word-aligned address at/after the start, but not
  // after the end
  uint32_t mid1 = (ptr + int_size - 1) / int_size * int_size;
  if (mid1 > end) {
    mid1 = end;
  }
  // get location of last word-aligned address at/before the end
  uint32_t mid3 = end / int_size * int_size;
  // get end location of optimized section
  uint32_t mid2 = mid1 + (mid3 - mid1) / (4 * int_size) * (4 * int_size);
  // create a word-sized integer
  uint32_t value = 0;
  for (uint16_t i = 0; i < int_size; ++i) {
    value <<= 8;
    value |= (uint8_t) x;
  }
  __ASM volatile (
  // store bytes
  "b Compare1%=\n"
  "Store1%=:\n"
  "strb %[value], [%[ptr]], #1\n"
  "Compare1%=:\n"
  "cmp %[ptr], %[mid1]\n"
  "bcc Store1%=\n"
  // store words optimized
  "b Compare2%=\n"
  "Store2%=:\n"
  "str %[value], [%[ptr]], #4\n"
  "str %[value], [%[ptr]], #4\n"
  "str %[value], [%[ptr]], #4\n"
  "str %[value], [%[ptr]], #4\n"
  "Compare2%=:\n"
  "cmp %[ptr], %[mid2]\n"
  "bcc Store2%=\n"
  // store words
  "b Compare3%=\n"
  "Store3%=:\n"
  "str %[value], [%[ptr]], #4\n"
  "Compare3%=:\n"
  "cmp %[ptr], %[mid3]\n"
  "bcc Store3%=\n"
  // store bytes
  "b Compare4%=\n"
  "Store4%=:\n"
  "strb %[value], [%[ptr]], #1\n"
  "Compare4%=:\n"
  "cmp %[ptr], %[end]\n"
  "bcc Store4%=\n"
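  // ptr is advanced by the post-indexed stores, so it must be a
  // read-write operand rather than a plain input
  : [ptr] "+r"(ptr)
  : [value] "r"(value),
  [mid1] "r"(mid1),
  [mid2] "r"(mid2),
  [mid3] "r"(mid3),
  [end] "r"(end)
  // cmp changes the flags; the stores write memory the compiler cannot see
  : "cc", "memory"
  );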
  return base_ptr;
}
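For comparison on a PC or with another compiler, the same head/body/tail idea can be sketched in portable C. This is not the library's implementation and not a drop-in replacement: the name memset_words is made up, and the loop is not unrolled like the assembly version.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. It is deliberately not called memset: an optimizing
   compiler may turn these loops back into a memset() call, which would
   recurse if this function were memset itself. */
void *memset_words(void *dst, int c, size_t len)
{
    unsigned char *p = (unsigned char *)dst;
    const unsigned char byte = (unsigned char)c;

    /* Head: byte stores until p is 4-byte aligned or the buffer ends. */
    while (len > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = byte;
        --len;
    }

    /* Body: one aligned 32-bit store per four bytes.
       Assumption: storing through uint32_t* into this buffer is acceptable
       for your toolchain (strictly speaking it can violate aliasing rules). */
    const uint32_t word = (uint32_t)byte * 0x01010101u;
    uint32_t *w = (uint32_t *)(void *)p;
    for (; len >= 4; len -= 4) {
        *w++ = word;
    }

    /* Tail: remaining 0 to 3 bytes. */
    p = (unsigned char *)w;
    while (len > 0) {
        *p++ = byte;
        --len;
    }
    return dst;
}
```

A byte-only loop needs 256 stores for the 256-byte buffer, while the word loop needs 64 when the buffer is aligned, which is one plausible reason a byte-wise library routine would come out slower.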

 

LCE
Principal

I would use the cycle counter rather than rely on HAL GPIO calls.

Also, it is better to disable interrupts during the measurement.

/* CPU cycle count activation for debugging - STM32F767 */
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->LAR = 0xC5ACCE55;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
DWT->CTRL |= DWT_CTRL_PCSAMPLENA_Msk;

...

/* check function time with cycle counts */
__disable_irq();
u32CycFuncStart = DWT->CYCCNT;
FunctionUnderTest();
u32CycFuncStop = DWT->CYCCNT;
__enable_irq();
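To turn the two samples into a duration, a small sketch like the following could be used. The helper names are hypothetical, and the 216 MHz mentioned in the note is only an example value, since the thread does not state the actual core clock.

```c
#include <stdint.h>

/* Elapsed cycles between two CYCCNT samples. Unsigned subtraction is
   modulo 2^32, so a single counter wrap between the samples is handled. */
uint32_t cycles_elapsed(uint32_t start, uint32_t stop)
{
    return stop - start;
}

/* Convert a cycle count to nanoseconds. core_hz must be non-zero. */
uint64_t cycles_to_ns(uint32_t cycles, uint32_t core_hz)
{
    return ((uint64_t)cycles * 1000000000u) / core_hz;
}
```

On the target this would be cycles_to_ns(cycles_elapsed(u32CycFuncStart, u32CycFuncStop), SystemCoreClock). As a sanity check, at an assumed 216 MHz a 6 µs memset() corresponds to about 1296 cycles.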