LCE
Principal II
November 8, 2021
Solved

STM32F7 @ 192MHz: long data copy time?

Hello again,

I recently measured how long the STM32F7 (192 MHz) needs to copy one 32-bit buffer to another, and I'm a little shocked at how long it takes.

(Maybe I expect too much. Coming from 8-bit µCs at 20 MHz, an STM32 still feels like a racing car compared to a horse carriage.)

For time measurement the interrupts are disabled and the PTP registers are used.

It takes about 38 µs to copy a uint32_t buffer[256] to another one, which is about 28 cycles per element at 192 MHz. Okay, there's the for-loop comparison and increment, but still, 28 cycles?

Am I overlooking anything, or is that just how it is?
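As a sanity check on that figure, here is a minimal sketch of the arithmetic. The helper name cycles_per_element is made up for illustration and is not part of the original code; it only converts a measured time into average cycles per copied element.

```c
#include <stdint.h>

/* Hypothetical helper: average CPU cycles spent per copied element.
 * elapsed_ns: measured copy time in nanoseconds
 * cpu_mhz:    core clock in MHz (192 in this thread)
 * elements:   number of elements copied (256 in this thread) */
static uint32_t cycles_per_element(uint32_t elapsed_ns, uint32_t cpu_mhz,
                                   uint32_t elements)
{
    /* ns * MHz / 1000 = cycles; the 64-bit intermediate avoids overflow */
    uint64_t total_cycles = ((uint64_t)elapsed_ns * cpu_mhz) / 1000u;

    /* integer division truncates, so 28.5 is reported as 28 */
    return (uint32_t)(total_cycles / elements);
}
```

With 38000 ns, 192 MHz and 256 elements this gives 7296 cycles in total, i.e. 28 cycles per element after truncation.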

I know it can be a long way from C to assembler, and I haven't checked that yet (never even looked at the STM32 / ARM assembler stuff), but the list excerpt below shows some 17 CPU actions for the for-loop - right?

Here's the uart/debug output:

*.c:

#define RPM_DMA_BUF_MAX		256
 
uint32_t u32RpmDmaBuf[2][RPM_DMA_BUF_MAX] = { { 0 } };
 
uint32_t u32RpmUartBuf[RPM_DMA_BUF_MAX];
uint32_t *pu32SrcBuf;
 
uint32_t u32StopNanoSec = 0;
uint32_t u32StartNanoSec = 0;
 
/* copy buffer with snapshot of timestamp before and after */ 
__disable_irq();
u32Val = ETH->PTPTSLR;
 
for( i16 = 0; i16 < RPM_DMA_BUF_MAX; i16++ )
{
 u32RpmUartBuf[i16] = *(pu32SrcBuf++);
}
u32Val2 = ETH->PTPTSLR;
__enable_irq();
 
u32StartNanoSec = ETH_PTPSubSecond2NanoSecond(u32Val);
u32StopNanoSec = ETH_PTPSubSecond2NanoSecond(u32Val2);
 
u32RpmDebugTime = 0;
if( u32StopNanoSec > u32StartNanoSec ) u32RpmDebugTime = u32StopNanoSec - u32StartNanoSec;
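/* the values are unsigned 32-bit, so print them with %lu rather than %ld */
uart_printf("u32StopNanoSec = %lu\n\r", (unsigned long)u32StopNanoSec);
uart_printf("u32StartNanoSec = %lu\n\r", (unsigned long)u32StartNanoSec);
uart_printf("u32buffer[%d] copy time: %lu ns\n\r", RPM_DMA_BUF_MAX, (unsigned long)u32RpmDebugTime);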

*.list

__disable_irq();
u32Val = ETH->PTPTSLR;
 8006b42:	4bad 	ldr	r3, [pc, #692]	; (8006df8 <UART3_RxCmndProcessing+0x17d8>)
 8006b44:	f8d3 370c 	ldr.w	r3, [r3, #1804]	; 0x70c
 8006b48:	f8c7 3498 	str.w	r3, [r7, #1176]	; 0x498
						for( i16 = 0; i16 < RPM_DMA_BUF_MAX; i16++ )
 8006b4c:	2300 	movs	r3, #0
 8006b4e:	f8a7 3504 	strh.w	r3, [r7, #1284]	; 0x504
 8006b52:	e010 	b.n	8006b76 <UART3_RxCmndProcessing+0x1556>
						{
							//u32RpmUartBuf[i16] = u32RpmDmaBuf[u8RpmBufPtr][i16];
							//u32RpmUartBuf[i16] = *(pu32SrcBuf++);
							*(pu32DstBuf++) = *(pu32SrcBuf++);
 8006b54:	f8d7 24ec 	ldr.w	r2, [r7, #1260]	; 0x4ec
 8006b58:	1d13 	adds	r3, r2, #4
 8006b5a:	f8c7 34ec 	str.w	r3, [r7, #1260]	; 0x4ec
 8006b5e:	f8d7 34e8 	ldr.w	r3, [r7, #1256]	; 0x4e8
 8006b62:	1d19 	adds	r1, r3, #4
 8006b64:	f8c7 14e8 	str.w	r1, [r7, #1256]	; 0x4e8
 8006b68:	6812 	ldr	r2, [r2, #0]
 8006b6a:	601a 	str	r2, [r3, #0]
						for( i16 = 0; i16 < RPM_DMA_BUF_MAX; i16++ )
 8006b6c:	f8b7 3504 	ldrh.w	r3, [r7, #1284]	; 0x504
 8006b70:	3301 	adds	r3, #1
 8006b72:	f8a7 3504 	strh.w	r3, [r7, #1284]	; 0x504
 8006b76:	f8b7 3504 	ldrh.w	r3, [r7, #1284]	; 0x504
 8006b7a:	2bff 	cmp	r3, #255	; 0xff
 8006b7c:	d9ea 	bls.n	8006b54 <UART3_RxCmndProcessing+0x1534>
						}
u32Val2 = ETH->PTPTSLR;
 8006b7e:	4b9e 	ldr	r3, [pc, #632]	; (8006df8 <UART3_RxCmndProcessing+0x17d8>)
 8006b80:	f8d3 370c 	ldr.w	r3, [r3, #1804]	; 0x70c
 8006b84:	f8c7 3494 	str.w	r3, [r7, #1172]	; 0x494
 __ASM volatile ("cpsie i" : : : "memory");
 8006b88:	b662 	cpsie	i
}
 8006b8a:	bf00 	nop
__enable_irq();

6 replies

KnarfB (Best answer)
Super User
November 8, 2021

What are your compiler & compiler settings/options? A Debug configuration does not generate fast code. Did you try a Release config?

hth

KnarfB

LCE (Author)
Principal II
November 8, 2021

Thanks for the reply, and sorry for the missing info!

STM32CubeIDE 1.6 - and yes, debug config.

I'll try the release...

LCE (Author)
Principal II
November 8, 2021

Holy Smokes, THAT helped!

Thanks a lot, KnarfB!

Copy time has gone down from 38 µs (debug) to 2.8 µs (release)!

That seems almost too fast, at about 2 cycles per copied word...

One more question concerning the *.list file:

Is there a setting where showing the C source in the list file can be turned on?

I've been through many project settings in the IDE, but haven't found anything yet.

Bassett.David
Associate III
November 8, 2021

If using Keil, look for a .txt file - much more useful for analyzing the generated code!

Regards,

Dave

Tesla DeLorean
Guru
November 8, 2021

LCE (Author)
Principal II
November 8, 2021

Thanks for the info! Lots to learn for me about ARMs.

LCE (Author)
Principal II
November 18, 2021

Hello again,

After working on other things and not checking the data copy time for a while, I'm back at this point again, and guess what: the copy time is much higher again!

I have NOT changed the code above, which measures the time and copies the buffers, and I have not switched back to debug mode, but now I get a copy time of about 33 µs. It was about 3 µs right after switching from the debug to the release configuration (still using STM32CubeIDE).

Any ideas, please?

Maybe @KnarfB or @Community member has an idea about that?

KnarfB
Super User
November 18, 2021

Hi,

If you have more lines in your code, the compiler may optimize by moving the memory load/store instructions around or even changing their order. This does not affect the C semantics of your code, but it does affect the timing. E.g., the compiler tries to load as early as possible to hide memory latencies.

You can tell the compiler not to move around the code by placing compile-time memory barriers. Shown here for gcc and the Cortex core cycle counter:

asm volatile ("" : : : "memory");		// see https://blog.regehr.org/archives/28
volatile uint32_t tick = DWT->CYCCNT;
asm volatile ("" : : : "memory");
 
// code of interest
 
asm volatile ("" : : : "memory");
volatile uint32_t tock = DWT->CYCCNT;
asm volatile ("" : : : "memory");

Strictly speaking, the volatile is not necessary to suppress the reordering, but it keeps the compiler from optimizing away seemingly unused variables.
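The pattern above could be wrapped in two small helpers, roughly as sketched below. This is only an illustration under a few assumptions: the names COMPILER_BARRIER, read_counter and elapsed_ticks are invented here, the syntax is GCC/Clang specific, and the cycle counter is assumed to be enabled already. Passing the counter as a pointer is also a choice made for this sketch, so that the helper can be tried off-target; on the STM32 one would pass &DWT->CYCCNT.

```c
#include <stdint.h>

/* Compiler-only barrier (GCC/Clang syntax): emits no instruction, but the
 * compiler may not move memory accesses across it. */
#define COMPILER_BARRIER() __asm__ volatile ("" : : : "memory")

/* Hypothetical helper: read a free-running 32-bit counter, fenced on both
 * sides so the read stays where it is written in the source. */
static inline uint32_t read_counter(const volatile uint32_t *counter)
{
    COMPILER_BARRIER();
    uint32_t value = *counter;
    COMPILER_BARRIER();
    return value;
}

/* Hypothetical helper: ticks between two readings. Unsigned subtraction is
 * modulo 2^32, so the result stays correct across one counter wraparound. */
static inline uint32_t elapsed_ticks(uint32_t tick, uint32_t tock)
{
    return tock - tick;
}
```

Usage would be tick = read_counter(&DWT->CYCCNT) before the code of interest, tock = read_counter(&DWT->CYCCNT) after it, and elapsed_ticks(tick, tock) for the difference.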

If possible, check the assembler code, e.g. by stepping through the loop at instruction level.

One last remark: when you have several pointers around, consider declaring them as restrict pointers (https://en.wikipedia.org/wiki/Restrict) to guide optimization.
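For the buffer copy in this thread, that could look roughly like the sketch below. The function name copy_words is made up for illustration, and restrict requires C99 or later. It is a promise by the caller that the two buffers do not overlap; if they do, the behavior is undefined.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy 'count' 32-bit words from src to dst.
 * 'restrict' tells the compiler that dst and src never alias, so it is
 * free to batch or reorder the loads and stores. */
static void copy_words(uint32_t *restrict dst, const uint32_t *restrict src,
                       size_t count)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = src[i];
    }
}
```

Whether this actually produces faster code depends on the compiler and the optimization level, so checking the generated assembler remains worthwhile.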

hth

KnarfB

LCE (Author)
Principal II
November 18, 2021

Thanks KnarfB!

Interesting effects: at first the "asm volatile..." lines didn't do anything, but now, after a few more cleans and builds, the timing is back where it should be.

So I am not so sure if that was "compiler luck" or the lines you mentioned.

Anyway, thanks for the help - and the link about volatile.

Coming from the hardware side, "volatile" is something I somehow always tried to avoid and thus never really understood, just remembering that some volatile declaration troubled me a long time ago when working with some 8-bit controllers / compilers...