Cycles or duration of float Multiplication on STM32F4

phkloth
Associate II

Hi all,

I have a question concerning the duration of float multiplication on the STM32F4. As far as I know, the chip has an FPU.

I run the following code snippet

float LagCompensator_process(LagCompensator_instance* S, float input)
{
 
	S->x2 = input;
 
	LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
 
	S->y1 = S->p1 *	(-S->p2 * S->y1 +
					  S->p3 * S->x2 +
					  S->p4 * S->x1
					);
	LL_GPIO_ResetOutputPin(GPIOC, LL_GPIO_PIN_8);
	S->x1 = S->x2;
 
	return S->y1;
 
}

The duration between the pin set and reset is about 170ns. The processor runs at 180MHz.

I find 170ns for 4 multiplications pretty long, even though they are on float variables. Can anybody comment on this? Do I have to activate the FPU somewhere in the project?

In the project properties, you can find

[Image: 0693W00000QNn4MQAT.png]

Thanks in advance for your help!

LCE
Principal

It looks like the FPU is active.

170 ns is not too bad, that's about 30 clock cycles at 180 MHz.

There are not only the 4 multiplications, also some adding, and getting data to the GPIOs also takes some cycles.
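For reference, the conversion between the measured time and clock cycles can be sketched as below. This is only a host-side illustration of the arithmetic; the function names `ns_to_cycles` and `cycles_to_ns` are made up for this example and are not part of any ST library.

```c
#include <assert.h>
#include <stdint.h>

/* Whole CPU cycles that fit into a duration given in nanoseconds.
   cycles = ns * Hz / 1e9; the 64-bit intermediate avoids overflow. */
static uint32_t ns_to_cycles(uint32_t duration_ns, uint32_t cpu_hz)
{
    uint64_t scaled = (uint64_t)duration_ns * cpu_hz;
    return (uint32_t)(scaled / 1000000000u);
}

/* Duration of a cycle count in nanoseconds, rounded to the nearest ns. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz > 0u); /* cpu_hz is the divisor below */
    uint64_t scaled = (uint64_t)cycles * 1000000000u;
    return (uint32_t)((scaled + cpu_hz / 2u) / cpu_hz);
}
```

With a 180 MHz clock, `ns_to_cycles(170, 180000000)` gives 30 whole cycles (the exact value is 30.6), which matches the "about 30 clock cycles" above.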

LCE
Principal

If you need to know the exact amount of clock cycles, use the cycle counter.

That's from my F7 code (including reading from the PTP registers, you should skip that...):

/* CPU cycle count activation for debugging */
#if DEBUG_CPU_TIMING
	CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
	DWT->LAR = 0xC5ACCE55;
	DWT->CYCCNT = 0;
	DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
	DWT->CTRL |= DWT_CTRL_PCSAMPLENA_Msk;
#endif
 
/* debug timing cpu-cycle check */
/* START */
#if DEBUG_CPU_TIMING
	uint32_t u32StartNanoSec = ETH->PTPTSLR;
 
	asm volatile ("" : : : "memory");
	volatile uint32_t u32DwtCycCnt1 = DWT->CYCCNT;
	asm volatile ("" : : : "memory");
#endif
 
...process you want to check...
 
/* debug timing cpu-cycle check */
/* STOP */
#if DEBUG_CPU_TIMING
	asm volatile ("" : : : "memory");
	volatile uint32_t u32DwtCycCnt2 = DWT->CYCCNT;
	asm volatile ("" : : : "memory");
 
	uint32_t u32StopNanoSec = ETH->PTPTSLR;
#endif
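Once the two counter snapshots are taken, evaluating them is plain unsigned arithmetic. The following is a minimal sketch that runs on a PC; on the target, `start` and `stop` would be the two `DWT->CYCCNT` readings from the code above. The helper names and the idea of subtracting a separately measured overhead are assumptions for illustration, not part of the original routine.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles elapsed between two readings of a free-running 32-bit counter.
   Unsigned subtraction is modulo 2^32, so a single counter wrap between
   the two readings still yields the correct difference. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t stop)
{
    return (uint32_t)(stop - start);
}

/* Remove the fixed cost of the measurement itself (found by timing an
   empty region); clamps at zero instead of underflowing. */
static uint32_t net_cycles(uint32_t start, uint32_t stop, uint32_t overhead)
{
    uint32_t raw = elapsed_cycles(start, stop);
    return (raw > overhead) ? (raw - overhead) : 0u;
}

/* Average cost per call when a loop of `iterations` calls was timed. */
static uint32_t cycles_per_call(uint32_t total_cycles, uint32_t iterations)
{
    assert(iterations > 0u); /* divisor must be non-zero */
    return total_cycles / iterations;
}
```

Timing many calls in a loop and averaging tends to give a steadier number than a single measurement, at the price of also measuring the loop overhead.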

phkloth
Associate II

Hey LCE,

thank you very much for the comment and your help. The counter for cycles looks interesting.

However, I am quite a beginner in MCU programming. I am happy that I can master the LowLevel instruction set of the HAL library. Can you give me some advice on how to implement and use your suggested routine?

BTW: The GPIO Set and Reset take about 6ns.

ARM is a typical RISC, i.e. a load/store architecture. In other words, every variable has to be loaded from memory, then the arithmetic is performed, then the result is stored back into memory. Load/store may take up a significant portion of the process.

Also, fetching instructions takes time.

Show the disasm of the operation in question.

JW

phkloth
Associate II

Hey JW,

you are right. It is not the mathematical operations that take time. It is the loading and saving of data.

float LagCompensator_process(LagCompensator_instance* S, float input)
{
 
	S->x2 = input;
 
	float a=4.3653;
	float b=433.5356;
	float c=0;
 
	LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
	S->y1=3;
	LL_GPIO_ResetOutputPin(GPIOC, LL_GPIO_PIN_8);
	
	LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
	c=a*b;
	LL_GPIO_ResetOutputPin(GPIOC, LL_GPIO_PIN_8);
 
	S->y1 = S->p1 *	(-S->p2 * S->y1 +
					  S->p3 * S->x2 +
					  S->p4 * S->x1
					);
	S->x1 = S->x2;
 
	return S->y1;
 
}

the first operation (saving 3 in s->y1) takes about 40ns. The multiplication of a and b takes only 6ns.

So the problem here is the thing with the pointers. However, saving the variables globally and static would really blow up the project. And also it would make the readability of the code much worse.

Is there another tip that I don't consider here in order to improve the runtime?

> the first operation (saving 3 in s->y1) takes about 40ns.

If you see 40ns there, something may be wrong. The store itself should be hidden by buffering at the processor/busmatrix interface.

As I've said, post disasm of the original code as basis for further discussion.

JW

phkloth
Associate II

hey, sorry. Here is the assembler code of this function

float LagCompensator_process(LagCompensator_instance* S, float input)
{
 
	S->x2 = input;
 8002b4c:	ed80 0a02 	vstr	s0, [r0, #8]
 8002b50:	4b13      	ldr	r3, [pc, #76]	; (8002ba0 <LagCompensator_process+0x54>)
 8002b52:	f44f 7280 	mov.w	r2, #256	; 0x100
 8002b56:	619a      	str	r2, [r3, #24]
	float a=4.3653;
	float b=433.5356;
	float c=0;
 
	LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
	S->y1=3;
 8002b58:	4a12      	ldr	r2, [pc, #72]	; (8002ba4 <LagCompensator_process+0x58>)
 8002b5a:	6002      	str	r2, [r0, #0]
  WRITE_REG(GPIOx->BSRR, (PinMask << 16));
 8002b5c:	f04f 7280 	mov.w	r2, #16777216	; 0x1000000
 8002b60:	619a      	str	r2, [r3, #24]
	LL_GPIO_ResetOutputPin(GPIOC, LL_GPIO_PIN_8);
 
	S->y1 = S->p1 *	(-S->p2 * S->y1 +
 8002b62:	edd0 7a03 	vldr	s15, [r0, #12]
 8002b66:	ed90 0a04 	vldr	s0, [r0, #16]
 8002b6a:	ed90 7a00 	vldr	s14, [r0]
 8002b6e:	ee20 0a47 	vnmul.f32	s0, s0, s14
					  S->p3 * S->x2 +
 8002b72:	ed90 7a05 	vldr	s14, [r0, #20]
 8002b76:	edd0 6a02 	vldr	s13, [r0, #8]
 8002b7a:	ee27 7a26 	vmul.f32	s14, s14, s13
	S->y1 = S->p1 *	(-S->p2 * S->y1 +
 8002b7e:	ee30 0a07 	vadd.f32	s0, s0, s14
					  S->p4 * S->x1
 8002b82:	ed90 7a06 	vldr	s14, [r0, #24]
 8002b86:	ed90 6a01 	vldr	s12, [r0, #4]
 8002b8a:	ee27 7a06 	vmul.f32	s14, s14, s12
					  S->p3 * S->x2 +
 8002b8e:	ee30 0a07 	vadd.f32	s0, s0, s14
	S->y1 = S->p1 *	(-S->p2 * S->y1 +
 8002b92:	ee27 0a80 	vmul.f32	s0, s15, s0
 8002b96:	ed80 0a00 	vstr	s0, [r0]
					);
	S->x1 = S->x2;
 8002b9a:	edc0 6a01 	vstr	s13, [r0, #4]
 
	return S->y1;
 
}
 8002b9e:	4770      	bx	lr
 8002ba0:	40020800 	.word	0x40020800
 8002ba4:	40400000 	.word	0x40400000

There's no problem with pointers here. S is a parameter, so it is in r0 at function entry; note that the very first instruction uses it straightforwardly.

If S points to RAM instead of CCMRAM, there's a penalty: reading from RAM, like reading from any resource through the S-port of the processor, automatically involves one extra waitstate. Try moving the target struct to CCMRAM.

> > the first operation (saving 3 in s->y1) takes about 40ns.

> 8002b58: 4a12 ldr r2, [pc, #72] ; (8002ba4 <LagCompensator_process+0x58>)

This is a read from the data bus of the FLASH. If this value hasn't been cached (see ART), and you run at 180MHz, thus probably with 5 FLASH waitstates, the load alone takes 6 cycles, plus one more cycle for executing the ldr, so that's 7 cycles = 39ns.

Btw. note that the compiler did not care to use the float registers for this operation - and 0x4040'0000 is indeed 3.0f.
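The bit pattern can be checked on any machine with IEEE 754 single-precision floats. A small sketch follows; the helper names are made up for this example. `memcpy` is used because it is the portable way in C to reinterpret the bytes of a float.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Raw 32-bit pattern of a float. */
static uint32_t float_bits(float value)
{
    uint32_t bits;
    assert(sizeof bits == sizeof value); /* float must be 32 bits wide */
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Float value represented by a raw 32-bit pattern. */
static float bits_to_float(uint32_t bits)
{
    float value;
    assert(sizeof value == sizeof bits);
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Here 0x40400000 decodes as sign 0, biased exponent 0x80 (that is 2^1) and mantissa 1.5, so 1.5 * 2 = 3.0, the constant the compiler placed in the literal pool at 8002ba4.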

IMO the above sequence is as optimal as it gets. 170ns is around 30 cycles. There are 4 muls and 2 adds, plus 7 loads and 2 stores. Loads are naturally 2 cycles; with the additional waitstate for loading through the S-port it's 3 cycles per load, so that's 4+2+3*7+2=29 cycles, plus the odd extra cycle because FLASH with its 6-cycle latency can't supply a sustained stream of 32-bit instructions.
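The same back-of-the-envelope budget, written out as a sketch. It follows the accounting used in this reply (one cycle per mul, add and store, a configurable cost per load) and is not a model of the real pipeline; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rough cycle estimate: 1 cycle per multiply, add and store,
   `cycles_per_load` cycles per load (2 natural + waitstates). */
static uint32_t estimate_cycles(uint32_t muls, uint32_t adds,
                                uint32_t loads, uint32_t stores,
                                uint32_t cycles_per_load)
{
    assert(cycles_per_load >= 1u); /* a load cannot be free */
    return muls + adds + loads * cycles_per_load + stores;
}
```

`estimate_cycles(4, 2, 7, 2, 3)` reproduces the 29 cycles above. With 2-cycle loads, as the CCMRAM suggestion is hoping for, the same count comes to 22, which is roughly the gain one could expect at best from that change.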

As I've said, you can try to move the data struct to CCMRAM. You can also try to run the code from RAM, but you may be surprised to find that it's not straightforwardly better - read Technical Update 1, pp. 36-45.
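A possible way to place the struct in CCMRAM with GCC is sketched below, under several assumptions: the part actually has CCM RAM (not every 180 MHz F4 does), the linker script defines an output section named `.ccmram` (the name is an assumption and must match your script), and the field order is the one inferred from the offsets in the posted listing, since the real `LagCompensator_instance` definition is not shown. `LagCompensator_init` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Layout inferred from the disassembly: y1 @0, x1 @4, x2 @8,
   p1 @12, p2 @16, p3 @20, p4 @24 (an assumption, see above). */
typedef struct {
    float y1, x1, x2;
    float p1, p2, p3, p4;
} LagCompensator_instance;

/* Section attribute only when building for ARM with GCC, so the
   sketch also compiles on a PC. ".ccmram" is an assumed name. */
#if defined(__arm__) && defined(__GNUC__)
#define CCMRAM_DATA __attribute__((section(".ccmram")))
#else
#define CCMRAM_DATA
#endif

/* Typical startup code does not copy initial values into CCMRAM,
   so the instance is filled in at run time instead of statically. */
static CCMRAM_DATA LagCompensator_instance lag_filter;

static void LagCompensator_init(LagCompensator_instance *S,
                                float p1, float p2, float p3, float p4)
{
    assert(S != NULL);
    S->y1 = 0.0f;
    S->x1 = 0.0f;
    S->x2 = 0.0f;
    S->p1 = p1;
    S->p2 = p2;
    S->p3 = p3;
    S->p4 = p4;
}
```

Call `LagCompensator_init(&lag_filter, ...)` once before the first `LagCompensator_process(&lag_filter, input)`. Keep in mind that on the F4, CCMRAM is reachable by the CPU only, so buffers that DMA has to access cannot live there.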

JW

phkloth
Associate II

ok thank you very much!