phkloth
Associate II
August 16, 2022
Solved

Cycles or duration of float Multiplication on STM32F4

Hi all,

I have a question concerning the duration of float multiplication on the STM32F4. As far as I know, the chip has an FPU.

I run the following code snippet

float LagCompensator_process(LagCompensator_instance* S, float input)
{
 
	S->x2 = input;
 
	LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
 
	S->y1 = S->p1 *	(-S->p2 * S->y1 +
					 S->p3 * S->x2 +
					 S->p4 * S->x1
					);
	LL_GPIO_ResetOutputPin(GPIOC, LL_GPIO_PIN_8);
	S->x1 = S->x2;
 
	return S->y1;
 
}

The duration between the pin set and reset is about 170ns. The processor runs at 180MHz.

I find 170ns for 4 multiplications pretty long. However, it is with float variables. Can anybody comment on this? Do I have to activate the FPU somewhere in the project?

In the project properties, you can find

[Image: 0693W00000QNn4MQAT.png, screenshot of the project properties]

Thanks in advance for your help!

LCE
Principal II
August 16, 2022

It looks like the FPU is active.

170 ns is not too bad, that's about 30 clock cycles at 180 MHz.

There are not only the 4 multiplications, also some adding, and getting data to the GPIOs also takes some cycles.

LCE
Principal II
August 16, 2022

If you need to know the exact amount of clock cycles, use the cycle counter.

That's from my F7 code (including reading from the PTP registers, you should skip that...):

/* CPU cycle count activation for debugging */
#if DEBUG_CPU_TIMING
	CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
	DWT->LAR = 0xC5ACCE55;
	DWT->CYCCNT = 0;
	DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
	DWT->CTRL |= DWT_CTRL_PCSAMPLENA_Msk;
#endif
 
/* debug timing cpu-cycle check */
/* START */
#if DEBUG_CPU_TIMING
	uint32_t u32StartNanoSec = ETH->PTPTSLR;
 
	asm volatile ("" : : : "memory");
	volatile uint32_t u32DwtCycCnt1 = DWT->CYCCNT;
	asm volatile ("" : : : "memory");
#endif
 
	/* ... process you want to check ... */
 
/* debug timing cpu-cycle check */
/* STOP */
#if DEBUG_CPU_TIMING
	asm volatile ("" : : : "memory");
	volatile uint32_t u32DwtCycCnt2 = DWT->CYCCNT;
	asm volatile ("" : : : "memory");
 
	uint32_t u32StopNanoSec = ETH->PTPTSLR;
#endif
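
The two `DWT->CYCCNT` samples taken above still have to be turned into a duration. A minimal sketch of that step follows; the helper names `cyc_elapsed` and `cyc_to_ns` are made up for illustration and are not part of CMSIS or the ST libraries, and the 180 MHz core clock is the figure quoted in the question.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of CMSIS or the ST LL/HAL drivers. */

/* Elapsed cycles between two CYCCNT samples. Unsigned subtraction gives
 * the right result even if the 32-bit counter wrapped once in between. */
static inline uint32_t cyc_elapsed(uint32_t start, uint32_t stop)
{
    return stop - start;
}

/* Convert a cycle count to nanoseconds for a core clock given in Hz,
 * rounded to the nearest nanosecond. */
static inline uint32_t cyc_to_ns(uint32_t cycles, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000ull + core_hz / 2u)
                      / core_hz);
}
```

On the target this would be called as `cyc_to_ns(cyc_elapsed(u32DwtCycCnt1, u32DwtCycCnt2), 180000000u)`. Taking the two samples back to back with nothing in between gives the overhead of the measurement itself, which can then be subtracted.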

phkloth (Author)
Associate II
August 16, 2022

Hey LCE,

thank you very much for the comment and your help. The counter for cycles looks interesting.

However, I am quite a beginner in MCU programming. I am happy that I can manage the Low-Level (LL) drivers of the HAL library. Can you give me some advice on how to implement and use your suggested routine?

BTW: The GPIO Set and Reset take about 6ns.

waclawek.jan
Super User
August 16, 2022

ARM is a typical RISC, i.e. a load/store architecture. In other words, every variable has to be loaded from memory, then the arithmetic is performed, then the result is stored back into memory. Load/store may take up a significant portion of the process.

Also, fetching instructions takes time.

Show the disasm of the operation in question.

JW

phkloth (Author)
Associate II
August 16, 2022

Hey JW,

you are right. It is not the mathematical operations that take time. It is the loading and saving of data.

float LagCompensator_process(LagCompensator_instance* S, float input)
{
 
	S->x2 = input;
 
	float a=4.3653;
	float b=433.5356;
	float c=0;
 
	LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
	S->y1=3;
	LL_GPIO_ResetOutputPin(GPIOC, LL_GPIO_PIN_8);
	
	LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
	c=a*b;
	LL_GPIO_ResetOutputPin(GPIOC, LL_GPIO_PIN_8);
 
	S->y1 = S->p1 *	(-S->p2 * S->y1 +
					 S->p3 * S->x2 +
					 S->p4 * S->x1
					);
	S->x1 = S->x2;
 
	return S->y1;
 
}

the first operation (saving 3 in s->y1) takes about 40ns. The multiplication of a and b takes only 6ns.

So the problem here is the access through pointers. However, making the variables global and static would really blow up the project, and it would also make the code much less readable.

Is there another tip that I have not considered here that would improve the runtime?

waclawek.jan
Super User
August 16, 2022

> the first operation (saving 3 in s->y1) takes about 40ns.

If you see 40ns there, something may be wrong. The save itself should be hidden by buffering at the processor/busmatrix interface.

As I've said, post disasm of the original code as basis for further discussion.

JW

phkloth (Author)
Associate II
August 16, 2022

Hey, sorry. Here is the disassembly of this function:

float LagCompensator_process(LagCompensator_instance* S, float input)
{
 
	S->x2 = input;
 8002b4c:	ed80 0a02 	vstr	s0, [r0, #8]
 8002b50:	4b13 	ldr	r3, [pc, #76]	; (8002ba0 <LagCompensator_process+0x54>)
 8002b52:	f44f 7280 	mov.w	r2, #256	; 0x100
 8002b56:	619a 	str	r2, [r3, #24]
	float a=4.3653;
	float b=433.5356;
	float c=0;
 
	LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
	S->y1=3;
 8002b58:	4a12 	ldr	r2, [pc, #72]	; (8002ba4 <LagCompensator_process+0x58>)
 8002b5a:	6002 	str	r2, [r0, #0]
 WRITE_REG(GPIOx->BSRR, (PinMask << 16));
 8002b5c:	f04f 7280 	mov.w	r2, #16777216	; 0x1000000
 8002b60:	619a 	str	r2, [r3, #24]
	LL_GPIO_ResetOutputPin(GPIOC, LL_GPIO_PIN_8);
 
	S->y1 = S->p1 *	(-S->p2 * S->y1 +
 8002b62:	edd0 7a03 	vldr	s15, [r0, #12]
 8002b66:	ed90 0a04 	vldr	s0, [r0, #16]
 8002b6a:	ed90 7a00 	vldr	s14, [r0]
 8002b6e:	ee20 0a47 	vnmul.f32	s0, s0, s14
					 S->p3 * S->x2 +
 8002b72:	ed90 7a05 	vldr	s14, [r0, #20]
 8002b76:	edd0 6a02 	vldr	s13, [r0, #8]
 8002b7a:	ee27 7a26 	vmul.f32	s14, s14, s13
	S->y1 = S->p1 *	(-S->p2 * S->y1 +
 8002b7e:	ee30 0a07 	vadd.f32	s0, s0, s14
					 S->p4 * S->x1
 8002b82:	ed90 7a06 	vldr	s14, [r0, #24]
 8002b86:	ed90 6a01 	vldr	s12, [r0, #4]
 8002b8a:	ee27 7a06 	vmul.f32	s14, s14, s12
					 S->p3 * S->x2 +
 8002b8e:	ee30 0a07 	vadd.f32	s0, s0, s14
	S->y1 = S->p1 *	(-S->p2 * S->y1 +
 8002b92:	ee27 0a80 	vmul.f32	s0, s15, s0
 8002b96:	ed80 0a00 	vstr	s0, [r0]
					);
	S->x1 = S->x2;
 8002b9a:	edc0 6a01 	vstr	s13, [r0, #4]
 
	return S->y1;
 
}
 8002b9e:	4770 	bx	lr
 8002ba0:	40020800 	.word	0x40020800
 8002ba4:	40400000 	.word	0x40400000

waclawek.jan (Best answer)
Super User
August 16, 2022

There's no problem with pointers here. S is a parameter, so it is in r0 at function entry; note that the very first instruction uses it straightforwardly.

If S points to RAM instead of CCMRAM, there's a penalty: reading from RAM, like reading from any resource through the S-port of the processor, automatically involves one extra waitstate. Try moving the target struct to CCMRAM.
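
A minimal sketch of what moving the struct to CCMRAM could look like with GCC follows. The struct layout is inferred from the offsets in the disassembly above (y1 at 0, x1 at 4, x2 at 8, p1 to p4 at 12 to 24) and may differ from the real definition. The section name `.ccmram` is an assumption that has to match the project's linker script, and the names `CCMRAM_DATA`, `lag_horiz` and `LagCompensator_init` are made up for illustration.

```c
#include <stddef.h>

/* Layout inferred from the disassembly offsets; the real definition
 * in the project may differ. */
typedef struct {
    float y1, x1, x2;
    float p1, p2, p3, p4;
} LagCompensator_instance;

/* ASSUMPTION: the linker script provides an output section named
 * ".ccmram" that is mapped to the CCM RAM. On any other build
 * (e.g. a host test build) the macro expands to nothing. */
#if defined(__GNUC__) && defined(__arm__)
#define CCMRAM_DATA __attribute__((section(".ccmram")))
#else
#define CCMRAM_DATA
#endif

/* Hypothetical instance placed in CCM RAM. */
CCMRAM_DATA LagCompensator_instance lag_horiz;

/* Set the coefficients and clear the state at run time, so the sketch
 * does not depend on the startup code copying initial values into
 * the CCM RAM section. */
void LagCompensator_init(LagCompensator_instance *S,
                         float p1, float p2, float p3, float p4)
{
    S->y1 = 0.0f;
    S->x1 = 0.0f;
    S->x2 = 0.0f;
    S->p1 = p1;
    S->p2 = p2;
    S->p3 = p3;
    S->p4 = p4;
}
```

Keep in mind that on the STM32F4 the CCM RAM is reachable only by the CPU, not by DMA, so buffers filled by DMA (for example SPI data) should stay in normal SRAM.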

> > the first operation (saving 3 in s->y1) takes about 40ns.

> 8002b58: 4a12 ldr r2, [pc, #72] ; (8002ba4 <LagCompensator_process+0x58>)

This is a read over the data bus of the FLASH. If this value hasn't been cached (see ART) and you run at 180MHz, thus probably with 5 FLASH waitstates, the load alone takes 6 cycles, plus one more cycle for the ldr execution, so that's 7 cycles = 39ns.

Btw. note that the compiler did not care to use the float registers for this operation - and 0x4040'0000 is indeed 3.0f.

IMO the above sequence is as optimal as it gets. 170ns is around 30 cycles. There are 4 muls and 2 adds, plus 7 loads and two stores. Loads are naturally 2 cycles; with an additional waitstate because of loading through the S-port, it's 3 cycles per load. That's 4+2+3*7+2=29 cycles, plus some odd cycles because the 6-cycle FLASH latency can't supply a sustained stream of 32-bit instructions.
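
The estimate above can be written out as a small calculation. This is only a back-of-the-envelope sketch using the per-operation costs stated in this post (1 cycle per mul, add and store, 3 cycles per load through the S-port); the type and function names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical description of the instruction mix of a code sequence. */
typedef struct {
    uint32_t muls;
    uint32_t adds;
    uint32_t loads;
    uint32_t stores;
    uint32_t cycles_per_load;   /* 2 natural cycles + 1 S-port waitstate = 3 */
} op_mix;

/* Sum of the per-operation costs: 1 cycle per mul, add and store,
 * cycles_per_load for each load. FLASH fetch stalls are not modelled. */
static uint32_t estimate_cycles(op_mix m)
{
    return m.muls + m.adds + m.loads * m.cycles_per_load + m.stores;
}
```

With the mix from the disassembly, `{4, 2, 7, 2, 3}`, this gives 29 cycles, which at 180MHz is roughly 161ns, close to the measured 170ns.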

As I've said, you can try to move the data struct to CCMRAM. You can also try to run code from RAM, but you may be surprised to find that it's not straightforwardly better - read Technical Update 1, pp. 36-45.

JW

phkloth (Author)
Associate II
August 16, 2022

ok thank you very much!

phkloth (Author)
Associate II
August 17, 2022

Another question regarding the runtime of functions. Maybe this is normal. But is there any way to speed it up?

The function call itself costs 200ns, measured by setting the pin before the call and resetting it at function entry, like this:

void Filtering_function(float input)
{
	S->input = input;

	LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);

	LagCompensator_process(&S->LagCompensator2_horiz, S->input);
}

/* Returns the new output, so the return type must be float, not void. */
float LagCompensator_process(LagCompensator_instance* S, float input)
{
	LL_GPIO_ResetOutputPin(GPIOC, LL_GPIO_PIN_8);

	S->y1 = S->p1 *	(-S->p2 * S->y1 +
					 S->p3 * input +
					 S->p4 * S->x1
					);
	S->x1 = S->x2;	/* note: S->x2 is not assigned in this version */

	return S->y1;
}

The disassembly of the function call looks like this:

//
			LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
 
			LagCompensator_process(&S->LagCompensator2_horiz, &S->ControllerOutFloat);
 8004306:	f504 70aa 	add.w	r0, r4, #340	; 0x154
 800430a:	619a 	str	r2, [r3, #24]
 800430c:	f7fe f9ea 	bl	80026e4 <LagCompensator_process>

I am doing extensive Filtering on 8 input channels, filtering each input with 6 filters.

If each filter call costs 400ns, this results in 8*6*400ns = 19.2us, i.e. roughly 20us.

Optimizing the filter call would give me a lot of headroom for other operations.
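
One option that is often tried for call overhead, and which is not discussed elsewhere in this thread, is to define the filter as `static inline` so that the compiler can fold it into the caller. The sketch below is only an illustration under several assumptions: the struct layout is inferred from the disassembly offsets, the six filters of a channel are assumed to run in series, `filter_channel`, `N_CHANNELS` and `N_FILTERS` are made-up names, and `S->x1` is updated directly from `input` (equivalent to the first version of the function, where `x2` was set to `input` first). Whether it actually saves time has to be measured.

```c
/* Layout inferred from the disassembly offsets; the real definition
 * in the project may differ. */
typedef struct {
    float y1, x1, x2;
    float p1, p2, p3, p4;
} LagCompensator_instance;

#define N_CHANNELS 8   /* figures taken from the post */
#define N_FILTERS  6

/* Same arithmetic as in the thread; "static inline" lets the compiler
 * place the body directly in the caller instead of emitting a bl. */
static inline float LagCompensator_process(LagCompensator_instance *S,
                                           float input)
{
    float y = S->p1 * (-S->p2 * S->y1 +
                        S->p3 * input +
                        S->p4 * S->x1);
    S->x1 = input;   /* previous input for the next step */
    S->y1 = y;       /* previous output for the next step */
    return y;
}

/* ASSUMPTION: the filters of one channel are chained in series. */
static float filter_channel(LagCompensator_instance bank[N_FILTERS],
                            float sample)
{
    for (int i = 0; i < N_FILTERS; ++i) {
        sample = LagCompensator_process(&bank[i], sample);
    }
    return sample;
}
```

The function has to be visible in the same translation unit as its caller (for example in a header) for this to work without link-time optimization.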

At the moment:

ADC read, 8 channels (SPI): 10us

DAC write, 8 channels (SPI): 15us

Calculations: 20us

-> in total 45us

So my loop rate is roughly 20kHz (1/45us is about 22kHz). As I am doing feedback control, every kHz counts :)
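
The budget above can be checked with a few lines of arithmetic. This is a sketch using only the figures quoted in this post; the function names are made up for illustration.

```c
/* Loop period in microseconds: ADC and DAC transfer times plus
 * channels * filters filter calls of filter_ns nanoseconds each. */
static double loop_period_us(double adc_us, double dac_us,
                             int channels, int filters, double filter_ns)
{
    return adc_us + dac_us + channels * filters * filter_ns / 1000.0;
}

/* Loop rate in kHz for a period given in microseconds. */
static double loop_rate_khz(double period_us)
{
    return 1000.0 / period_us;
}
```

With 10us, 15us, 8 channels, 6 filters and 400ns per call, the period comes out at 44.2us, or about 22.6kHz. If each filter took only the 170ns measured for the arithmetic itself, the period would drop to about 33.2us, roughly 30kHz. In both cases the 25us of SPI transfers is the larger part of the budget.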