Cycles or duration of float Multiplication on STM32F4

phkloth · ‎2022-08-16

Hi all,

I have a question concerning the duration of float multiplication on the STM32F4. As far as I know, the chip has an FPU.

I run the following code snippet:

float LagCompensator_process(LagCompensator_instance* S, float input)
{
 
	S->x2 = input;
 
	LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
 
	S->y1 = S->p1 *	(-S->p2 * S->y1 +
					  S->p3 * S->x2 +
					  S->p4 * S->x1
					);
	LL_GPIO_ResetOutputPin(GPIOC, LL_GPIO_PIN_8);
	S->x1 = S->x2;
 
	return S->y1;
 
}

The duration between the pin set and reset is about 170ns. The processor runs at 180MHz.
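To put the measurement in core-clock terms, the time can be converted to cycles. This is only a sketch of the arithmetic, assuming the core clock really is the stated 180MHz; `ns_to_cycles` is a hypothetical helper name, not part of any ST library.

```c
/* Convert a measured duration in nanoseconds to core clock cycles.
 * Hypothetical helper for illustration only. */
static double ns_to_cycles(double duration_ns, double core_clock_hz)
{
	return duration_ns * 1e-9 * core_clock_hz;
}
```

`ns_to_cycles(170.0, 180e6)` gives about 30.6 cycles. That window covers not only the 4 multiplications but also the loads of the struct members, the store of `S->y1`, and the GPIO write itself.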

I find 170ns for 4 multiplications pretty long, even though they operate on float variables. Can anybody comment on this? Do I have to activate the FPU somewhere in the project?
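On the FPU question: on a Cortex-M4 the FPU is only used if the compiler generates FPU instructions (with GCC, options such as `-mfpu=fpv4-sp-d16 -mfloat-abi=hard`) and if the CP10/CP11 access bits in `SCB->CPACR` are set at startup, which ST's `SystemInit()` normally does when `__FPU_PRESENT` and `__FPU_USED` are both 1. The sketch below shows one way to check both conditions. The helper names are hypothetical, and it assumes a compiler that follows the ARM C Language Extensions (for the `__ARM_FP` macro).

```c
#include <stdint.h>

/* Full-access bits for coprocessors CP10 and CP11 in the Cortex-M4 CPACR. */
#define CPACR_FPU_FULL_ACCESS ((3UL << (10 * 2)) | (3UL << (11 * 2)))

/* Hypothetical helper: returns 1 if a CPACR value grants full FPU access.
 * On the target, pass SCB->CPACR (from the CMSIS device header). */
static int fpu_access_enabled(uint32_t cpacr)
{
	return (cpacr & CPACR_FPU_FULL_ACCESS) == CPACR_FPU_FULL_ACCESS;
}

/* Hypothetical helper: returns 1 if this file was compiled for hardware
 * single-precision floating point (ACLE macro __ARM_FP, bit 2). */
static int built_for_hw_float(void)
{
#if defined(__ARM_FP) && (__ARM_FP & 0x4)
	return 1;
#else
	return 0;
#endif
}
```

If `built_for_hw_float()` returns 0 on the target, the multiplications are being done by software library calls, which would be far slower than the time measured here. Looking at the disassembly for `vmul.f32` instructions is another quick check.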

In the project properties, you can find

Thanks in advance for your help!

phkloth · ‎2022-08-17

Another question, regarding the runtime of function calls. Maybe this is normal, but is there any way to speed it up?

The function call itself costs 200ns when setting the pin on/off like this:

void Filtering_function(float input)
{
	S->input = input;
 
	LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
 
	LagCompensator_process(&S->LagCompensator2_horiz, S->input);
 
}
 
 
 
float LagCompensator_process(LagCompensator_instance* S, float input)
{
 
	LL_GPIO_ResetOutputPin(GPIOC, LL_GPIO_PIN_8);
 
	S->y1 = S->p1 *	(-S->p2 * S->y1 +
					  S->p3 * input +
					  S->p4 * S->x1
					);
	S->x1 = input;	/* S->x2 is never written in this version, so store the input directly */
 
	return S->y1;
 
}

The assembly code for the function call looks like this:

//
			LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
 
			LagCompensator_process(&S->LagCompensator2_horiz, &S->ControllerOutFloat);
 8004306:	f504 70aa 	add.w	r0, r4, #340	; 0x154
 800430a:	619a      	str	r2, [r3, #24]
 800430c:	f7fe f9ea 	bl	80026e4 <LagCompensator_process>

I am doing extensive filtering on 8 input channels, filtering each input with 6 filters.

If each filter call costs 400ns, this will result in 8*6*400ns = 19.2us, i.e. roughly 20us.

Optimizing the filter call would give me a lot of headroom for other operations.

At the moment:

ADC read 8 channels 10us (SPI)

DAC write 8 channels 15us (SPI)

Calculations 20us

-> in total 45us

So my loop rate is roughly 20kHz (1/45us is about 22kHz). As I am doing feedback control, every kHz counts :)
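The budget above can be written down as a small calculation, which makes it easy to see what a faster filter call would buy. This is only a sketch of the arithmetic from this post; both function names are hypothetical.

```c
/* Total filter time in microseconds for a given per-call cost.
 * Hypothetical helper for the arithmetic in the post. */
static double filter_time_us(int channels, int filters_per_channel,
			     double per_call_ns)
{
	return channels * filters_per_channel * per_call_ns / 1000.0;
}

/* Loop rate in kHz when ADC read, DAC write and calculations run
 * back to back (no overlap assumed). */
static double loop_rate_khz(double adc_us, double dac_us, double calc_us)
{
	return 1000.0 / (adc_us + dac_us + calc_us);
}
```

With the numbers from the post, `filter_time_us(8, 6, 400.0)` is 19.2us and `loop_rate_khz(10.0, 15.0, 20.0)` is about 22.2kHz. If the per-call cost could be halved to 200ns, the filter time would drop to 9.6us and the loop rate would rise to about 28.9kHz, so the SPI transfers would then dominate the budget.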

waclawek.jan · ‎2022-08-17

The call (bl) itself won't take 72 cycles under any reasonable scenario, so it's probably something at the entry of the called function, maybe stacking of registers. The duration of that then depends not only on the number of registers to stack, but also on the type of memory where the stack is located. Again, the stack should be in CCMRAM. But it may be something else, too.

Generally, the steps to improve performance are loop unrolling and function inlining (either through what the compiler provides, or explicitly), plus moving things around in various memories, as I've already told you above. Note that the compiler compiles things differently when circumstances change. For ultimate control, resort to asm.
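A minimal sketch of explicit inlining for the filter from this thread, assuming a GCC-compatible compiler (for `__attribute__((always_inline))`). The struct layout, the `_inline` suffix, `NUM_FILTERS` and `filter_chain` are hypothetical names for illustration, and running the 6 filters in series is an assumption, since the thread does not say how they are combined.

```c
/* Assumed layout; the thread only shows the member names p1..p4, x1, y1. */
typedef struct {
	float p1, p2, p3, p4;	/* filter coefficients */
	float x1;		/* previous input */
	float y1;		/* previous output */
} LagCompensator_instance;

/* Same difference equation as in the thread, forced inline so that no
 * bl and no register stacking happen per filter. */
static inline __attribute__((always_inline))
float LagCompensator_process_inline(LagCompensator_instance *S, float input)
{
	S->y1 = S->p1 * (-S->p2 * S->y1 +
			  S->p3 * input +
			  S->p4 * S->x1);
	S->x1 = input;
	return S->y1;
}

#define NUM_FILTERS 6	/* 6 filters per channel, as stated in the thread */

/* Runs one sample through all filters of a channel (series connection
 * assumed). With a constant trip count the compiler can unroll the loop. */
float filter_chain(LagCompensator_instance filters[NUM_FILTERS], float sample)
{
	for (int i = 0; i < NUM_FILTERS; ++i)
		sample = LagCompensator_process_inline(&filters[i], sample);
	return sample;
}
```

Whether this actually removes the measured overhead depends on optimization level and memory placement, so it is worth re-measuring and checking the disassembly after the change.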

Performance doesn't come cheap.

JW

phkloth · ‎2022-08-17

Ok. It is at least nice to know that you would expect faster runtimes normally. So I will play around a bit and try to increase the performance.

Thank you very much for your help Jan!