2022-08-16 03:21 AM
Hi all,
I have a question concerning the duration of float multiplication on the STM32F4. As far as I know, the chip has an FPU.
I run the following code snippet:
float LagCompensator_process(LagCompensator_instance* S, float input)
{
    S->x2 = input;
    LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
    S->y1 = S->p1 * (-S->p2 * S->y1 +
                      S->p3 * S->x2 +
                      S->p4 * S->x1
                     );
    LL_GPIO_ResetOutputPin(GPIOC, LL_GPIO_PIN_8);
    S->x1 = S->x2;
    return S->y1;
}
The time between setting and resetting the pin is about 170 ns. The processor runs at 180 MHz.
I find 170 ns for 4 multiplications pretty long, even though they are float multiplications. Can anybody comment on this? Do I have to activate the FPU somewhere in the project?
In the project properties, you can find
Thanks in advance for your help!
2022-08-16 03:29 AM
It looks like the FPU is active.
170 ns is not too bad, that's about 30 clock cycles at 180 MHz.
Besides the 4 multiplications there are also some additions, and writing to the GPIO takes some cycles too.
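The conversion behind the "about 30 clock cycles" figure is just duration times clock frequency. A minimal sketch in C, where the function name `ns_to_cycles` is made up for illustration and is not part of any ST library:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a measured duration in nanoseconds to CPU clock cycles,
   rounded to the nearest whole cycle. Hypothetical helper. */
static uint32_t ns_to_cycles(uint32_t duration_ns, uint32_t cpu_hz)
{
    assert(cpu_hz > 0u);
    /* 64-bit intermediate so duration_ns * cpu_hz cannot overflow */
    uint64_t scaled = (uint64_t)duration_ns * cpu_hz;
    /* add half a second's worth of ns to round instead of truncate */
    return (uint32_t)((scaled + 500000000u) / 1000000000u);
}
```

With the numbers from this thread, 170 ns at 180 MHz is 30.6 cycles, so the helper rounds it to 31.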
2022-08-16 03:33 AM
If you need to know the exact amount of clock cycles, use the cycle counter.
That's from my F7 code (including reading from the PTP registers, you should skip that...):
/* CPU cycle count activation for debugging */
#if DEBUG_CPU_TIMING
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->LAR = 0xC5ACCE55;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
DWT->CTRL |= DWT_CTRL_PCSAMPLENA_Msk;
#endif
/* debug timing cpu-cycle check */
/* START */
#if DEBUG_CPU_TIMING
uint32_t u32StartNanoSec = ETH->PTPTSLR;
asm volatile ("" : : : "memory");
volatile uint32_t u32DwtCycCnt1 = DWT->CYCCNT;
asm volatile ("" : : : "memory");
#endif
...process you want to check...
/* debug timing cpu-cycle check */
/* STOP */
#if DEBUG_CPU_TIMING
asm volatile ("" : : : "memory");
volatile uint32_t u32DwtCycCnt2 = DWT->CYCCNT;
asm volatile ("" : : : "memory");
uint32_t u32StopNanoSec = ETH->PTPTSLR;
#endif
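Once the two CYCCNT snapshots are taken, the elapsed count still has to be computed and, if wanted, converted to time. A minimal sketch of that step, assuming CYCCNT is read as a free-running 32-bit counter; `cycles_elapsed` and `cycles_to_ns` are hypothetical helper names, not CMSIS functions:

```c
#include <assert.h>
#include <stdint.h>

/* Cycles between two counter snapshots. Unsigned subtraction is
   modulo 2^32, so the result stays correct across one wraparound. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t stop)
{
    return (uint32_t)(stop - start);
}

/* Convert a cycle count to nanoseconds, rounded to the nearest ns. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz > 0u);
    uint64_t scaled = (uint64_t)cycles * 1000000000u;
    return (uint32_t)((scaled + cpu_hz / 2u) / cpu_hz);
}
```

With the variables from the snippet above, this would be called as `cycles_elapsed(u32DwtCycCnt1, u32DwtCycCnt2)`. The measurement itself costs a few cycles, so it may be worth timing an empty section once and subtracting that.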
2022-08-16 03:44 AM
Hey LCE,
Thank you very much for the comment and your help. The cycle counter looks interesting.
However, I am quite a beginner in MCU programming. I am happy that I can manage the Low-Layer (LL) functions of the HAL library. Can you give me some advice on how to implement and use your suggested routine?
BTW: The GPIO set and reset alone take about 6 ns.
2022-08-16 04:07 AM
ARM is a typical RISC, i.e. a load/store architecture. In other words, every variable has to be loaded from memory, then the arithmetic is performed, then the result is stored back into memory. Loads and stores may take up a significant portion of the processing time.
Also, fetching instructions takes time.
Show the disassembly of the operation in question.
JW
2022-08-16 04:30 AM
Hey JW,
you are right. It is not the mathematical operations that take time. It is the loading and saving of data.
float LagCompensator_process(LagCompensator_instance* S, float input)
{
    S->x2 = input;
    float a=4.3653;
    float b=433.5356;
    float c=0;
    LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
    S->y1=3;
    LL_GPIO_ResetOutputPin(GPIOC, LL_GPIO_PIN_8);
    LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
    c=a*b;
    LL_GPIO_ResetOutputPin(GPIOC, LL_GPIO_PIN_8);
    S->y1 = S->p1 * (-S->p2 * S->y1 +
                      S->p3 * S->x2 +
                      S->p4 * S->x1
                     );
    S->x1 = S->x2;
    return S->y1;
}
The first operation (storing 3 in S->y1) takes about 40 ns. The multiplication of a and b takes only 6 ns.
So the problem here seems to be the access through the pointer. However, making the variables global and static would really blow up the project, and it would also make the code much less readable.
Is there anything else I have not considered that could improve the runtime?
2022-08-16 05:25 AM
> the first operation (saving 3 in s->y1) takes about 40ns.
If you see 40 ns there, something may be wrong. The store itself should be hidden by buffering at the processor/bus-matrix interface.
As I've said, post the disassembly of the original code as a basis for further discussion.
JW
2022-08-16 05:35 AM
Hey, sorry. Here is the disassembly of this function:
float LagCompensator_process(LagCompensator_instance* S, float input)
{
S->x2 = input;
8002b4c: ed80 0a02 vstr s0, [r0, #8]
8002b50: 4b13 ldr r3, [pc, #76] ; (8002ba0 <LagCompensator_process+0x54>)
8002b52: f44f 7280 mov.w r2, #256 ; 0x100
8002b56: 619a str r2, [r3, #24]
float a=4.3653;
float b=433.5356;
float c=0;
LL_GPIO_SetOutputPin(GPIOC, LL_GPIO_PIN_8);
S->y1=3;
8002b58: 4a12 ldr r2, [pc, #72] ; (8002ba4 <LagCompensator_process+0x58>)
8002b5a: 6002 str r2, [r0, #0]
WRITE_REG(GPIOx->BSRR, (PinMask << 16));
8002b5c: f04f 7280 mov.w r2, #16777216 ; 0x1000000
8002b60: 619a str r2, [r3, #24]
LL_GPIO_ResetOutputPin(GPIOC, LL_GPIO_PIN_8);
S->y1 = S->p1 * (-S->p2 * S->y1 +
8002b62: edd0 7a03 vldr s15, [r0, #12]
8002b66: ed90 0a04 vldr s0, [r0, #16]
8002b6a: ed90 7a00 vldr s14, [r0]
8002b6e: ee20 0a47 vnmul.f32 s0, s0, s14
S->p3 * S->x2 +
8002b72: ed90 7a05 vldr s14, [r0, #20]
8002b76: edd0 6a02 vldr s13, [r0, #8]
8002b7a: ee27 7a26 vmul.f32 s14, s14, s13
S->y1 = S->p1 * (-S->p2 * S->y1 +
8002b7e: ee30 0a07 vadd.f32 s0, s0, s14
S->p4 * S->x1
8002b82: ed90 7a06 vldr s14, [r0, #24]
8002b86: ed90 6a01 vldr s12, [r0, #4]
8002b8a: ee27 7a06 vmul.f32 s14, s14, s12
S->p3 * S->x2 +
8002b8e: ee30 0a07 vadd.f32 s0, s0, s14
S->y1 = S->p1 * (-S->p2 * S->y1 +
8002b92: ee27 0a80 vmul.f32 s0, s15, s0
8002b96: ed80 0a00 vstr s0, [r0]
);
S->x1 = S->x2;
8002b9a: edc0 6a01 vstr s13, [r0, #4]
return S->y1;
}
8002b9e: 4770 bx lr
8002ba0: 40020800 .word 0x40020800
8002ba4: 40400000 .word 0x40400000
2022-08-16 06:16 AM
There's no problem with pointers here. S is a parameter, so it is in r0 at function entry; note that the very first instruction uses it directly.
If S points to RAM instead of CCMRAM, there's a penalty: reading from RAM, like reading from any resource through the processor's S-port, automatically involves one extra waitstate. Try moving the target struct to CCMRAM.
> > the first operation (saving 3 in s->y1) takes about 40ns.
> 8002b58: 4a12 ldr r2, [pc, #72] ; (8002ba4 <LagCompensator_process+0x58>)
This is a read through the data bus of the FLASH. If this value hasn't been cached (see ART), and you run at 180 MHz, thus probably with 5 FLASH waitstates, the load alone takes 6 cycles, and executing the ldr takes one more cycle, so that's 7 cycles = 39 ns.
Btw. note that the compiler did not care to use the float registers for this operation - and 0x4040'0000 is indeed 3.0f.
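The claim that the literal 0x40400000 is 3.0f is easy to check on any machine with IEEE-754 single-precision floats. A small sketch, with `float_bits` being a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the raw 32-bit pattern of a float. Assumes float is an
   IEEE-754 binary32, as on the Cortex-M4 FPU and typical PCs. */
static uint32_t float_bits(float value)
{
    uint32_t bits;
    assert(sizeof value == sizeof bits);
    /* memcpy avoids the aliasing problems of a pointer cast */
    memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

`float_bits(3.0f)` gives 0x40400000: sign 0, biased exponent 128, mantissa 1.5. That is why the compiler can store the constant with a plain integer ldr/str pair and never touch the FPU registers.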
IMO the above sequence is as optimal as it gets. 170 ns is around 30 cycles. There are 4 muls and 2 adds, plus 7 loads and 2 stores. Loads naturally take 2 cycles; with the additional waitstate because of loading through the S-port it's 3 cycles per load, so that's 4+2+3*7+2=29 cycles, plus the odd extra cycle because the FLASH with its 6-cycle latency can't supply a sustained stream of 32-bit instructions.
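The back-of-envelope budget above can be written down as a tiny model. This is only a sketch of the estimate in this post, not a cycle-accurate simulation; `struct op_counts` and `estimate_cycles` are names invented for illustration:

```c
#include <assert.h>

/* Instruction mix of the timed section, counted from the disassembly. */
struct op_counts {
    unsigned muls;   /* vmul/vnmul, 1 cycle each in this estimate */
    unsigned adds;   /* vadd, 1 cycle each in this estimate */
    unsigned loads;  /* vldr, cost depends on where the data lives */
    unsigned stores; /* vstr, 1 cycle each in this estimate */
};

/* Rough cycle estimate: every op costs 1 cycle except loads,
   which cost cycles_per_load. */
static unsigned estimate_cycles(const struct op_counts *ops,
                                unsigned cycles_per_load)
{
    assert(ops != 0);
    return ops->muls + ops->adds
         + ops->loads * cycles_per_load
         + ops->stores;
}
```

With 4 muls, 2 adds, 7 loads and 2 stores, 3 cycles per load gives 29. Under the same model, 2 cycles per load (no extra waitstate) would give 22; that is a prediction of this simple model, not a measurement.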
As I've said, you can try to move the data struct to CCMRAM. You can also try to run the code from RAM, but you may be surprised to find that it's not straightforwardly better - read Technical Update 1, pp. 36-45.
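A hedged sketch of what moving the struct to CCMRAM could look like with GCC, if your part has CCM RAM at all. The section name `.ccmram` is an assumption: it only works if the linker script defines such a section. The struct layout is reconstructed from the offsets in the disassembly (y1 at 0, x1 at 4, x2 at 8, p1 to p4 at 12 to 24), and `CCMRAM_DATA`, `lag_state` and `LagCompensator_init` are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* On the target, place the object in the (assumed) .ccmram section;
   elsewhere the macro expands to nothing so the code still builds. */
#if defined(__arm__) && defined(__GNUC__)
#define CCMRAM_DATA __attribute__((section(".ccmram")))
#else
#define CCMRAM_DATA
#endif

/* Layout reconstructed from the disassembly offsets (assumption). */
typedef struct {
    float y1; /* offset 0  */
    float x1; /* offset 4  */
    float x2; /* offset 8  */
    float p1; /* offset 12 */
    float p2; /* offset 16 */
    float p3; /* offset 20 */
    float p4; /* offset 24 */
} LagCompensator_instance;

static CCMRAM_DATA LagCompensator_instance lag_state;

/* Initialise at run time: depending on the startup code, static
   initialisers for a custom section may not be copied or zeroed. */
static void LagCompensator_init(LagCompensator_instance *S,
                                float p1, float p2, float p3, float p4)
{
    assert(S != NULL);
    S->y1 = 0.0f;
    S->x1 = 0.0f;
    S->x2 = 0.0f;
    S->p1 = p1;
    S->p2 = p2;
    S->p3 = p3;
    S->p4 = p4;
}
```

The calling code stays the same, since `LagCompensator_process(&lag_state, input)` still takes a pointer. Keep in mind that on the F4 the CCM RAM is not reachable by DMA, so this only suits data that the CPU alone touches.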
JW
2022-08-16 06:58 AM
ok thank you very much!