STM32H7 faster using floats than using uint32_t

JLope.11 · ‎2024-04-08

I wrote the following two versions of the same routine to process DMA data: one using floats, and a second using uint32_t in an attempt to make it run faster. To my surprise, the first one ran faster:

//161
void process_data_ADC1()
{
	float samples1=(2.0f/(float)ADC_BUF1);  // =1/samples
	uint32_t media1=0,media2=0;
	for (int i=0;i<ADC_BUF1/16;i++)
		{
			for (int j=0;j<8;j++)
				{
					media1+=buffer_ADC1[i*16+j];
					media2+=buffer_ADC1[i*16+j+8];
				}
		}
	float med1=(float) media1*samples1;
	float med2=(float) media2*samples1;
	avg16[1]=(uint16_t) (16.0f*med1+0.5f);
	avg16[2]=(uint16_t) (16.0f*med2+0.5f);
	float rms1=0.0f,rms2=0.0f,x;
	for (int i=0;i<ADC_BUF1/16;i++)
		{
			for (int j=0;j<8;j++)
				{
					x=(buffer_ADC1[i*16+j]  -med1);rms1+=x*x;// max 2^12 without overflow for 12 bits
					x=(buffer_ADC1[i*16+j+8]-med2);rms2+=x*x;
				}
		}
	rms16[1]=(uint16_t) (16.0f*sqrt(rms1*samples1)+0.5f);
	rms16[2]=(uint16_t) (16.0f*sqrt(rms2*samples1)+0.5f);
}
//161b Using uint32_t instead of float: IT IS SLOWER!
void process_data_ADC1_fast()
{
	float samples1=(2.0f/(float)ADC_BUF1);  // =1/samples
	uint32_t media1=0,media2=0,samples00=ADC_BUF1/2;
	for (int i=0;i<ADC_BUF1/16;i++)
		{
			for (int j=0;j<8;j++)
				{
					media1+=buffer_ADC1[i*16+j];
					media2+=buffer_ADC1[i*16+j+8];
				}
		}
	float med1=(float) media1*samples1;
	float med2=(float) media2*samples1;
	avg16[1]=(uint16_t) (16.0f*med1+0.5f);
	avg16[2]=(uint16_t) (16.0f*med2+0.5f);
	media1=media1/samples00;media2=media2/samples00;
	uint32_t rms1=0,rms2=0,x;
	for (int i=0;i<ADC_BUF1/16;i++)
		{
			for (int j=0;j<8;j++)
				{
					x=(buffer_ADC1[i*16+j]  -med1);rms1+=x*x;// max 2^12 without overflow for a 12-bit ADC and 32-bit variables
					x=(buffer_ADC1[i*16+j+8]-med2);rms2+=x*x;
				}
		}
	x=16.0f*sqrt((float) rms1*samples1);rms16[1]= (uint16_t) (x+0.5f);
	x=16.0f*sqrt((float) rms2*samples1);rms16[2]= (uint16_t) (x+0.5f);
}

This is the routine used to measure time:

uint32_t measure_time(void)
{
	uint32_t static start = 0;
	uint32_t time2= SysTick->VAL;
	time2=start-time2;
	//DELAY_US(10);
	start=SysTick->VAL;
	return (time2);
}

(It surprised me that the SysTick timer runs backward.)

In debug mode it took:

49321 ticks for the float routine

49321 ticks for the uint32_t routine

Andrew Neil · ‎2024-04-08

@JLope.11 wrote:
one using floats and the second using uint32_t to try run faster, but as a surprise the first one run faster:

but then

@JLope.11 wrote:
It took in debug mode:
49321 ticks the float routine
49321 tics the uint32_t routine

So they actually take the same time?

On a CPU with a hardware floating-point unit, I don't think that's necessarily surprising?
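One detail in the posted `process_data_ADC1_fast()` may also matter: its second loop still subtracts the float `med1`/`med2` rather than the integer `media1`/`media2`, so each sample is still converted to float and back. For comparison, here is a minimal sketch of an integer-only accumulation. It is not the original poster's code: `acc_t`, `accumulate` and `variance_times_n2` are hypothetical names, and it assumes a contiguous buffer of 12-bit samples instead of the interleaved layout used above.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical accumulator type (not from the original post). */
typedef struct {
    uint32_t sum;     /* sum of samples; fits 32 bits for n < 2^20 12-bit samples */
    uint64_t sum_sq;  /* sum of squared samples; 64 bits to avoid overflow */
} acc_t;

/* Single pass over the buffer, integer arithmetic only. */
static acc_t accumulate(const uint16_t *buf, size_t n)
{
    acc_t a = {0u, 0u};
    for (size_t i = 0; i < n; i++) {
        uint32_t s = buf[i];
        a.sum += s;
        a.sum_sq += (uint64_t)s * s;
    }
    return a;
}

/* Returns n^2 * variance = n * sum_sq - sum^2 (population variance).
   The final division and square root can be done once, outside the loop. */
static uint64_t variance_times_n2(acc_t a, size_t n)
{
    return (uint64_t)n * a.sum_sq - (uint64_t)a.sum * a.sum;
}
```

Whether this is actually faster than the float version on a Cortex-M7 with an FPU would have to be measured; the 64-bit accumulation is not free either.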

A complex system that works is invariably found to have evolved from a simple system that worked.
A complex system designed from scratch never works and cannot be patched up to make it work.

AScha.3 · ‎2024-04-08

Your first "time2" is

time2=start-time2;

On the first call this is time2 = 0 - SysTick, which is negative. Is that OK?

On later calls, time2 = old SysTick - new SysTick. ~~Also negative~~ (edited)
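To make the subtraction concrete, here is a minimal sketch of an elapsed-tick calculation for a down-counter that reloads after reaching zero. `downcount_elapsed` is a hypothetical helper, not part of CMSIS or HAL, and it assumes the counter wrapped at most once between the two reads.

```c
#include <stdint.h>

/* Hypothetical helper: ticks elapsed between two reads of a down-counter
   that counts reload, reload-1, ..., 0 and then starts again at reload.
   The result is only meaningful if at most one wrap occurred. */
static uint32_t downcount_elapsed(uint32_t start, uint32_t now, uint32_t reload)
{
    if (start >= now) {
        return start - now;               /* no wrap between the reads */
    }
    return start + (reload + 1u) - now;   /* counter wrapped once */
}

/* On the target this would be called roughly as:
 *   uint32_t t0 = SysTick->VAL;
 *   ... code under test ...
 *   uint32_t ticks = downcount_elapsed(t0, SysTick->VAL, SysTick->LOAD);
 */
```

If the measured interval can exceed one reload period, the result is ambiguous and a wider counter is a better choice.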

Tesla DeLorean · ‎2024-04-08

Yes, SysTick counts down, is only 24 bits wide, and often has a DIV8 prescaler.

However, DWT CYCCNT is 32 bits wide and counts processor cycles upward.
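A minimal sketch of how the cycle counter might be used, assuming a CMSIS device header has been included before this snippet so that `DWT` and `CoreDebug` are defined. `cycles_elapsed` and `cyccnt_enable` are hypothetical helper names. On some Cortex-M7 parts the DWT may also need to be unlocked before the counter can be enabled; that step is not shown here.

```c
#include <stdint.h>

/* Unsigned subtraction is wrap-safe for an up-counter, provided fewer
   than 2^32 cycles elapsed between the two reads. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

#if defined(DWT) && defined(CoreDebug)  /* only with a CMSIS device header */
/* Hypothetical helper: turn on the trace block and start CYCCNT. */
static inline void cyccnt_enable(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0u;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
}
#endif

/* On the target, roughly:
 *   cyccnt_enable();
 *   uint32_t t0 = DWT->CYCCNT;
 *   process_data_ADC1();
 *   uint32_t cycles = cycles_elapsed(t0, DWT->CYCCNT);
 */
```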

Andrew Neil · ‎2024-04-08

@JLope.11 wrote:
It surprised to me that the systick timer runs backward

It is a down-counter:

https://developer.arm.com/documentation/dui0646/c/Cortex-M7-Peripherals/System-timer--SysTick#:~:text=counts%20down%20from%20the%20reload%20value%20to%20zero