GPIO data bus speed optimization/recommendation(s)

Mecanix
Senior

Greetings, embedded professionals,

I recently took up STM32 and am getting along quite well so far! I regret not having done so earlier, so there is plenty of catching up and learning to do (PS: mechanical engineering background here).

In particular, I'd like to learn how professionals would go about implementing an optimized data bus on a lower-end MCU (e.g. STM32F1x, because of budget limitations). Below is what I have coded so far. It works great, but I feel the performance is lacking somewhere.

Is there any other way I could get 16 bits across faster?

PS: The #define DATAOUT(i) seems to be the bottleneck.

Thanks for any recommendations or alternative approaches you may have. I'd sincerely appreciate a hand.

//LCD.H
#define DATAOUT(i) { \
GPIOA->BSRR = 0b0001000011111111 << 16; \
GPIOA->BSRR = (((i) & (1<<0)) << 0) \
			   | (((i) & (1<<1)) << 0) \
			   | (((i) & (1<<2)) << 0) \
			   | (((i) & (1<<3)) << 0) \
			   | (((i) & (1<<4)) << 0) \
			   | (((i) & (1<<5)) << 0) \
			   | (((i) & (1<<6)) << 0) \
			   | (((i) & (1<<7)) << 0) \
			   | (((i) & (1<<12)) << 0); \
GPIOB->BSRR = 0b1110111100000000 << 16; \
GPIOB->BSRR = (((i) & (1<<8)) << 0) \
			   | (((i) & (1<<9)) << 0) \
			   | (((i) & (1<<10)) << 0) \
			   | (((i) & (1<<11)) << 0) \
			   | (((i) & (1<<13)) << 0) \
			   | (((i) & (1<<14)) << 0) \
			   | (((i) & (1<<15)) << 0); \
}
 
//LCD.C
static void WriteData(uint16_t data)
{
	//elapsed_time_start(0);
	DATAOUT(data);		//elapsed_time@72Mhz -> 81 ticks(?!)
	//elapsed_time_stop(0);
}

1 ACCEPTED SOLUTION

Accepted Solutions
TDK
Guru

81 ticks seems like a lot, but this is going to depend heavily on your optimization settings. Obviously, some of those operations, like the "<< 0" shifts, can be removed.

You can group a bunch of those operations into one and only access BSRR once per port. The register is volatile, so the compiler can't optimize away accesses. That is, assuming you only want to set the pins according to the bits in the argument, and not toggle them twice.

const uint32_t mask = 0b0001000011111111;
GPIOA->BSRR = (mask << 16) | (mask & i);

I'd do that and then look at the disassembly to see what the compiler is doing.
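As a rough illustration of why the single write is equivalent to the original reset-then-set pair, here is a host-side sketch. The names `bsrr_word` and `apply_bsrr` are made up for this example, and `apply_bsrr` is only a model of the register behaviour (on STM32 GPIO, a set bit in the lower half of BSRR takes priority over the matching reset bit in the upper half), not vendor code.

```c
#include <assert.h>
#include <stdint.h>

/* Build one BSRR word: the upper half resets every bus pin on the port,
 * the lower half sets the pins whose data bit is 1. */
static inline uint32_t bsrr_word(uint32_t mask, uint16_t data)
{
    assert((mask >> 16) == 0);  /* mask must fit the 16 pins of one port */
    return (mask << 16) | (mask & data);
}

/* Hypothetical host-side model of what a BSRR write does to the output
 * register: reset bits clear pins, set bits win where both are given. */
static inline uint16_t apply_bsrr(uint16_t odr, uint32_t bsrr)
{
    uint16_t set   = (uint16_t)(bsrr & 0xFFFFu);
    uint16_t reset = (uint16_t)(bsrr >> 16);
    return (uint16_t)((odr & ~reset) | set);
}
```

With the port A mask 0x10FF (0b0001000011111111) and data 0xABCD, the combined word leaves the pins in the same state as writing the reset word and then the set word separately, while pins outside the mask are untouched.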

Also consider using DMA to shift values into BSRR. That is assuming you can calculate those in advance. Otherwise something like this can suffice.

If you feel a post has answered your question, please click "Accept as Solution".


9 REPLIES 9
berendi
Principal

Wow, from 81 ticks down to 69. Must be Valentine's Day. Is this the best you can do?

In all seriousness, that's a pretty impressive squeeze. I sincerely appreciate the time you took to read and reply. Top notch, thank you.

See below; am I doing this correctly/as you've anticipated?

//LCD.H
#define DATAOUT(i) { \
		GPIOA->BSRR = (maskA << 16) | (maskA & (i)); \
		GPIOB->BSRR = (maskB << 16) | (maskB & (i)); \
}
 
//LCD.C
const uint32_t maskA = 0b0001000011111111;
const uint32_t maskB = 0b1110111100000000;
 
static void WriteData(uint16_t data)
{
	elapsed_time_start(0);
	DATAOUT(data);		//@72Mhz - > 69 ticks (!!!)
	elapsed_time_stop(0);
}

@TDK "consider using DMA to shift values into BSRR"

Judging from preliminary reading, I have yet to find evidence that DMA has access to GPIO on the lower-end series (F1). I'm still looking into whether this can be done one way or another. Regardless, the improvement you brought is already significant, thanks for that (I owe you one!).

TDK
Guru

> See below; am I doing this correctly/as you've anticipated?

Yes.

> Wow, from 81 ticks down to 69.

Nice. That surely should execute in less than 69 ticks. My guess is there is significant overhead in the elapsed_time_start/stop you're using to call it.

Measure this:

elapsed_time_start(0);
DATAOUT(data);
DATAOUT(data);
elapsed_time_stop(0);

And then take the difference between that and your previous result to find the real cost of DATAOUT.

You're compiling with full optimization (-O3 or similar), yes?

Did you look at the disassembly?

> Judging from preliminary reading, I have yet to find evidence that DMA has access to GPIO on the lower-end series (F1).

Ahh, you may be right on this. I don't have much experience with the F1 series.


Note the temporary variable in the answer by @TDK, it should shave off a few more cycles.

Did you enable compiler optimizations?

Did you take the overhead of starting and stopping the counter into account?
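One possible reading of the "temporary variable" remark is to keep the masks as compile-time constants local to the write routine, instead of global const objects that an unoptimized build may load from memory. Below is a minimal sketch under that assumption. `gpio_port_t`, `LCD_MASK_A`, `LCD_MASK_B` and `lcd_data_out` are hypothetical names for illustration; on target you would use the real GPIO_TypeDef with GPIOA and GPIOB from the device header.

```c
#include <stdint.h>

/* Hypothetical stand-in for the vendor GPIO_TypeDef so the sketch builds
 * on a PC; only the BSRR member is modelled. */
typedef struct { volatile uint32_t BSRR; } gpio_port_t;

/* Bus bits carried by each port, taken from the masks posted above. */
#define LCD_MASK_A 0x10FFu  /* 0b0001000011111111 */
#define LCD_MASK_B 0xEF00u  /* 0b1110111100000000 */

/* One BSRR write per port: reset all bus pins, set those whose bit is 1. */
static inline void lcd_data_out(gpio_port_t *pa, gpio_port_t *pb, uint16_t data)
{
    const uint32_t d = data;  /* evaluate the argument once, widen once */
    pa->BSRR = (LCD_MASK_A << 16) | (LCD_MASK_A & d);
    pb->BSRR = (LCD_MASK_B << 16) | (LCD_MASK_B & d);
}
```

An inline function also sidesteps the usual macro pitfalls (an unparenthesized argument, or an argument with side effects being evaluated twice). Whether it actually saves cycles depends on the compiler and optimization level, so the disassembly is still the place to check.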

..."My guess is there is significant overhead in the elapsed_time_start/stop"

Well spotted. Unfortunately there is, but it comes with nice min/max features I quickly needed (not my code, btw).

Ref: https://www.embedded-computing.com/hardware/measuring-code-execution-time-on-arm-cortex-m-mcus

I need to keep this in place for benchmarking from now on :\ Started at 81, see the difference below!

elapsed_time_start(0);
	DATAOUT(data);
	DATAOUT(data);
elapsed_time_stop(0); // -> 96 ticks
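To make the differencing step explicit, here is a small sketch that splits the two readings into per-call cost and measurement overhead. It assumes both readings (69 ticks for one DATAOUT, 96 ticks for two) were taken with the same build settings and that the second call costs the same as the first, which may not hold exactly on real hardware. `timing_split_t` and `split_timing` are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t per_call;  /* ticks attributable to one DATAOUT */
    uint32_t overhead;  /* ticks attributable to start/stop itself */
} timing_split_t;

/* one_call  = overhead + 1 * per_call
 * two_calls = overhead + 2 * per_call */
static inline timing_split_t split_timing(uint32_t one_call, uint32_t two_calls)
{
    assert(two_calls >= one_call && 2u * one_call >= two_calls);
    timing_split_t r;
    r.per_call = two_calls - one_call;
    r.overhead = one_call - r.per_call;
    return r;
}
```

With the numbers above, split_timing(69, 96) gives roughly 27 ticks for DATAOUT itself and 42 ticks of start/stop overhead.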

..."You're compiling with full optimization (-O3 or similar), yes?"

Trying(!) to implement an RGB565 16-bit IPS display with CTP on a project with a $15 maximum budget, which is why I can't swing the F4/H7 series. It is meant to return real-time data from a load cell with basic HMI functions (set/reset/rel, etc.).

Let me figure out how to get the disassembly to say anything I can understand; it might take until tomorrow to learn how to properly debug those bad boys. I bet advanced debugging skills would be a good starting point 😉

Just found out what @TDK meant by "-O3". That sped up the whole thing by about 50%.

Thanks for the quick note on that one @berendi, it had me researching the topic of optimization to figure out what you guys meant.

Nice forum/members, cheers guys!!!

Mecanix
Senior

Quick update:

Just realized what @TDK meant by -O3. Down to 47 ticks now, from 81. All in all, roughly 20 ms to draw a full 240x320 frame over a 16-bit interface, or 50 fps if you prefer. Insane considering this runs on a $2 MCU...
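For anyone checking frame-time figures like these, the arithmetic can be sketched as below. It assumes one bus write per pixel and ignores the write strobe and any loop overhead, which the thread does not quantify. `frame_time_ms` and `ticks_per_pixel_budget` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Frame time in milliseconds for a given per-pixel cost in CPU ticks. */
static inline double frame_time_ms(uint32_t width, uint32_t height,
                                   double ticks_per_pixel, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)width * (double)height * ticks_per_pixel / clock_hz * 1000.0;
}

/* Ticks available per pixel if a whole frame must fit in frame_ms. */
static inline double ticks_per_pixel_budget(uint32_t width, uint32_t height,
                                            double frame_ms, double clock_hz)
{
    assert(width > 0 && height > 0);
    return frame_ms / 1000.0 * clock_hz / ((double)width * (double)height);
}
```

At 72 MHz, a 20 ms frame of 240x320 pixels leaves about 18.75 ticks per pixel, while 47 ticks per pixel would work out to about 50 ms per frame. The two figures are compatible only if the 47-tick reading still includes the start/stop overhead discussed earlier, which is plausible but not confirmed in the thread.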

I'll take out the -O3 setting until I clearly understand what sort of magic it does; it feels as if the kit will blow up as it is, lol.

Long live this forum and its generous contributors! Sincerely appreciated, guys, you fixed it.