Code Optimization

vbk22398 · ‎2024-09-23

Hi, I am using three different STM32 MCUs for our various projects, including the STM32F423, STM32L496 and STM32F407. I generally use the HAL for this code. Now I need to create HAL layers for these boards only, and I need to reduce the overhead of the generic HAL code generated by CubeIDE. Please suggest some solutions. What is an efficient and quick way to do this? Register-level code is too time-consuming and very error-prone, but I also need to keep the overhead down. Kindly help!

unsigned_char_array · ‎2024-10-09

@vbk22398 wrote:
My superior wants me to do it Bare Metal in register level, but I feel it is overwhelming as there are lots of registers and bit fields to be concerned about.

Why does your superior want that? Performance reasons? Or because your superior wants all code to be written in house without any third party code?

STM32CubeMX has HAL and LL.
LL is Low-Layer and basically only uses macros or inline functions to access registers directly, while HAL uses regular functions. In STM32CubeMX you can select per peripheral whether you want to use HAL or LL.
LL is less portable and harder to use. But I would call that bare metal.
My suggestion is to first get your code to work and then one-by-one rewrite the provided functions only if needed.
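To make the HAL-versus-LL difference concrete, here is a minimal sketch that builds on a PC. The register struct `mock_gpio_t` and the functions `my_gpio_set`, `my_gpio_reset` and `my_gpio_write` are hypothetical names made up for illustration, not the real ST API; on a real part the vendor headers provide the register layout.

```c
#include <stdint.h>

/* Hypothetical stand-in for a GPIO port so the sketch compiles on a PC. */
typedef struct {
    volatile uint32_t BSRR; /* low 16 bits set pins, high 16 bits reset pins */
} mock_gpio_t;

/* LL-style: static inline, so each call can collapse to a single store. */
static inline void my_gpio_set(mock_gpio_t *port, uint32_t pin_mask)
{
    port->BSRR = pin_mask;
}

static inline void my_gpio_reset(mock_gpio_t *port, uint32_t pin_mask)
{
    port->BSRR = pin_mask << 16;
}

/* HAL-style: one general-purpose function that decides at runtime
   what to do, at the cost of a call and a branch. */
void my_gpio_write(mock_gpio_t *port, uint32_t pin_mask, int state)
{
    if (state) {
        port->BSRR = pin_mask;
    } else {
        port->BSRR = pin_mask << 16;
    }
}
```

Rewriting one hot function this way, after measuring, is usually much less work than converting a whole driver.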

@vbk22398 wrote:
Also I don't know how to find "the things which have the biggest impact on speed."

Profiling. Measure the speed. One way is to set an IO pin before calling a function and clearing it afterwards. You can use a Logic Analyzer or an oscilloscope to measure the duration of the function. Using different IO pins for different functions can give you a nice visual overview of the timing. You can also use timers to measure duration of functions.
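The timer variant could look roughly like the sketch below. Everything here (`profiler_t`, `profile_call`, the fake tick source) is a hypothetical helper written for illustration; on the target, the tick source would be whatever free-running counter you have available, for example a hardware timer or the core's cycle counter.

```c
#include <stdint.h>

/* Returns the current value of some free-running counter. */
typedef uint32_t (*tick_fn)(void);

typedef struct {
    tick_fn  now;    /* tick source, supplied by the application */
    uint32_t last;   /* duration of the most recent call, in ticks */
    uint32_t worst;  /* longest duration seen so far, in ticks */
} profiler_t;

/* Time one call of fn and record the result. */
void profile_call(profiler_t *p, void (*fn)(void))
{
    uint32_t start = p->now();
    fn();
    /* Unsigned subtraction stays correct across one counter wrap. */
    uint32_t elapsed = p->now() - start;

    p->last = elapsed;
    if (elapsed > p->worst) {
        p->worst = elapsed;
    }
}

/* Fake tick source and workload so the sketch can run on a PC. */
uint32_t fake_ticks;
uint32_t fake_now(void)  { return fake_ticks; }
void     fake_work(void) { fake_ticks += 120u; }
```

Tracking the worst case as well as the last value matters for real-time code, where the occasional slow call is often the actual problem.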
Generally you want to avoid busy-waiting on things like peripherals. Example:
The UART sends "Hello world!" at 9600 baud, 8 data bits, 1 stop bit, no parity. That is 12 characters of 10 bits each (start + 8 data + stop), so 120 bits at 9600 baud should take 12.5 milliseconds. Usually the UART reports done while the last byte is still being sent, so it can report completion a little sooner. If the send function waits for the UART to finish, the function takes about 12.5 milliseconds. But you can instead check whether it is done sending with a separate function, and do other things in the meantime.
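The arithmetic and the "check later instead of waiting" idea can be sketched as follows. The names `uart_tx_time_us`, `tx_job_t`, `tx_start`, `tx_busy` and `tx_on_ready` are hypothetical and the data register is just a variable, so this is an illustration of the pattern, not a driver.

```c
#include <assert.h>
#include <stdint.h>

/* Microseconds needed to shift out n_bytes, rounded to nearest.
   Each frame is 1 start bit + data bits + parity bits + stop bits. */
uint32_t uart_tx_time_us(uint32_t n_bytes, uint32_t baud,
                         uint32_t data_bits, uint32_t parity_bits,
                         uint32_t stop_bits)
{
    assert(baud > 0u);
    uint32_t bits_per_frame = 1u + data_bits + parity_bits + stop_bits;
    uint64_t total_bits = (uint64_t)n_bytes * bits_per_frame;
    return (uint32_t)((total_bits * 1000000u + baud / 2u) / baud);
}

/* State of one transmission in progress. */
typedef struct {
    const char *buf;
    uint32_t    len;
    uint32_t    pos;
} tx_job_t;

/* Start a transfer and return immediately. */
void tx_start(tx_job_t *job, const char *buf, uint32_t len)
{
    job->buf = buf;
    job->len = len;
    job->pos = 0u;
}

/* Separate "is it done yet?" check the main loop can poll. */
int tx_busy(const tx_job_t *job)
{
    return job->pos < job->len;
}

/* Call whenever the transmitter can take another byte
   (on hardware this would typically be an interrupt handler). */
void tx_on_ready(tx_job_t *job, volatile uint32_t *data_reg)
{
    if (tx_busy(job)) {
        *data_reg = (uint8_t)job->buf[job->pos++];
    }
}
```

With 12 bytes at 9600 baud, 8 data bits, no parity and 1 stop bit, `uart_tx_time_us` gives 12500 microseconds, matching the 12.5 ms above. The main loop calls `tx_start` once and then polls `tx_busy` between other tasks.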

Kudo posts if you have the same problem and kudo replies if the solution works.
Click "Accept as Solution" if a reply solved your problem. If no solution was posted please answer with your own.


Andrew Neil · ‎2024-09-23

@vbk22398 wrote:
I need to reduce the overheads caused by generic HAL code generated by cube ide.

What "overheads", exactly, are you seeing?

Too much code space consumed?
Too much execution time consumed?
other?

What compiler optimisation level are you currently using?

mƎALLEm · ‎2024-09-23

Hello,

@vbk22398 wrote:

I need to reduce the overheads caused by generic HAL code generated by cube ide. Suggest me some solutions.!

Maybe use LL (Low-Layer drivers)? But not all peripherals have an LL driver. You need to check the HAL folder of your product.

In terms of development time it will be time-consuming, though probably less so than reinventing the wheel by writing your own register access.

To give better visibility on the answered topics, please click on "Accept as Solution" on the reply which solved your issue or answered your question.

vbk22398 · ‎2024-09-23

Yeah it is code space and execution time!

TDK · ‎2024-09-23

You can reduce code size by selecting the Release configuration, which will also speed up the code. That's your only option.

HAL should take up a small minority of the FLASH for those chips.

If you feel a post has answered your question, please click "Accept as Solution".

AScha.3 · ‎2024-09-23

Hi,

Make a new project with only the clock tree and the init for the hardware you use, but without USB, FAT, etc., to see how much space the HAL needs. (The USB stack, FatFs, etc. are not "HAL".)

On a G474 it needs about 3.34 KB (0.65%) of the 512 KB flash (with -O2).

If you think this is a waste of memory, write the CPU init yourself and be happy (or not). :)

+

Debug or Release changes nothing about code size, unless you set a different optimization level for each.

(Plain "Release" or "Run" just doesn't start the debugger after flashing the CPU.)

If you want the smallest code possible, use -Os (size optimized).

And unused HAL library code is never linked into the program anyway, so there is no need to worry about it.

(Unless you work on a CPU with 16 KB flash, but that is a different problem, if 50 cents more for the CPU matters.)


Andrew Neil · ‎2024-09-23

@TDK wrote:
That's your only option.

There are opportunities to optimise the HAL stuff; eg, consider the HAL_UART_Transmit/Receive_ functions:

At runtime, every single call to HAL_UART_Transmit/Receive_ checks the config to see if it's in 9-bit mode:

https://community.st.com/t5/stm32-mcus-products/stm32-uart-using-interrupt-transmitts-by-byte-or-data-elements/m-p/723323/highlight/true#M261621

If you never use 9-bit mode, you could remove all those tests.

That (among other things) might well give a useful speed-up in UART performance.

On these chips, unlikely to give a significant reduction in code size, though.
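As a rough illustration of that kind of specialisation, the sketch below compares a generic send routine that tests a 9-bit flag for every element with an 8-bit-only variant. It is not the actual HAL source: `mock_uart_t`, `uart_send_generic` and `uart_send_8bit` are hypothetical, and the mock omits the wait for the transmitter to become ready that real code needs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical UART stand-in so the sketch runs on a PC. */
typedef struct {
    volatile uint32_t DR;   /* mock data register */
    int      nine_bit;      /* runtime configuration flag */
    uint32_t writes;        /* demo only: counts elements written */
} mock_uart_t;

/* Generic: supports 8-bit and 9-bit frames, so it has to test the
   configuration at runtime for every element. */
void uart_send_generic(mock_uart_t *u, const void *data, size_t count)
{
    const uint8_t  *p8  = data;
    const uint16_t *p16 = data;

    for (size_t i = 0; i < count; i++) {
        if (u->nine_bit) {
            u->DR = p16[i] & 0x1FFu;
        } else {
            u->DR = p8[i];
        }
        u->writes++;
    }
}

/* Specialised: the project never uses 9-bit mode, so the test is gone. */
void uart_send_8bit(mock_uart_t *u, const uint8_t *data, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        u->DR = data[i];
        u->writes++;
    }
}
```

Whether the saving is worth maintaining a modified driver is exactly the thing to measure first.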

@vbk22398 that's why I asked. "What 'overheads', exactly, are you seeing?"

Have you done any analysis to find specifically what bloat and/or bottlenecks you have? Or is this just some general notion that HAL code must be slow & bloated?

You really need to quantify this first to determine whether the effort is actually worth it.

vbk22398 · ‎2024-10-08

@Andrew Neil Sir, thanks for your time and reply. My requirement is as follows. We are designing a single header file and a single driver file for all the sensors we have interfaced with the STM32. This file will contain all the interface code, exposed as wrapper functions. We fear that this single-file concept will take up all of our space if we build it on the HAL. The main function will be very minimal, and everything else will be handled by the code running behind main.

Andrew Neil · ‎2024-10-08

@vbk22398 wrote:
We are designing a single header file and a single driver file for all the sensors .

Why? What is the purpose of that?

That doesn't sound a very wise move.

@vbk22398 wrote:
we fear that this single file concept will take all of our space if we do it in HAL Layer.

So don't do it, then!

Whether it's a single file or multiple files probably won't make much difference.

One advantage of using multiple files is that it can give the Linker extra scope to omit unused sections.

Again, have you done any analysis to find specifically what bloat and/or bottlenecks you may or may not have? Or is this just some general notion that HAL code must be slow & bloated?

You really need to quantify this first to determine whether the effort is actually worth it.

unsigned_char_array · ‎2024-10-08

What are the hard and soft requirements for the performance (timing), power, and resource usage (RAM+ROM)? Hard real-time? Soft real-time? Best average performance? Timing diagrams are your friend!
Profiling, Profiling, Profiling: Measure your code's timing/performance.
If needed make optimizations.

Possible optimizations:

Architecture/structure of the code. (RTOS-/thread-based, loop-based, interrupt-based, limit call-depth, etc.)
Algorithms (different sorting algorithm for instance)
Compile optimizations
Link optimizations
micro-optimizations (restructure loops, reorder struct members, inline functions)
hardware acceleration (FPU, DMA, CRC peripheral, data cache, instruction cache, special instructions, etc.)
Tradeoffs

Possible tradeoffs:

Compile time vs run time. You can calculate some things at compile time (in C++ you can use constexpr; in C you can use macros and inline functions, and hopefully the compiler will evaluate them at compile time).
Boot time vs run-time performance. You can do a lot of calculations, such as a calibration, at boot time.
ROM vs time. Sometimes lookup tables can speed things up.
RAM vs time. Cache the results of certain calculations so they only have to be done once.
RAM vs time/FLASH. Run code from RAM instead of from FLASH.
precision vs time. What precision is required for your calculation? Sometimes float can be used instead of double. Or use fixed point instead of floating point.
Etc.
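One of these tradeoffs, sketched under assumptions: a CRC-8 (polynomial 0x07, initial value 0, chosen here only as an example) computed bit by bit, versus the same CRC using a 256-byte table that is filled once at boot. The function names are made up for illustration. The table costs RAM and boot time but replaces eight shift/XOR steps per byte with one lookup; declaring it as a precomputed const array would move that cost to ROM instead.

```c
#include <stddef.h>
#include <stdint.h>

/* 256-byte table, filled once at boot by crc8_init_table(). */
uint8_t crc8_table[256];

/* Slow but table-free: eight shift/XOR steps per input byte. */
uint8_t crc8_bitwise(const uint8_t *data, size_t len)
{
    uint8_t crc = 0u;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x80u) {
                crc = (uint8_t)((crc << 1) ^ 0x07u);
            } else {
                crc = (uint8_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Boot-time cost: precompute the CRC of every possible byte value. */
void crc8_init_table(void)
{
    for (int value = 0; value < 256; value++) {
        uint8_t byte = (uint8_t)value;
        crc8_table[value] = crc8_bitwise(&byte, 1u);
    }
}

/* Run-time gain: one table lookup per input byte. */
uint8_t crc8_lookup(const uint8_t *data, size_t len)
{
    uint8_t crc = 0u;

    for (size_t i = 0; i < len; i++) {
        crc = crc8_table[crc ^ data[i]];
    }
    return crc;
}
```

Both functions return the same value for the same input, so the choice is purely about where you want to spend memory and time. If the chip has a CRC peripheral that supports the polynomial you need, that is a third option.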
