Using Cortex-M7 DSP Instructions

Scott Gravenhorst · ‎2017-11-15

Posted on November 15, 2017 at 22:56

I would like to understand how I can use DSP instructions in the STM32F746NG. I know that I can code them in assembly language using __asm, but I was looking for a more automatic way, such as an optimization option for gcc that can recognize things like FIR or IIR filters. Does that exist?

Note: this post was migrated and contained many threaded conversations, some content may be missing.

AVI-crak · ‎2017-11-19

Posted on November 19, 2017 at 18:15

Interesting details.

In the C programming language there is no fixed-point type: no notation for it, no built-in type definition, and no syntax support for it. That may come soon, but not now.

What exists instead is a type definition, for example: typedef int16_t q15_t. To GCC this defines the storage size and nothing more. GCC does not know how to treat this type as fixed-point on its own, but it can if it is told how.
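To illustrate the point about q15_t being only a storage size: a minimal sketch of a Q15 multiply written by hand, assuming the usual Q15 convention (value = integer / 32768). The helper name q15_mul is hypothetical and is not taken from arm_math.h; the typedef is repeated here only so the snippet stands alone.

```c
#include <stdint.h>

/* To the compiler this is just a 16-bit integer; the Q15 meaning lives in our code. */
typedef int16_t q15_t;

/* Hypothetical helper: multiply two Q15 fractions and return a Q15 result. */
static inline q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* 32-bit product in Q30 */
    /* Round and rescale to Q15. Right-shifting a negative value is
       implementation-defined in C; GCC uses an arithmetic shift. */
    p = (p + (1 << 14)) >> 15;
    if (p > INT16_MAX) {                   /* only -1.0 * -1.0 can overflow */
        p = INT16_MAX;
    }
    return (q15_t)p;
}
```

With this convention, q15_mul(16384, 16384) multiplies 0.5 by 0.5 and yields 8192, i.e. 0.25. Writing a * b directly on two q15_t values would instead give the plain integer product, which is why the scaling has to be spelled out.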

https://gcc.gnu.org/onlinedocs/gccint/Types.html

https://gcc.gnu.org/onlinedocs/gcc/Fixed-Point.html#Fixed-Point

Obviously, a combination of these features could help GCC recognize fast DSP operations in the compiler's intermediate code, which is built from simple operations. This does not solve the syntax problem, but GCC would no longer treat q15_t with plain integer operations the way it treats int16_t. Instead, it would use the built-in functions on the M0-M3 and the hardware instructions on the M4-M7. But that is the theory; I was not able to find the magic combination - too many gaps.

AVI-crak · ‎2017-11-19

Posted on November 20, 2017 at 00:18

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1169.pdf

The header that provides the fixed-point types is stdfix.h. It ships with GCC itself.

Including this header causes a conflict with arm_math.h. In addition, the types int_hr_t, int_r_t, int_lr_t, int_hk_t, int_k_t, int_lk_t, and so on have to be configured. Configuring means explicitly stating the limits of the types used, the number of bits before and after the binary point, and the meaning of those bits. There are simple examples, but they are hard for me to understand.

However, GCC then starts to emit something hard to follow in the assembly output: a bunch of DSP instructions generously diluted with shift operations. Yes, DSP instructions have appeared, but their logic is difficult to understand.

Scott Gravenhorst · ‎2017-11-19

Posted on November 20, 2017 at 01:08

Interesting stuff there. In fact I had considered using fixed point as I had used it in projects I did for dsPIC (1.15). I was having performance problems on the STM32F746, so I decided to benchmark the difference between float and int32_t times. I did this to help decide whether it's worthwhile moving from float to fixed to increase performance. I tested only addition because this should be more difficult for float than multiply. The result of the benchmark was that int32_t and float execute adds in exactly the same amount of time. In the long run, my main performance problem was the use of -Ox. From that, I decided that moving to fixed point was not going to improve the application's performance, and judging from my use of fixed in C in earlier projects, the code becomes impossible to read. So without a performance increase, I felt changing to fixed was unnecessary and would lose a bit of numerical flexibility.

I've also been in conversation with another DSP developer who has done commercial projects with this exact same CPU. What he's told me is that he's never seen gcc produce float DSP CPU instructions placed in a binary. We also talked about the DSP library and he agreed with my opinion that the library is just a set of well written DSP functions (he also said that the functions may not even be all that well written, especially if they are moved from one CPU type to another). At this point in time, I'm not holding out hope for 'magical' insertion of float DSP instructions by gcc. What surprises me is that no other STM32 experts have jumped in on our discussion to settle this question. I also know another person who suggested that gcc could do this, but he's a college student with only one project worth of experience. This is a rather small statistical sample of opinion, so I'm not sure how significant it is.

For me, the question is simple - does gcc for arm-eabi have the ability to insert float DSP instructions or not. I'm getting the feeling the answer is 'no', but it would be nice to see something from ARM or Gnu GCC to confirm it one way or the other. Banging out these 'test' programs isn't helping me finish my projects... It seems odd that there are (apparently) so few people interested in this.

Tesla DeLorean · ‎2017-11-19

Posted on November 20, 2017 at 02:06

>>It seems odd that there are (apparently) so few people interested in this.

This isn't a high-traffic forum, and DSP is a bit niche. Those with much talent would be writing in assembler with a strong understanding of the pipeline, parallel operations and the algorithm. My needs are more 'Real Math' than DSP.

Compilers aren't good at recognizing 'algorithms', though they could be coded to recognize sequences or templates. Intel built some clever stuff for SSE/SIMD, where it could translate arrayed math operations. Compilers are also not good with the details of the math and how the numbers might challenge the precision vs magnitude constraints. Folding and reordering the math can also be problematic.

The original STM32F7 parts had the single precision FPU-S; thankfully some of the newer family members have the single/double precision FPU-D, so now we can do some 'Real Math' rather than something a tad faster than the M4F. Still, the FPU in these parts is not anywhere near as effective as the 80x87 and 6888x devices, and there still needs to be a lot of library code to polish the rough edges. The ARM FPUs don't hold intermediate values at higher precision levels.

Again assembler and a knowledge of where the precision and order is critical can allow for the most effective use of the hardware.


Scott Gravenhorst · ‎2017-11-19

Posted on November 20, 2017 at 02:42

>> Again assembler and a knowledge of where the precision and order is critical can allow of the most effective use of the hardware.

I fully agree. This all started for me because I didn't understand optimization for gcc (I'm pretty new to using gcc for embedded systems). My main problem is now resolved and there are plenty of clocks for features I want to add. However, this started a line of thinking about 'what if I really need more performance?' and so I thought I'd pursue the 'magical' gcc thing especially if it works. If it comes down to a need for more speed, I'll seriously consider writing at least certain functions in assembly language. Then again, there always seem to be new faster devices.

waclawek.jan · ‎2017-11-19

Posted on November 20, 2017 at 06:10

My colleague who does the core DSP spends more time on algorithmization of the problem and fine-tuning the solution to the actual requirements than on finding the maximum raw power of a given setup (including the compiler). DSP is all about that, except for quiche eaters who are lucky to have way more power at hand than is actually needed for the task. The maximum optimization setting of the compiler, with occasional tweaking of the particulars beyond Ox based on observing the result, is a natural prerequisite in our view; any convenience for the debugger (i.e. the person who fixes his own bugs) is irrelevant - besides, the algorithmic errors are mostly fixed in that algorithmization stage, often done in high-level environments (think Matlab and kin). 'Manual' fixed-point, i.e. with no native support from the compilation environment, has also always been seen as a natural must. asm may be an option, but IMHO on 32-bitters it would yield *significant* improvement over well-crafted C only in very specific cases.

I see gcc insert multiply-accumulate where appropriate and wouldn't consider that DSP-specific. Note that the CM7 is a superscalar dual-issue processor as far as 32-bit integers go, but not for the float/double coprocessor; if you found int32 to have the same performance as float, then you are doing serious load/store rather than math, and you should think more about that.
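A minimal sketch of the kind of plain C that gives the compiler a multiply-accumulate to work with: the inner loop of a fixed-point FIR filter. The function name fir_q15_dot and the Q15 scaling are assumptions for illustration. Which instruction gcc actually picks for the loop body depends on the compiler version and flags, so the disassembly is the only reliable check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical FIR kernel: dot product of ntaps Q15 coefficients and samples. */
int32_t fir_q15_dot(const int16_t *coeff, const int16_t *x, size_t ntaps)
{
    int64_t acc = 0;                       /* wide accumulator, no overflow for realistic tap counts */
    for (size_t k = 0; k < ntaps; ++k) {
        /* 16x16 -> 32-bit product added to the accumulator:
           a natural candidate for a multiply-accumulate instruction. */
        acc += (int32_t)coeff[k] * (int32_t)x[k];
    }
    /* Rescale from Q30 to Q15; relies on GCC's arithmetic right shift
       for negative values. The result is truncated, not saturated. */
    return (int32_t)(acc >> 15);
}
```

Keeping the samples and coefficients in local arrays or registers, rather than re-reading them through volatile or global variables, matters here for the load/store reason given above.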

As Clive said, general compilers are not very good at guessing the general algorithm and fine-tuning to it; that would be unnecessarily complicated and in many cases outright counterproductive.

The ultimate answer to your original question lies in studying and eventually modifying the gcc sources. gcc is open source, and the possibility of doing so is one of the few positive consequences; however, you don't sound ready to go that way ;)

JW

AVI-crak · ‎2017-11-20

Posted on November 20, 2017 at 08:42

Turvey.Clive.002

Hand-writing a small section of code in assembly language is not the goal; that stage has already been passed. The interest now is in getting GCC to use DSP instructions on its own. The main difficulty remains telling GCC what it is supposed to work with, and, if possible, doing so without conflicts.

Scott Gravenhorst · ‎2017-11-20

Posted on November 20, 2017 at 15:40

Thank you all for this very informative thread, it has helped me to understand the situation much better. At this point, I'm not motivated to modify gcc - not sure I could actually do an effective job of that anyway. I understand now that gcc can in fact insert 'special' instructions, but this is not directed by an algorithm to 'divine' DSP methods written in C, rather it's a simple matter of optimization using the available instructions. For my own work, I will write the best C code I can and perhaps I will write functions in assembly language where it can have a positive effect on execution time. Again I thank all contributors to this thread.

Scott Gravenhorst · ‎2018-01-08

Posted on January 08, 2018 at 16:32

I've done more work with this and today discovered that a filter such as:

z = a0 * in + b1 * z

does cause gcc to generate one vfma.f32 instruction under the right conditions. I've not fully decoded the disassembly output, but it looks like it is part of the compiled filter. I also discovered that the vfma.f32 instruction is used only if the filter is inside a loop. I did not try this with arrays, just single variables as shown above.

So it appears that under certain conditions, the gcc compiler will use the DSP instructions when it sees fit. Among those conditions are that -O2 or -O3 optimization must be specified.
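A minimal sketch of the filter above placed inside a loop, for anyone who wants to repeat the experiment. The function name one_pole_run and the use of input and output arrays are assumptions; the test described above used single variables only. Building with something like arm-none-eabi-gcc -O2 -mcpu=cortex-m7 -mfpu=fpv5-sp-d16 -mfloat-abi=hard -S and searching the assembly for vfma.f32 is one way to check. Whether the fused instruction appears may also depend on the -ffp-contract setting and the compiler version.

```c
#include <stddef.h>

/* Hypothetical wrapper around the filter from the post: z = a0 * in + b1 * z */
float one_pole_run(const float *in, float *out, size_t n,
                   float a0, float b1, float z)
{
    for (size_t i = 0; i < n; ++i) {
        /* The multiply feeding an add is the pattern a compiler
           can contract into a fused multiply-add. */
        z = a0 * in[i] + b1 * z;
        out[i] = z;
    }
    return z;   /* final filter state, so a later call can continue from it */
}
```

Returning the state lets the caller process a stream block by block without keeping z in a global variable.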