Using Cortex-M7 DSP Instructions

Scott Gravenhorst · ‎2017-11-15

Posted on November 15, 2017 at 22:56

I would like to understand how I can use DSP instructions in the STM32F746NG. I know that I can code them in assembly language using __asm, but I was looking for a more automatic way, such as an optimization option for gcc that can recognize things like FIR or IIR filters. Does that exist?

Note: this post was migrated and contained many threaded conversations, some content may be missing.

Scott Gravenhorst · ‎2018-01-08

Posted on January 08, 2018 at 16:32

I've done more work with this and today discovered that a filter such as:

z = a0 * in + b1 * z

does cause gcc to generate one vfma.f32 instruction under the right conditions. I've not fully decoded the disassembly output, but it looks like it is part of the compiled filter. I also discovered that the vfma.f32 instruction is used only if the filter is inside a loop. I did not try this with arrays, just single variables as shown above.

So it appears that under certain conditions, the gcc compiler will use the DSP instructions when it sees fit. Among those conditions are that -O2 or -O3 optimization must be specified.
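A minimal sketch of the kind of loop described above, written over a block of samples. The function name one_pole and its parameters are hypothetical, and the build line is an assumption (something like arm-none-eabi-gcc -O2 -mcpu=cortex-m7 -mfpu=fpv5-sp-d16 -mfloat-abi=hard). Whether vfma.f32 actually shows up depends on the compiler version and flags, so the disassembly is the only reliable check.

```c
#include <stddef.h>

/* One-pole filter z = a0 * in + b1 * z applied to a block of samples.
   one_pole is a hypothetical name used only for this illustration. */
float one_pole(const float *in, float *out, size_t n,
               float a0, float b1, float z)
{
    for (size_t i = 0; i < n; i++) {
        /* Multiply-add form that the compiler may contract into a fused MAC */
        z = a0 * in[i] + b1 * z;
        out[i] = z;
    }
    return z;  /* final filter state, so the caller can continue the stream */
}
```

Returning the state lets the caller feed it back in for the next block instead of keeping it in a global.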


waclawek.jan · ‎2017-11-15

Posted on November 16, 2017 at 00:51

No, AFAIK.

There are libraries available straight from ARM, supposedly optimal. See AN4841.

JW

AVI-crak · ‎2017-11-15

Posted on November 16, 2017 at 03:15

GCC can use the built-in DSP functions on its own, but the code has to be written explicitly for that. Instead of writing the variables in the simple form a += b * c, you have to use their addresses, roughly a[x] += b[x] * c[x]. The operation must run in a loop, to rule out optimization by GCC. Otherwise, it loads the values of the variables into registers and performs separate operations, which is much easier for it.

Almost all DSP functions work with indirect addressing, in the sense that an address is used to load data - but not the data already in the registers. This must be taken into account.

Fractional (fixed-point) numbers are a different story. GCC cannot work with types such as q15_t automatically; it does not even understand what they are. Therefore, all operations on such numbers must go through the built-in libraries, which does not always give a benefit in speed and quality.

Good results can be achieved with a hand-written algorithm, which almost always means pure assembler.
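To illustrate why the compiler cannot treat q15_t as anything special: it is just a 16-bit integer, and the fixed-point meaning lives entirely in how the code shifts and saturates. The sketch below is plain portable C, not the library's implementation. The typedef is assumed to mirror the CMSIS-DSP one, and sat_q15 and q15_mul are hypothetical helper names.

```c
#include <stdint.h>

typedef int16_t q15_t;  /* assumed to mirror the CMSIS-DSP typedef */

/* Saturate a 32-bit value to the Q15 range [-32768, 32767]. */
static q15_t sat_q15(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (q15_t)x;
}

/* Q15 multiply: the 32-bit product is in Q30, so shift right by 15
   to return to Q15. Relies on arithmetic right shift of negative
   values, which is what gcc does. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    return sat_q15(p >> 15);
}
```

To the compiler this is ordinary integer arithmetic, so it has no reason to pick a saturating or packed DSP instruction for it unprompted.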

Scott Gravenhorst · ‎2017-11-16

Posted on November 16, 2017 at 17:42

I tried this just now. Here are the loops I wrote (all variables except j are float, j is uint32_t):

for (j = 0; j < 4; j++)
{
    z[j] = a0[j] * input[j] + b1[j] * z[j];
}

and this:

for (j = 0; j < 4; j++)
{
    z[j] += a0[j] * input[j];
}

I used:

arm-none-eabi-objdump -d -S main.o

to get an assembly listing. Both of the above loops compiled without error, but there were no DSP instructions found.

AVI-crak · ‎2017-11-17

Posted on November 17, 2017 at 18:42

gcc-arm-none-eabi-6-2017-q2 ?

Scott Gravenhorst · ‎2017-11-17

Posted on November 17, 2017 at 19:09

arm-none-eabi-gcc-7.1.0

This is running under 64 bit Fedora 25 (which I update daily).

I assume I should be looking for vmla.f32 instructions (this is the floating point single precision multiply accumulate instruction)?

AVI-crak · ‎2017-11-17

Posted on November 17, 2017 at 20:39

I could not find any vmla.f32 instructions in my code, but there are a lot of vfma.f32. Previously, everything was wonderful. And now I see the problem.

Scott Gravenhorst · ‎2017-11-17

Posted on November 17, 2017 at 21:05

I had saved the assembly listing and grep-ed for vfma.f32 and found none. In fact, there are only four .f32 instructions seen - vmov.f32, vsub.f32, vmul.f32 and vadd.f32.

What do you see as the problem?

I am still looking at those two instructions to see what the difference is. It looks like vfma does not round before accumulation, but vmla doesn't mention rounding in the instruction description in the programming reference PDF.

AVI-crak · ‎2017-11-17

Posted on November 17, 2017 at 23:34

I have a lot of floating-point math, and most of this code was written by other people. By the law of probability, the vmla instruction should have been used at least once.

I'm also looking for a more precise definition of the instructions used. PM0253 is almost completely copied from the ARM website, and everything is just as foggy there.

Until now, I was completely satisfied with the work of GCC ...

Scott Gravenhorst · ‎2017-11-17

Posted on November 17, 2017 at 23:58

One thing I would like to try is to use a function from the DSP library that could use a MAC and see if that results in vmla or vfma instructions. Looking at the DSP library code, I don't see how that magic can happen because the code is entirely C with absolutely no __asm statements. If that code compiles using real DSP instructions (and I do not know that it does), then I want to know what is being done in the C code that causes gcc to do that. I don't see anything special in terms of defined macros or other details that would evoke such a response from gcc.
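For reference, this is the sort of plain C multiply-accumulate loop in question. It is a sketch, not the DSP library's source, and fir_sample is a hypothetical name. Nothing in it asks for a MAC instruction. If one appears in the output, it is because the compiler is allowed to contract acc += x * y into a fused multiply-add for the selected FPU, which I believe gcc permits by default in its GNU C mode (an assumption worth checking against -ffp-contract for your version).

```c
#include <stddef.h>

/* One FIR output sample: the sum of coeffs[k] * history[k] over all taps.
   fir_sample is a hypothetical name; this is not the CMSIS-DSP source. */
float fir_sample(const float *coeffs, const float *history, size_t taps)
{
    float acc = 0.0f;
    for (size_t k = 0; k < taps; k++) {
        /* Candidate for contraction into a single fused multiply-add */
        acc += coeffs[k] * history[k];
    }
    return acc;
}
```

Compiling this with and without optimization and comparing the objdump listings is a quick way to see whether the loop body turns into vfma.f32 or stays as vmul.f32 plus vadd.f32.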