2017-11-11 02:27 PM
Are there optimized basic math functions available for the STM8, such as 8 bit by 8 bit multiply, 8 bit by 16 bit multiply, and similar?
My application has a time critical 8 bit by 16 bit multiply. The Cosmic compiler seems to always default to a 16 bit by 16 bit multiply, which is slower. I wrote an inline assembly macro that runs in about 2/3 the time of the compiler's output, but it was very tedious to write. I would rather not do this for every math function.
Any helpful information would be appreciated.
I include my macro here, in case anyone else finds it useful:
// macro to perform an optimized UI8 by UI16 multiply:
// uint16_t RESULT_UI16 = (uint8_t)X_UI8 * (uint16_t)Y_UI16;
// Note: All arguments must be declared '@tiny'
// macro assumes that no overflow occurs
// '_asm()' will load an 8 bit argument into reg A or a 16 bit argument into reg X
#define MULT_8x16(X_UI8, Y_UI16, RESULT_UI16) {\
    _asm("LDW Y,X\n SWAPW X\n", (uint16_t)Y_UI16);                   /* Y = Y_UI16; XL = high byte */ \
    _asm("MUL X,A\n SWAPW X\n PUSHW X\n LDW X,Y\n", (uint8_t)X_UI8); /* A = X_UI8; push high partial << 8; reload Y_UI16 */ \
    _asm("MUL X,A\n ADDW X,($1,SP)\n POPW Y\n");                     /* add low partial; pop temp */ \
    _asm("CLRW Y\n LD YL,A\n LDW (Y),X\n", (@tiny uint16_t*)(&RESULT_UI16)); /* store result via @tiny pointer */ \
}
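In case it helps, a minimal usage sketch (the variable names and values are mine; per the note above, all three variables are declared '@tiny'):

// hypothetical example: scale a 16 bit reading by an 8 bit gain
@tiny uint8_t  gain    = 25;
@tiny uint16_t reading = 1023;
@tiny uint16_t scaled;

MULT_8x16(gain, reading, scaled);   // scaled = 25 * 1023 = 25575 (fits in 16 bits)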
#compiler #math
2017-11-14 06:20 AM
When I did the STM8 compiler comparison at http://www.colecovision.eu/stm8/compilers.shtml in mid-2016, IAR generated the fastest (but also biggest) code for the Whetstone, Dhrystone and Coremark benchmarks. Some development versions of SDCC can generate even faster code for Dhrystone and Coremark (but not Whetstone); however, there has not been an SDCC release since SDCC 3.6.0 in mid-2016. Typically, which compiler optimizes best will depend on your code and optimization goals.
Philipp
2017-11-14 06:47 AM
Excuse me, I did not want to advertise for IAR. Their computational logic carries too much context, in the form of virtual registers. In some cases this is not desirable, but their implementation is excellent.
2017-11-14 06:56 AM
1. My original post was not a criticism of the compiler, but a call for information.
2. You and I will have to disagree. I think that running in 2/3 of the cycles for an 8x16->16 versus a 16x16->16 multiply IS 'much faster'.
3. You say 8x16 'is not common'; I would say the code you looked at had little to no fixed point math, which is very common in embedded systems (see the sketch after this list). I also suspect the code you looked at had little to do with 8 bit processors: an RTOS would not likely be found on an STM8, and functions like trig and log are avoided if at all possible. And 'arrays of structures' in 1K-2K of RAM? Really?
I will repeat what I said before: why would 8x16 multiplies be less common? Embedded systems deal with real world systems, and in the real world numbers come in every possible scale and precision. Yes, you can always use higher precision, but on an 8 bit CPU you will quickly run out of RAM, or CPU cycles, or both.
4. And finally, who cares if it is less common - it is still common enough. Also, processing captured data is likely to be the most time critical part of any embedded code.
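To make the fixed point case concrete, here is the kind of operation I mean (a sketch with an illustrative Q8 scale factor, not code from my actual application):

#include <stdint.h>

// Scale a raw 16 bit sample by an 8 bit Q8 calibration factor
// (0x80 represents 0.5, 0xFF is just under 1.0). The core operation
// is an 8x16 multiply; the intermediate is shifted back down to Q0.
uint16_t scale_q8(uint16_t raw, uint8_t factor_q8)
{
    uint32_t product = (uint32_t)raw * factor_q8; /* 8x16 multiply */
    return (uint16_t)(product >> 8);              /* drop the Q8 fraction */
}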
2017-11-14 07:29 AM
UPDATE:
This new macro seems to provide the same performance with the Cosmic compiler as the '_asm()' macro in my original post, but is more portable. Also, I think my original time measurements were off: both of these macros may be as much as 2 times faster than the compiler's standard output.
// macro to perform an optimized UI8 by UI16 multiply:
// uint16_t RESULT_UI16 = (uint8_t)X_UI8 * (uint16_t)Y_UI16;
// Note: All arguments should be declared '@tiny'
// macro assumes that no overflow occurs
#define MULT_8x16(X_UI8, Y_UI16, RESULT_UI16) {\
    RESULT_UI16 = (((uint8_t)((uint16_t)Y_UI16 >> 8) * (uint8_t)X_UI8) << 8) \
                + ((uint8_t)Y_UI16 * (uint8_t)X_UI8); \
}
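As a quick sanity check, the macro can be exercised on a host (assumptions: C99 with stdint.h; the '@tiny' qualifier is Cosmic specific and omitted here):

#include <assert.h>
#include <stdint.h>

void test_mult_8x16(void)
{
    uint8_t  x = 11;
    uint16_t y = 3300;
    uint16_t r;

    MULT_8x16(x, y, r);                 /* optimized 8x16 -> 16 path */
    assert(r == (uint16_t)(x * y));     /* matches the plain multiply: 36300 */
}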
2017-11-14 07:45 AM
Yes, arrays of structs are common. The structs might not be that large, and the arrays might not have that many elements, but there are arrays of structs.
To quickly get some data, I just compiled Atomthreads (AFAIK the only RTOS that has ports for all 4 STM8 C compilers) from git using SDCC, with --fverbose-asm (added to the options that the Atomthreads developers chose).
Using grep on the generated asm, I see a total of 32 multiplications for which code is generated or a support function is called (I guess some more multiplications were already optimized away earlier, e.g. multiplication by a power of 2 turned into a shift).
There are 4 calls to the support function for 32x32->32 multiplications.
There are 28 places where code is generated for a multiplication of 8 or 16 bits by a constant, with a result of 8 or 16 bits. I didn't check all of them, only about half a dozen; all of those turned out to be from accesses to arrays of structs.
Apparently, there are zero 8x16->16 multiplications with non-constant operands.
Of course the multiplications in the RTOS will often not be representative of the multiplications in the whole application. And the type and number of multiplications will vary on a per-application basis.
But if your code depends heavily on the speed of 8x16->16 multiplication, you (and others in the same situation) need to make compiler developers aware of that.
Philipp
2017-11-14 08:41 AM
'The structs might not be that large, and the arrays might not have that many elements, but there are arrays of structs.'
If you had an array of 300 structs that were 11 bytes in size, wouldn't an 8x16->16 multiply be the most efficient way of indexing them?
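For what it's worth, a sketch of exactly that case (the struct layout is illustrative; 11 bytes assumes no padding, which holds on the STM8 where everything is byte aligned):

#include <stdint.h>

typedef struct {
    uint8_t  id;
    uint16_t value;
    uint8_t  payload[8];
} record_t;                 /* 11 bytes with no padding */

record_t table[300];        /* 300 * 11 = 3300 bytes */

uint16_t record_value(uint16_t i)   /* i can exceed 255, so 16 bit */
{
    /* table[i] needs i * sizeof(record_t): a 16 bit index times the
       8 bit constant 11, i.e. an 8x16 -> 16 multiply */
    return table[i].value;
}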
2017-11-16 02:18 AM
Hello,
this looks like a good optimization to implement: we'll check some details and report back here soon.
Regards,
Luca (Cosmic)
2017-11-16 01:20 PM
This optimization will be implemented in the next release of the compiler (no due date yet, probably a couple of months) in the form of a library routine that comes down almost exactly to the C macro mentioned above. This means that for the absolute best speed the macro will still be the best solution (because it is inlined), but using it too many times will make the code bigger. Conversely, it also means that code which did not need this much speed for this special multiplication will end up a few bytes bigger than before (but this can be avoided by using casts to force the 16x16 multiplication).
As to why we did not implement this before, Philipp already gave the biggest part of the answer: since this is not the most used kind of operation and no one asked for it before, we simply preferred to favor code size rather than speed. This used to be the standard choice for 8 bit micros in the past, but we see it slowly changing to a more balanced approach between size and speed, so if there are other suggestions for similar improvements, don't hesitate to let us know and we will evaluate them on a case by case basis.
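To illustrate the cast idea, something along these lines (a sketch with illustrative names; the exact behavior may differ in the release):

uint8_t  a;
uint16_t b;

uint16_t fast  = a * b;             /* eligible for the new 8x16 routine */
uint16_t small = (uint16_t)a * b;   /* widened operand forces the plain 16x16 multiply */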
2017-11-16 03:04 PM
That's great, Luca, thank you.
BTW, do you know of any white papers or app notes about writing faster code with Cosmic Compiler?
2017-11-16 11:04 PM
It would be nice if there were also a way to generate optimal code for multiplication by constants.
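For example (a generic strength reduction sketch, not a claim about what any of these compilers emit today), multiplying by the constant 11 can be reduced to shifts and adds:

#include <stdint.h>

/* x * 11 = x * (8 + 2 + 1) = (x << 3) + (x << 1) + x */
uint16_t mul11(uint16_t x)
{
    return (uint16_t)((x << 3) + (x << 1) + x);
}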