2019-08-30 09:14 AM
I am trying to measure instruction throughput with the following snippet:
float bench(int tries, int repeat) {
    uint32_t t0, t1, d = 0;
    unsigned A0, A1, A2, A3, A4, A5, A6, A7, B = 1;
    A0 = A1 = A2 = A3 = A4 = A5 = A6 = A7 = 0;
    for (int i = 0; i < tries; ++i) {
        t0 = DWT->CYCCNT; // read cycle counter
        for (int j = 0; j < repeat; j += 8) {
            // make the values unpredictable to the compiler
            asm ("":"+r"(A0), "+r"(A1), "+r"(A2), "+r"(A3), "+r"(A4), "+r"(A5), "+r"(A6), "+r"(A7), "+r"(B));
            // execute the binary operator on independent data
            A0 = bin(A0, B);
            A1 = bin(A1, B);
            A2 = bin(A2, B);
            A3 = bin(A3, B);
            A4 = bin(A4, B);
            A5 = bin(A5, B);
            A6 = bin(A6, B);
            A7 = bin(A7, B);
            // avoid dead store elimination
            asm (""::"r"(A0), "r"(A1), "r"(A2), "r"(A3), "r"(A4), "r"(A5), "r"(A6), "r"(A7));
        }
        t1 = DWT->CYCCNT;
        // keep the fastest run (d == 0 means no measurement yet)
        if (t1 - t0 < d || d == 0) d = t1 - t0;
    }
    return (float)d / (float)repeat;
}
/* ... */
float time = bench(10, 10000);
Here `bin` can be, for example, `ADD` or `ORR`, defined as follows:
static inline unsigned ADD(unsigned a, unsigned b) {
    asm("ADD %0, %1":"+r"(a):"r"(b));
    return a;
}
static inline unsigned ORR(unsigned a, unsigned b) {
    asm("ORR %0, %1":"+r"(a):"r"(b));
    return a;
}
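A side note on the timing logic in `bench`: `t1 - t0` stays correct even if the 32-bit cycle counter wraps between the two reads, because unsigned subtraction is performed modulo 2^32. Below is a minimal host-side sketch of that logic. The helper names `cycle_delta`, `keep_min` and `cycles_per_op` are hypothetical and only mirror what the benchmark does inline; they are not part of the original code.

```c
#include <stdint.h>

/* Hypothetical helper: elapsed cycles between two reads of a free-running
 * 32-bit counter. The unsigned result wraps modulo 2^32, so it is correct
 * as long as fewer than 2^32 cycles elapsed between the two reads. */
uint32_t cycle_delta(uint32_t t0, uint32_t t1) {
    return t1 - t0;
}

/* Hypothetical helper: keep the smallest measurement seen so far, using 0
 * as the "no measurement yet" marker, like `d` in bench(). */
uint32_t keep_min(uint32_t best, uint32_t sample) {
    return (best == 0 || sample < best) ? sample : best;
}

/* Hypothetical helper: average cost of one operation, given the best run
 * and the number of operations executed in that run. */
float cycles_per_op(uint32_t best, int repeat) {
    return (float)best / (float)repeat;
}
```

Taking the minimum over several tries filters out runs disturbed by interrupts or cold caches, which is presumably why the benchmark keeps only the fastest one.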
When I measure the throughput of ADD, I get roughly 0.69 cycles per addition, which is reasonably close to the expected 2 instructions per cycle on this architecture.
However, when I try the same thing with ORR, I get 2.25 cycles per bitwise OR. That is about 2 cycles per instruction, so roughly 4 times slower than expected!
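To put the two figures side by side, here is a small sketch of the arithmetic, assuming an ideal rate of 2 instructions per cycle (dual issue). The function names are hypothetical. Keep in mind that the measured cost also includes the loop overhead (counter update, compare, branch) spread over the 8 operations of each iteration.

```c
/* Convert a measured cost in cycles per instruction into
 * instructions per cycle (IPC). */
double ipc_from_cpi(double cycles_per_instr) {
    return 1.0 / cycles_per_instr;
}

/* Slowdown relative to an assumed ideal of `ideal_ipc` instructions per
 * cycle: measured cycles per instruction divided by (1 / ideal_ipc). */
double slowdown_vs_ideal(double cycles_per_instr, double ideal_ipc) {
    return cycles_per_instr * ideal_ipc;
}
```

With the figures quoted above, 0.69 cycles per ADD corresponds to roughly 1.45 instructions per cycle, while 2.25 cycles per ORR corresponds to roughly 0.44 instructions per cycle, i.e. 4.5 times the ideal cost of 0.5 cycles.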
I would have thought that bitwise instructions are eligible for dual issue and have a latency of 1 cycle, but this is clearly not the case in my benchmark.
The assembly generated is correct.
I don't get why I observe such a slowdown for ORR, and I was unable to find any documentation about this.
Any help with this would be really appreciated.
2019-08-30 10:37 AM
Are you sure C did not get in the way? Did you look at the disassembly? Are you sure you are not running into data/code/branch alignment or caching issues? Maybe it would be better to code the whole benchmark in asm?
I know of no detailed description of the CM7 pipeline/core. As it is still a microcontroller-oriented core, and thus severely area-restricted, it may be much simpler than you'd think and significantly less capable than its "full-fledged" "adult" counterparts.
JW
2019-08-31 02:17 AM
Well, I checked the generated assembly and it looks exactly as I expect. In both cases, 32-bit instructions were used, so I don't expect the instruction alignment to differ between the two cases.
Everything is in registers in this benchmark, so there are no data alignment or cache issues. By the way, the data and instruction caches are enabled.
I don't think it is useful to code the whole benchmark in assembly, as the generated assembly already looks as I expect.
The thing that is really surprising is that it can perform 2 additions per cycle (as advertised), but is only able to process 1 bitwise OR every 2 cycles, despite the fact that a bitwise OR is much simpler than an addition.
2019-08-31 03:11 AM
Interesting... I'm getting 0.69 and 0.75 respectively on an STM32F767ZI at 216 MHz with GCC at -O2 and ART, prefetch and I/D caches ON.
2019-09-04 02:49 AM
I redid the benchmark completely from scratch in a new project, and now I measure the correct throughput for both ADD and ORR. I still don't understand what happened, but I now consider the issue solved.
Sorry for the disturbance and thank you for the feedback.