weird instruction throughput on STM32F756

lemaitre · 2019-08-30

I am trying to measure instruction throughput with the following snippet:

float bench(int tries, int repeat) {
  uint32_t t0, t1, d = 0;
  unsigned A0, A1, A2, A3, A4, A5, A6, A7, B = 1;
  A0 = A1 = A2 = A3 = A4 = A5 = A6 = A7 = 0;
 
  for (int i = 0; i < tries; ++i) {
    t0 = DWT->CYCCNT; // read cycle counter
 
    for (int j = 0; j < repeat; j += 8) {
      // make the values unpredictable to the compiler
      asm ("":"+r"(A0), "+r"(A1), "+r"(A2), "+r"(A3), "+r"(A4), "+r"(A5), "+r"(A6), "+r"(A7), "+r"(B));
 
      // Execute the binary operator on independent data
      A0 = bin(A0, B);
      A1 = bin(A1, B);
      A2 = bin(A2, B);
      A3 = bin(A3, B);
      A4 = bin(A4, B);
      A5 = bin(A5, B);
      A6 = bin(A6, B);
      A7 = bin(A7, B);
 
      // avoid dead store elimination
      asm (""::"r"(A0), "r"(A1), "r"(A2), "r"(A3), "r"(A4), "r"(A5), "r"(A6), "r"(A7));
    }
 
    t1 = DWT->CYCCNT;
    if (t1 - t0 < d || d == 0) d = t1 - t0;
  }
  return (float)d / (float)repeat;
}
 
/* ... */
float time = bench(10, 10000);

where `bin` can be, for example, `ADD` or `ORR`, defined as follows:

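static inline unsigned ADD(unsigned a, unsigned b) {
  // a = a + b, forced to a single ADD instruction
  asm("ADD %0, %1":"+r"(a):"r"(b));
  return a;
}
static inline unsigned ORR(unsigned a, unsigned b) {
  // a = a | b, forced to a single ORR instruction
  asm("ORR %0, %1":"+r"(a):"r"(b));
  return a;
}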

When I measure the throughput of ADD, I get roughly 0.69 cycles per addition, which is close to the expected 2 instructions per cycle (0.5 cycles per instruction) on this architecture.

However, when I try the same thing with ORR, I get 2.25 cycles per bitwise OR. That is close to 2 cycles per instruction, so about 4 times slower than expected!
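To make the comparison between these two figures explicit, here is a minimal C sketch of the arithmetic. The function names and the `IDEAL_CYCLES_PER_OP` constant are hypothetical helpers for illustration, not part of the benchmark; the 0.5 value simply restates the "2 instructions per cycle" expectation above.

```c
#include <assert.h>

/* Ideal cost of one operation if the core sustains two per cycle
   (the dual-issue expectation stated in this post). */
#define IDEAL_CYCLES_PER_OP 0.5

/* Convert a measured cost in cycles per operation into operations per cycle. */
double ops_per_cycle(double cycles_per_op) {
    assert(cycles_per_op > 0.0);
    return 1.0 / cycles_per_op;
}

/* How many times slower than the dual-issue ideal a measurement is. */
double slowdown_vs_ideal(double cycles_per_op) {
    assert(cycles_per_op > 0.0);
    return cycles_per_op / IDEAL_CYCLES_PER_OP;
}
```

With the measurements above, 0.69 cycles per ADD is about 1.45 operations per cycle (1.38 times the ideal cost), while 2.25 cycles per ORR is about 0.44 operations per cycle (4.5 times the ideal cost).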

I would expect bitwise instructions to be eligible for dual issue and to have a latency of 1 cycle, but this is clearly not what my benchmark shows.

The generated assembly is correct.

I don't get why I observe such a slowdown for ORR, and I was unable to find any documentation about this.

Any help with this would be really appreciated.
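For readers wondering whether the timing logic itself could be at fault: the `t1 - t0` subtraction in `bench()` stays valid even if the 32-bit cycle counter wraps between the two reads, because unsigned arithmetic in C is modulo 2^32. The sketch below illustrates this on plain integers. `elapsed_cycles` and `best_elapsed` are hypothetical names, and unlike `bench()` the minimum is seeded with the first sample instead of using `d == 0` as a sentinel.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles between two readings of a free-running 32-bit counter.
   Unsigned subtraction wraps modulo 2^32, so the result is still
   correct if the counter overflowed once between t0 and t1. */
uint32_t elapsed_cycles(uint32_t t0, uint32_t t1) {
    return t1 - t0;
}

/* Smallest elapsed time over n (start, end) pairs, mirroring the
   "keep the minimum over several tries" logic of bench(). */
uint32_t best_elapsed(const uint32_t *t0, const uint32_t *t1, int n) {
    assert(n > 0);
    uint32_t best = elapsed_cycles(t0[0], t1[0]);
    for (int i = 1; i < n; ++i) {
        uint32_t d = elapsed_cycles(t0[i], t1[i]);
        if (d < best) best = d;
    }
    return best;
}
```

This only holds as long as a single measured interval is shorter than 2^32 cycles, which is comfortably the case for a loop of 10000 operations.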

waclawek.jan · 2019-08-30

Are you sure C did not get in the way? Did you look at the disassembly? Are you sure you are not running into data/code/branch alignment or caching issues? Maybe it would be better to code the whole benchmark in asm?

I know of no detailed description of the CM7 pipeline/core. As it is still a microcontroller-oriented core, and thus severely area-restricted, it may be much simpler than you'd think and significantly less capable than its "full-fledged" "adult" counterparts.

JW

lemaitre · 2019-08-31

Well, I checked the generated assembly and it looks exactly as I expect. In both cases, 32-bit instructions were used, so I don't expect the instruction alignment to differ between the two cases.

Everything is in registers in this benchmark, so there is no data alignment or cache issue. By the way, the data and instruction caches are enabled.

I don't think it is useful to code the whole benchmark in assembly, as the generated assembly is already what I expect.

The really surprising thing is that it can perform 2 additions per cycle (as advertised), but can only process 1 bitwise OR every 2 cycles, even though a bitwise OR is much simpler than an addition.

Piranha · 2019-08-31

Interesting... I'm getting 0.69 and 0.75 respectively on an STM32F767ZI at 216 MHz, with GCC at -O2 and with ART, prefetch and I/D caches ON.
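One possible reading of why these figures sit above the ideal 0.5: each iteration of the inner loop also executes some bookkeeping instructions (counter update, compare, branch) next to the 8 benchmarked operations. The sketch below is a rough consistency check only. The overhead counts are assumptions, since no disassembly was posted in this thread, and `modeled_cycles_per_op` is a hypothetical helper that ignores every pipeline detail except issue width.

```c
#include <assert.h>

/* Estimated cycles per benchmarked operation when each loop iteration
   also carries bookkeeping instructions.
   ops_per_iter:      benchmarked operations per iteration (8 in bench())
   overhead_per_iter: other instructions per iteration (ASSUMED value,
                      not taken from any disassembly in this thread)
   issue_width:       instructions issued per cycle (2 for dual issue) */
double modeled_cycles_per_op(int ops_per_iter, int overhead_per_iter,
                             int issue_width) {
    assert(ops_per_iter > 0 && overhead_per_iter >= 0 && issue_width > 0);
    double cycles_per_iter =
        (double)(ops_per_iter + overhead_per_iter) / issue_width;
    return cycles_per_iter / ops_per_iter;
}
```

Under this simple model, 3 assumed overhead instructions per iteration give 0.6875 cycles per operation, and 4 give 0.75, which is in the same range as the numbers reported here. It does not account for a result like 2.25.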

lemaitre · 2019-09-04

I redid the benchmark from scratch in a new project, and now I measure the correct throughput for both ADD and ORR. I still don't understand what happened, but I now consider the issue solved.

Sorry for the disturbance and thank you for the feedback.