Resolved! I recently ported my STM32F413 application (running at 100 MHz) to an STM32H7A3 (running at 280 MHz), but my DSP routine does not run 2.8x faster. Does anyone have any idea why?
I measured the time taken to run a length-127 multiply-accumulate routine by toggling a GPIO pin before and after the arm_dot_prod_q15() call and measuring the pulse width on a scope. It was 7 us on the STM32F413 and 6 us on the STM32H7A3.
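To put those two pulse widths on the same footing, it may help to convert them to CPU cycles. Below is a minimal sketch in plain C that does only this arithmetic. The helper names pulse_to_cycles() and cycles_per_mac() are made up for illustration and are not part of CMSIS-DSP or the ST HAL. It also assumes the scope readings are exact and ignores the GPIO toggle overhead, so treat the results as rough.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a measured pulse width to CPU cycles.
 * Microseconds times MHz gives cycles directly (1 us at 1 MHz = 1 cycle). */
static uint32_t pulse_to_cycles(uint32_t pulse_us, uint32_t core_mhz)
{
    return pulse_us * core_mhz;
}

/* Hypothetical helper: average cycles spent per multiply-accumulate,
 * including any call and loop overhead inside the measured window. */
static double cycles_per_mac(uint32_t cycles, uint32_t num_macs)
{
    assert(num_macs > 0); /* avoid dividing by zero */
    return (double)cycles / (double)num_macs;
}
```

With the numbers from my measurement, pulse_to_cycles(7, 100) gives 700 cycles on the STM32F413 and pulse_to_cycles(6, 280) gives 1680 cycles on the STM32H7A3. For 127 taps that is roughly 5.5 cycles per MAC on the F413 versus roughly 13.2 on the H7A3, so the H7A3 is actually spending more than twice as many cycles on the same routine, not just failing to scale with the clock.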