Issue with the execution time of NOP instruction [STM32F746G-DISCO]

Ciavolino.Giuseppe · ‎2017-07-26

Posted on July 27, 2017 at 02:34

Hello everyone,

Before starting a project that involves digital signal processing, I'm doing some tests with the STM32F746G-DISCO to evaluate the capabilities of the board.

In particular, I've measured (by toggling GPIOI_PIN_1) the execution time of a NOP: 60 ns.

I'm using CubeMX and I've set up the clock configuration to run at 216 MHz (the maximum frequency).

Also, in the 'Cortex_M7 Configuration' section I've enabled: TCM Interface, ART ACCELERATOR, Instruction Prefetch, CPU ICache and CPU DCache.

I'm a little bit upset, because 60 ns for a NOP seems at odds with the idea of a system core clock that runs at 216 MHz.

I'm doing something wrong, I'm sure, but I really don't understand where. I've checked the RCC registers and their content is consistent with the code generated by CubeMX.

Is it possible that I've made a mistake? The related documents (datasheet, reference manual, programming manual, etc.) don't contain the information I'm looking for.

With pipelining, one machine cycle should correspond to one instruction... so why does this happen?

Sorry for the bad English, this is the first time that I post on an international forum.

Thanks for the attention.

#cortex-m #arm #stm32f7 #stm32-cube-mx #cycle-machine #nop #execution-time

Danish1 · ‎2017-07-27

Posted on July 27, 2017 at 11:04

How did you measure the time of a NOP?

How sure are you that you are measuring just the time of the NOP and not all the overheads?

If you're doing something like (pseudo-code)

while (1) { GPIOI_PIN_1 = !GPIOI_PIN_1; NOP; }

Then the NOP is the _least_ of the things that take time.

You've got the overhead of the jump to make the loop. With a pipelined processor, this can be several cycles.

And (much more significantly) the overhead of reading the port, modifying the value and then writing back the value.
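For a sense of scale, here is a minimal sketch of the arithmetic, using only the two numbers from the original post (216 MHz and 60 ns). The function names are made up for illustration.

```c
#include <stdint.h>

/* Length of one core clock cycle in picoseconds (truncated). */
uint32_t cycle_time_ps(uint32_t core_hz)
{
    return (uint32_t)(1000000000000ULL / core_hz);
}

/* Number of core cycles in a duration given in ns, rounded to nearest. */
uint32_t cycles_in_ns(uint32_t core_hz, uint32_t duration_ns)
{
    uint64_t scaled = (uint64_t)core_hz * duration_ns;
    return (uint32_t)((scaled + 500000000ULL) / 1000000000ULL);
}
```

At 216 MHz one cycle is roughly 4.6 ns, so 60 ns is about 13 cycles. That is far more than a single NOP should need, which suggests that most of the measured time is the loop and the port access.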

ST did a very good on-line course on stm32f7. I strongly recommend that you read the slides even if you don't actually get the hardware and follow it yourself.

Cortex M7 has dual-issue so it can execute two (non-interfering) instructions simultaneously.

I suppose what you could do is make your test loop:

while (1) {

GPIOI_PIN_1 = !GPIOI_PIN_1; NOP; NOP; NOP; NOP; NOP; NOP; NOP; NOP; NOP; NOP; NOP; NOP;

}

And then see how the toggle HALF-PERIOD (not frequency) depends on the number of NOPs.

But watch out - an optimiser might realise that the NOPs do nothing and remove them. So do look at the code produced by your compiler before drawing any conclusions.
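A rough sketch of such a loop in real C, assuming a GCC-style compiler. The register is passed in as a pointer so that the function does not depend on any particular device header; on the board you would pass the address of the GPIOI output data register (check the name in your own header, I'm not asserting it here). The `volatile` on the asm statement keeps the compiler from deleting the NOPs, but it is still worth checking the disassembly.

```c
#include <stdint.h>

/* One NOP that the compiler will not optimise away (GCC/Clang syntax). */
#define NOP() __asm__ volatile ("nop")

/*
 * Toggle the bits in `mask` of the register at `odr`, `count` times,
 * with 12 NOPs after each toggle.
 * `odr` is a hypothetical parameter: on the target it would point at the
 * port's output data register.
 */
void toggle_with_nops(volatile uint32_t *odr, uint32_t mask, uint32_t count)
{
    while (count--) {
        *odr ^= mask;                  /* read, modify, write back */
        NOP(); NOP(); NOP(); NOP(); NOP(); NOP();
        NOP(); NOP(); NOP(); NOP(); NOP(); NOP();
    }
}
```

Changing the number of NOP() lines and watching how the half-period moves on the scope gives the slope per NOP, independent of the fixed loop and port overhead.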

Hope this helps,

Danish

Jan Waclawek · ‎2017-07-27

Posted on July 27, 2017 at 11:13

i've measured (toggling the GPIOI_PIN_1) the execution time of NOP

You've measured the execution time of the NOP, plus the execution time of the instructions toggling the pin, plus whatever instructions the C compiler inserted (unless you used asm), plus the time needed to fetch the instructions, plus the time needed to propagate the toggle write from the processor through the bus matrix and the GPIO unit to the pin. I might have forgotten a few things.

It might quite well be that the NOP was thrown away in the prefetch unit, so its execution time was 0.

Welcome to the world of 32-bitters. These are not microcontrollers anymore - SoC rather.

JW

AVI-crak · ‎2017-07-27

Posted on July 27, 2017 at 11:34

You need to write directly to the registers.

Place the code in SRAM, to avoid slow reads from flash.

Use inline assembler so that GCC's optimizer leaves the code alone.

Use a simple loop that counts down from the maximum value to zero.

Put a large number of NOP instructions (10-50) in the body of the loop.

Use the DWT cycle counter to count the cycles.

Use an external MCO1 / MCO2 pin to monitor the system frequency.

Kill the desire to use HAL, and begin to study the documentation.

Change the profession, country or sex in the kitchen.

Create your own processor, your own forum, and troll users.
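The DWT suggestion above could look roughly like the sketch below. It uses the raw ARMv7-M debug register addresses instead of CMSIS names so that it stands alone; the unlock write to the lock access register is an assumption that applies to some Cortex-M7 parts and should be harmless elsewhere. The function and macro names are mine, not from any library.

```c
#include <stdint.h>

/* ARMv7-M debug registers (addresses defined by the architecture). */
#define DEMCR      (*(volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)(uintptr_t)0xE0001004u)
#define DWT_LAR    (*(volatile uint32_t *)(uintptr_t)0xE0001FB0u)

/* Ten NOPs in one asm statement that the compiler will not delete. */
#define NOP10() __asm__ volatile ("nop\n\tnop\n\tnop\n\tnop\n\tnop\n\t" \
                                  "nop\n\tnop\n\tnop\n\tnop\n\tnop")

/* Target only: switch on the cycle counter. */
void cycle_counter_enable(void)
{
    DEMCR     |= (1u << 24);    /* TRCENA: enable the DWT unit */
    DWT_LAR    = 0xC5ACCE55u;   /* unlock (assumed to be needed on this M7) */
    DWT_CYCCNT = 0u;
    DWT_CTRL  |= 1u;            /* CYCCNTENA: count core clock cycles */
}

/* Elapsed cycles between two readings; unsigned math survives one wrap. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Target only: cycles for 50 NOPs, including the cost of the two reads. */
uint32_t measure_50_nops(void)
{
    uint32_t start = DWT_CYCCNT;
    NOP10(); NOP10(); NOP10(); NOP10(); NOP10();
    uint32_t end = DWT_CYCCNT;
    return cycles_elapsed(start, end);
}
```

Call cycle_counter_enable() once, then measure_50_nops(). Running the same measurement with the NOP10() lines removed gives the overhead to subtract. On a dual-issue core the result may well come out below one cycle per NOP.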

AvaTar · ‎2017-07-27

Posted on July 27, 2017 at 16:59

Agree, it's not that simple.

I suggest measuring the toggling alone, then the toggling plus multiple NOPs (100 or 1000).

Finally subtract the toggle time.
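The subtraction can be written down as a small helper. This is only a sketch; the function name is made up and the example numbers in the note below are hypothetical, not measurements.

```c
#include <stdint.h>

/*
 * Average time per NOP in picoseconds, given the measured duration of
 * "toggle + N NOPs" and of "toggle alone", both in nanoseconds.
 * Returns 0 if the inputs make no sense.
 */
uint32_t per_nop_ps(uint32_t with_nops_ns, uint32_t toggle_only_ns,
                    uint32_t nop_count)
{
    if (nop_count == 0u || with_nops_ns < toggle_only_ns) {
        return 0u;
    }
    uint64_t diff_ps = (uint64_t)(with_nops_ns - toggle_only_ns) * 1000u;
    return (uint32_t)(diff_ps / nop_count);
}
```

For instance, if the toggle alone took 60 ns and the toggle plus 1000 NOPs took 4690 ns (invented figures), the helper would return 4630 ps per NOP, i.e. about one cycle at 216 MHz.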

In the Linux world, that measure is called BogoMIPS ...

David SIORPAES · ‎2017-07-28

Posted on July 28, 2017 at 10:58

Try using the SEV instruction to emit a pulse instead of toggling GPIOs.

Wrapping 10 NOPs with SEVs on an STM32F401 clocked at 84 MHz gives a result consistent with what is expected.

Ciavolino.Giuseppe · ‎2017-07-28

Posted on July 28, 2017 at 13:30

I apologize for not answering yet, but as you have guessed, I do not have great skills in the embedded field and I'm trying to match your tips to what I know.

Also, my English is bad and I want to avoid saying stupid things. Unfortunately I do not have an oscilloscope at home and can only use the one at the university, so I can't do specific tests right now, but on Monday I will post my results.

I would like to use the STM32F746G-DISCO for the acquisition of environmental noise, using the WM8994 codec for the ADC/DAC part and the MCU for the data processing.

I've implemented the communication between the MCU and the WM8994 (I2C for the settings and SAI for the data) and now I would like to test what the board is capable of. I'm using DMA in circular mode to exchange data between the codec and the microcontroller: every time a sample is received, the DMA triggers an interrupt routine, and during this routine I will do some processing.

So, before working on the algorithm, I would like to know whether the board is capable enough, and for this reason I tried to measure the NOP.

Thank you so much for the support, and sorry for any stupid things I may have said.

AvaTar · ‎2017-07-28

Posted on July 28, 2017 at 11:52

Honestly, I don't understand the idea behind this.

For a synchronous MCU design with given clock and an instruction in the pipeline, you get - surprise, surprise - the execution time stated in the datasheet.

Testing an instruction sequence under realistic conditions (clock, Flash latencies, caches, interrupt latencies, DMA bus load, etc.) gives you more (and useful) information.
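As a starting point for that kind of estimate, the raw cycle budget per audio sample is simple to compute. This is a sketch only: the 48 kHz sample rate in the note below is my assumption (the thread does not state one), and the function name is invented.

```c
#include <stdint.h>

/* Core clock cycles available between two consecutive samples. */
uint32_t cycles_per_sample(uint32_t core_hz, uint32_t sample_rate_hz)
{
    if (sample_rate_hz == 0u) {
        return 0u;   /* avoid dividing by zero */
    }
    return core_hz / sample_rate_hz;
}
```

At 216 MHz and an assumed 48 kHz that is 4500 cycles per sample, before interrupt entry/exit, flash wait states, cache misses and DMA bus load are taken out of it. Those deductions are what a realistic test would show.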

David SIORPAES · ‎2017-07-28

Posted on July 28, 2017 at 12:24

As far as I understood the OP is surprised about a NOP instruction execution time he measured (60ns).

I was just suggesting a better method to successfully accomplish what he had in mind, i.e. measuring the execution time of a NOP instruction.

AvaTar · ‎2017-07-28

Posted on July 28, 2017 at 12:29

As far as I understood the OP is surprised about a NOP instruction execution time he measured (60ns).

Yes, my understanding as well.

My comment was directed towards the OP.

I would be really surprised if that measurement (if done correctly!) yielded anything other than the datasheet-specified time.

Thus the measurement is IMHO pretty worthless.