2013-05-19 11:44 PM
So I've been noticing recently that my stm32f4 discovery board is running under 168 MHz. Or at least I think. I'm trying to get this loop
uint32_t i = 0;
for (; i < 76799; i++)
{
    GPIOD->ODR = 0x4100;
    GPIOD->ODR = 0x5100;
}
to execute as fast as possible and I noticed that one iteration of the loop takes 143 ns to complete, which is about 24 cycles @ 168 MHz. Surely comparing a 32-bit integer, writing to a register twice, and jumping back to the beginning shouldn't take 24 cycles?
If I probe PD12 (the pin that's changing), it takes about 21 ns from the time it goes low to the time it goes high again. 48 MHz? What's going on here?
I've attached the system_stm32f4xx.c just in case.
EDIT: I also should have mentioned that other clocks (SPI, SDIO, TIM4) work correctly so maybe this MCU is just slower than I expected.
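As a sanity check on the figures in this post, the cycle count is just the measured time multiplied by the core clock. A minimal host-side sketch, assuming the core really does run at 168 MHz; the helper names `ns_to_cycles` and `period_ns_to_khz` are made up for illustration and are not part of any ST library:

```c
#include <stdint.h>

/* Hypothetical helper: core clock cycles in a measured time,
   rounded to the nearest cycle. cycles = t[ns] * f[MHz] / 1000. */
static uint32_t ns_to_cycles(uint32_t time_ns, uint32_t core_mhz)
{
    return (time_ns * core_mhz + 500u) / 1000u;
}

/* Hypothetical helper: frequency in kHz of a signal whose full
   period is period_ns (integer division, so the result is truncated). */
static uint32_t period_ns_to_khz(uint32_t period_ns)
{
    return 1000000u / period_ns;
}
```

With the numbers above, `ns_to_cycles(143, 168)` gives the 24 cycles per iteration, and 21 ns comes out at roughly 3.5 cycles (rounded up to 4 by this helper). Treating 21 ns as a full period, as the post does, gives a little under 48 MHz.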
2013-05-21 04:29 PM
@knik, thanks for the example. I'll try to work through it. However, this is only good for one line. What if I want to insert a chunk of asm (like the loop Clive posted)? I presume there is a better way to do it than writing multiple lines of asm volatile("stuff"). If I did write it line after line, can I be sure that the compiler will leave it in that order? I don't want my branches jumping somewhere they're not supposed to.
@Jan I used the STM32 ST-LINK Utility to step through the instructions. No, I'm not stepping through C code; I understand that's not very useful in this scenario. But you are likely correct when it comes to what the Utility is hiding from me. As I'm sure you've guessed by now, I'm no professional when it comes to programming. You are of course correct in catching ODR. I meant to write MODER; they looked pretty similar last night, I suppose. After changing that and setting the pin to 100 MHz in OSPEEDR it works, giving 42 MHz bang on.
2013-05-21 11:04 PM
> @knik, thanks for the example. I'll try to work through it. However, this is only good for one line. What if I want to insert a chunk of asm (like the loop Clive posted)?
No problem, you can use multiple insns in a single asm:
int out1 = 0x888;
int out2 = 0x222;
/* Three halfword stores to the same register, emitted in this order. */
asm volatile ("strh %1, [%0]\nstrh %2, [%0]\nstrh %1, [%0]\n"
    :: "r"(&GPIOC->ODR),
       "r"(out1), "r"(out2)
);
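On the ordering question: as far as GCC is concerned, the template of a single asm statement is passed to the assembler as one unit, so the instructions inside it stay in the order written, and a branch to a local label inside the same statement stays valid. A rough sketch of a whole toggle loop done that way follows. The helper name `toggle_burst` and the named operands are made up for illustration. The asm path is only compiled on a 32-bit ARM target and has not been timed here. On any other machine a plain C loop with the same writes stands in, so the sketch can be built and checked on a PC.

```c
#include <stdint.h>

/* Hypothetical helper: write hi, then lo, to *odr, count times. */
static void toggle_burst(volatile uint16_t *odr, uint16_t hi, uint16_t lo,
                         uint32_t count)
{
    if (count == 0u) {
        return; /* the asm loop counts down and must not start at 0 */
    }
#if defined(__arm__)
    __asm__ volatile(
        "1:\n\t"                        /* local label; note the colon */
        "strh %[hi], [%[odr]]\n\t"      /* first write */
        "strh %[lo], [%[odr]]\n\t"      /* second write */
        "subs %[n], %[n], #1\n\t"       /* decrement the counter, set flags */
        "bne 1b\n\t"                    /* loop back to label 1 while n != 0 */
        : [n] "+r"(count)
        : [odr] "r"(odr), [hi] "r"(hi), [lo] "r"(lo)
        : "cc", "memory");
#else
    /* Fallback for non-ARM hosts: same writes, no timing guarantees. */
    while (count--) {
        *odr = hi;
        *odr = lo;
    }
#endif
}
```

On the board this would be called with something like `(volatile uint16_t *)&GPIOD->ODR` as the first argument. The `"cc"` clobber tells the compiler that the flags change, and `"+r"` marks the counter as both read and written.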
2013-05-22 12:27 AM
2013-05-22 03:24 AM
> No, I'm not stepping through C code, I understand that's not very useful in this scenario. But you are likely correct when it comes to what the Utility is hiding from me.
I was assuming you are stepping through C code because you did not post the disassembly which would show us the instructions through which you are stepping. The compiler might generate very different sequences of instructions depending on various optimization and other settings.
I also find it strange that you determined a port write (one str* instruction) to be 4 cycles, that's quite a lot - IMHO it indicates something unusual - e.g. execution from SRAM (?).
You might find interesting a discussion we had on a very similar topic on a local mailing list recently, http://list.hw.cz/pipermail/hw-list/2013-April/438309.html and followup (you might understand a bit of Slovak/Czech I guess, or use help of some automatic translator; and the pictures are sort of self-explanatory). Upper trace on the picture r.png is a pulse appearing when SEV is executed; middle trace is the pin modified by GPIO writes; lower trace is system clock output 1:1 onto MCO. Traces are taken at default HSI=16MHz clock, but the same waveforms appear with gearing up to 168MHz (except due to my LA's limitation the pulses appear to be irregular, that's why I did not use that for the posted pictures).
> I meant to write MODER, they look pretty similar last night I suppose.
It's always loads of fun when one reads what he previously wrote half-sleeping... ;)
JW
2013-05-22 04:34 AM
> I also find it strange that you determined a port write (one str* instruction) to be 4 cycles, that's quite a lot - IMHO it indicates something unusual - e.g. execution from SRAM (?).
It really wouldn't surprise me that an AHB transfer to the GPIO peripheral would take 4 HCLK, the Write Buffer might normally mask this, except for the back-to-back writing exposing the true latency/throughput. I can't find a quick cite, but the CRC unit admits to a 4-cycle speed, which is quite high for a parallel implementation with random logic.
2013-05-22 04:53 AM
> It really wouldn't surprise me that an AHB transfer to the GPIO peripheral would take 4 HCLK,
> the Write Buffer might normally mask this, except for the back-to-back writing exposing the true latency/throughput.
I am surprised by that, especially in light of the results of the discussion I linked to. The conclusion there was that the write to GPIO is indeed 1 cycle; it's the specifics of the AHB transaction (i.e. the CPU-to-AHB bridge and the need to "reclaim" the bus) which might slow things down in adverse cases, when it took 3 cycles. I have also learned in that thread that older-than-'F4xx STM32 devices have GPIO on the APB bus, which of course changes the picture substantially (I was not aware of this as I am a relative newcomer to STM32); but here we discuss specifically 'F4xx.
> I can't find a quick cite, but the CRC unit admits to a 4-cycle speed, which is quite high for a parallel implementation with random logic.
I'd expect the GPIO (including the RMW feature) being an order of magnitude (or maybe even two) simpler than CRC.
JW
2013-05-22 07:03 AM
> I'd expect the GPIO (including the RMW feature) being an order of magnitude (or maybe even two) simpler than CRC.
The logic involved is not that complex or deep. I've designed silicon that can do a 16-bit CRC operation in a single clock cycle, and isn't close to being a critical path between two flip-flops in asynchronous design. If you can build a 32-bit adder that runs at 168 MHz, a 32-bit CRC should be a trivial exercise. These things should be doable at bus speed. The GPIO units on the F1s were rather slow, the toggle speed tests on the other STM32s have always been lackluster, and people have been here regularly complaining about it.
2013-05-22 07:47 AM
Using a test based on my assembler code, each "str" takes 1 cycle, runs at 168 MHz, and generates a toggling output close to 84 MHz.
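The relation between store cost and output frequency can be put into a small model: one full period on the pin needs two writes (one low, one high), plus whatever loop overhead sits between them. A minimal host-side sketch under that assumption; the helper name `toggle_khz` is made up for illustration, and the model ignores bus stalls and flash wait states:

```c
#include <stdint.h>

/* Hypothetical helper: pin frequency in kHz when each output period
   costs two writes plus some fixed overhead, all counted in HCLK cycles. */
static uint32_t toggle_khz(uint32_t hclk_khz, uint32_t cycles_per_write,
                           uint32_t overhead_cycles)
{
    return hclk_khz / (2u * cycles_per_write + overhead_cycles);
}
```

Under this model, `toggle_khz(168000, 1, 0)` gives the 84 MHz quoted here. Two cycles per write would give the 42 MHz seen earlier in the thread, and a 24-cycle iteration (two 1-cycle writes plus 22 cycles of overhead) would give about 7 MHz.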
2013-05-22 04:54 PM
Sorry for the late reply, it turns out this forum doesn't accept input from my phone correctly. This is what was supposed to be in the blank post up there.
"OK, I see. It follows a similar convention as printf and whatnot. After spending all this time I finally figured out that labels need a colon immediately following, and I made a loop. The pin toggling is now plenty fast. It took my 3 MHz C loop up to about 30 MHz with asm. Thanks for all the help guys."
@Jan, here are the 4 instructions the compiler came up with:
GPIOD->ODR = 0x4100;
80017c2: f44f 6340  mov.w r3, #3072   ; 0xc00
80017c6: f2c4 0302  movt  r3, #16386  ; 0x4002
80017ca: f44f 4282  mov.w r2, #16640  ; 0x4100
80017ce: 615a       str   r2, [r3, #20]
I don't understand why the GPIO address is being loaded 16 bits at a time, but other than that it looks as fast as you can get without knowing that your registers are loaded with the right data.
Also, that's a very interesting image. I didn't know that new instructions could continue being fetched and executed while the GPIO (and presumably other peripherals?) is still busy.
@clive1, Without any loop overhead I can confirm that I get 84 MHz as well with just STR. Turns out that's actually too fast! A NOP and the subtract instruction fixed it though.
2013-05-22 06:45 PM
> Sorry for the late reply, it turns out this forum doesn't accept input from my phone correctly. This is what was supposed to be in the blank post up there.
Yeah, it's pretty hopeless software, I tried from a Nexus 7 tablet once, with similar results.
> I don't understand why the GPIO address is being loaded 16 bits at a time, but other than that it looks as fast as you can get without knowing that your registers are loaded with the right data.
The ARM instruction set is capable of loading some immediate/literal values directly; others that don't encode can be placed in a "literal pool", typically at the end of a subroutine, and loaded with a PC-relative load. Thumb-2 added instructions that load the top/bottom 16-bit halves of a register. The way the flash lines work, along with prefetch/pipelining, works in favour of in-lining the literal in two pieces.
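To make the two-piece load concrete, the immediates from the listing posted earlier in the thread can be put back together by hand: the bottom half 0x0c00 and the top half 0x4002 form the base address in r3, and the store then adds an offset of 20. A minimal host-side sketch; the helper names `join_halves` and `store_address` are made up for illustration:

```c
#include <stdint.h>

/* Hypothetical helper: combine the two 16-bit immediates of a
   mov.w/movt pair into the resulting 32-bit register value. */
static uint32_t join_halves(uint16_t top, uint16_t bottom)
{
    return ((uint32_t)top << 16) | bottom;
}

/* Hypothetical helper: effective address of "str rX, [rBase, #offset]". */
static uint32_t store_address(uint32_t base, uint32_t offset)
{
    return base + offset;
}
```

With the values from the listing, `join_halves(0x4002, 0x0C00)` yields 0x40020C00, and adding the offset of 20 (0x14) gives 0x40020C14, the address the `str` writes to for `GPIOD->ODR`.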