STM32F40x @ 168MHz wait states and execution from RAM

arne
Associate II
Posted on January 16, 2013 at 10:06

Hi,

our institute is currently using STM32F103 devices in a distributed control system. However, we are evaluating a switch to STM32F40x devices for the sake of the FPU and the higher clock speed at comparable current consumption (according to the datasheet).

Toying with the Clock Configuration Tool (AN3988) I stumbled across the wait states required at the full 168MHz clock speed. Can someone point me to the relevant documentation to understand how these affect performance (who is waiting for whom, and what to do about it)?

Also, suspecting the Flash may be the bottleneck (wild guess here), would running the program from RAM improve performance in general purpose applications?

Thanks in advance.

Arne

#stm32f40x
Posted on January 16, 2013 at 10:32

In every MCU the FLASH is the bottleneck. The fastest embedded FLASH memories today are around 20ns (Fujitsu, IIRC); ST's is apparently somewhat slower, around 35ns at full supply voltage (slower at lower voltages, of course), allowing some 30MHz operation at zero wait states. It's still an impressive speed, though.

To compensate for that, ST uses a wide FLASH word (128 bits, i.e. 8 instruction halfwords are fetched at once) plus something they call the ART accelerator, which is just a relabelled jump cache of the kind used in the industry for years, e.g. in the 100MHz SiLabs (ex-Cygnal) '51s. And they amended it with a marketing lie, claiming that the accelerator lets the core run at full speed without apparent wait states, which of course is true only for the synthetic CoreMark benchmark they base the claim on. (I guess they deliberately tuned the jump cache to this benchmark.) The impact of the cache on the performance of real-world programs is of course very hard to estimate and depends on many factors.

Details on all this are to be found (somewhat surprisingly) in PM0081.
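
Not from PM0081 verbatim, just a hedged sketch of what the usual F4 startup code does with FLASH_ACR at 168MHz, using the CMSIS bit names; the latency value assumes VDD = 2.7..3.6V and must be set before switching to the fast clock:

#include "stm32f4xx.h"   /* CMSIS device header, assumed to be in the project */

/* Sketch only: 5 wait states for 168MHz at 2.7..3.6V, plus the ART
 * prefetch, instruction cache and data cache enables described in PM0081. */
static void flash_art_setup(void)
{
    FLASH->ACR = FLASH_ACR_LATENCY_5WS   /* wait states per the RM0090 table */
               | FLASH_ACR_PRFTEN        /* prefetch on the 128-bit flash lines */
               | FLASH_ACR_ICEN          /* ART instruction cache */
               | FLASH_ACR_DCEN;         /* ART data cache (literal pools etc.) */
}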

JW

arne
Associate II
Posted on January 16, 2013 at 10:48

So you're confirming my initial guess. However, in STM32 Technical Updates Issue 1 (top post on this forum) a benchmark is posted (page 42) that rates the code-from-flash setup as faster than any other, due to wait states inside the bus matrix. Of course, without knowledge of the exact nature of the benchmarking code all this is a bit fluffy, but it makes me wonder nevertheless.

Besides, are there special issues when running from RAM? Like disallowed instructions, etc.?

Posted on January 16, 2013 at 12:02

Ah, I see. This is unrelated to the FLASH fetch latency, and I guess those tests were performed either at low speed (with 0 WS in the FLASH controller), or with code simple enough that all the instructions were kept in the cache.

It would be nice to see the code for that benchmark, though (anybody from ST listening?)

This is a separate issue, and you will see many more like it once you start to use the DMAs... Face it, these chips are not straightforward microcontrollers anymore, with strict and simple timing relationships. This is why I like to use the traditional system-on-chip tag for them: they are just like the old boards with a processor, a memory subsystem and an IO subsystem, all in separate chips and with various mutual timing relationships...

JW

[EDIT 2018] STM32 Technical Updates as attachments to https://community.st.com/s/question/0D50X00009Xkba4SAB/stm32-technical-updates [/EDIT]

Posted on January 16, 2013 at 13:06

Some more research: while I don't pretend to understand the reasons behind the results of said benchmark, one thing that is strange is that running from SRAM was always slower than running from FLASH.

I believe, the answer lies in http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0337g/BABGAJAB.html

12.5.6. Pipelined instruction fetches

Note

To provide a clean timing interface on the System bus, instruction and vector fetch requests to this bus are registered. This results in an additional cycle of latency because instructions fetched from the System bus take two cycles. This also means that back-to-back instruction fetches from the System bus are not possible.

12.5.6. Pipelined instruction fetches

Note

Instruction fetch requests to the ICode bus are not registered. Performance critical code must run from the ICode interface.

Note that in those cases where the SRAM was remapped to ICode, there were other reasons (bus contention) for the slowdown.

This is a Cortex-M3 manual, but I believe the Cortex-M4 is no different in this regard. Thus, I believe you should see similar results on the F1xx too.
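
For completeness, a hedged sketch of the F4 remap that puts the embedded SRAM at 0x00000000 so that fetches from it go out on the ICode bus rather than the System bus (register and bit names per the CMSIS header; it assumes the code and vector table have already been copied into SRAM, and the bus-contention caveat above still applies):

#include "stm32f4xx.h"

/* Sketch only: remap the embedded SRAM to address 0x00000000 on an F4. */
static void remap_sram_to_zero(void)
{
    RCC->APB2ENR  |= RCC_APB2ENR_SYSCFGEN;     /* SYSCFG needs its clock enabled */
    SYSCFG->MEMRMP = (SYSCFG->MEMRMP & ~SYSCFG_MEMRMP_MEM_MODE)
                   | SYSCFG_MEMRMP_MEM_MODE;   /* MEM_MODE = 0b11: SRAM at 0x0 */
}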

JW

frankmeyer9
Associate II
Posted on January 16, 2013 at 13:10

I basically agree with JW. A benchmark doesn't mean much if you don't reveal everything. Even Microchip advertises its 0.04 CoreMark PIC18F as a ''high performance RISC controller''.

I remember that somewhere in the technical docs there is a table buried that relates clock frequencies and Flash wait states. And to add to JW's remark, I believe it was 5 wait states at 168MHz.
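
If I read that table (RM0090/PM0081) right, at VDD = 2.7..3.6V it boils down to roughly one extra wait state per 30MHz of HCLK; a hedged sketch of the lookup (lower supply voltages need more wait states at the same frequency):

#include <stdint.h>

/* Sketch only: flash wait states vs. HCLK for an STM32F40x, assuming the
 * 2.7..3.6V column of the RM0090 table. */
static uint32_t flash_wait_states(uint32_t hclk_hz)
{
    if (hclk_hz <=  30000000u) return 0;   /* 0 WS up to 30MHz */
    if (hclk_hz <=  60000000u) return 1;
    if (hclk_hz <=  90000000u) return 2;
    if (hclk_hz <= 120000000u) return 3;
    if (hclk_hz <= 150000000u) return 4;
    return 5;                              /* 5 WS from 150MHz up to 168MHz */
}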

The actual performance really depends on the application. The usual ''cache miss'' and ''pipeline stall'' issues apply.

Besides, are there special issues when running from RAM? Like disallowed instructions, etc.?

 

If memory serves me well, the F4 can't execute code from the CCM RAM, but otherwise has no restrictions.

Posted on January 16, 2013 at 16:57

If memory serves me well, the F4 can't execute code from the CCM RAM, but otherwise has no restrictions.

 

Indeed, the F4's 64KB CCM (0x10000000) is attached to the data bus, so it can't be executed from, but it provides contention-free access as it can't be used for DMA. On the F3 series chips the CCM is on the code/instruction bus, so it can execute instructions.

Executing from the primary 128KB of SRAM (0x20000000) is reasonably predictable/consistent, at least if you exclude DMA contention.
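
As an illustration of what executing from that SRAM looks like in practice, here's a hedged GCC-flavoured sketch; the .ramfunc section name and the matching linker-script/startup-copy support are assumptions about your toolchain:

#include <stdint.h>

/* Sketch only: place a hot routine into the 128KB SRAM at 0x20000000.
 * The linker script must put .ramfunc in RAM and the startup code must
 * copy it from flash, just as it does for .data. Don't place code in the
 * CCM at 0x10000000 on the F4: the core can't fetch instructions from it. */
__attribute__((section(".ramfunc"), noinline, long_call))
uint32_t hot_loop(const uint32_t *src, uint32_t n)
{
    uint32_t sum = 0;
    while (n--)
        sum += *src++;      /* placeholder work, purely for timing comparisons */
    return sum;
}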

Executing from FLASH can be somewhat inconsistent, but still within a range. It is impacted by the ART cache contents and by flash line boundaries. The flash looks to provide near-instantaneous prefetch (a critical path in the design, impacted by voltage, temperature and silicon speed, see the F2 errata), so under certain conditions it can benchmark faster than SRAM, with its narrower 32-bit read path and accesses that go all the way down to physical memory.

A breakdown of the ART function might make an interesting App Note from ST, along with some modelling, especially if it has some predictive/speculative read-ahead. It does a good job of masking the inherent slowness of the flash, but that slowness will be clearly exposed when a miss does occur (likely on memory loads/copies rather than code execution, or on DMA contention, where flash transactions won't use the ART).

I don't expect you'll see a massive benefit running from RAM; I suggest you benchmark against the core cycle counter in the trace unit (DWT_CYCCNT), as static analysis probably won't be simple. Where you will see massive differences is where you want to erase/write flash and don't want to stall the CPU for long periods, to the point where peripherals overflow/underflow.
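
Something like this (CMSIS names from core_cm4.h; the measured function is just a placeholder) is usually enough for that comparison:

#include "stm32f4xx.h"   /* pulls in core_cm4.h with the DWT/CoreDebug definitions */

/* Sketch only: cycle-accurate timing via the DWT cycle counter.
 * Run the same routine once from flash and once from SRAM and compare. */
static uint32_t cycles_for(void (*fn)(void))
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the trace block */
    DWT->CYCCNT       = 0;
    DWT->CTRL        |= DWT_CTRL_CYCCNTENA_Msk;      /* start the cycle counter */
    fn();
    return DWT->CYCCNT;                              /* elapsed core cycles */
}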

Nickname12657_O
Associate III
Posted on January 17, 2013 at 10:54

Hi All,

@fm: ''I remember that somewhere in the technical docs there is a table buried that relates clock frequencies and Flash wait states.''

It should be the PM0081 programming manual:

http://www.st.com/internet/com/TECHNICAL_RESOURCES/TECHNICAL_LITERATURE/PROGRAMMING_MANUAL/DM00023388.pdf

@jan: ''It would be nice to see the code for that benchmark''

On slide 41, it is said ''The test program is counting the number of iterations executed by CPU in a fixed period of time.''

The results on page 45 are given depending on the remapping.

Cheers,

STOne-32
frankmeyer9
Associate II
Posted on January 17, 2013 at 11:20

On slide 41, it is said ''The test program is counting the number of iterations executed by CPU in a fixed period of time.''

 

That reminds me of the (in)famous pseudo-benchmarks PCs were advertised with in the '90s. It does not only matter how many instructions an MCU can execute, but what it actually gets done. CoreMark is such a test.

I am not intending to run down the STM32F4. But in fact, most benchmark results are solely marketing instruments, and that's not an invention of ST.

 No one advertises his weak spots...

Posted on January 17, 2013 at 11:55

> @fm: ''I remember that somewhere in the technical docs there is a table buried that relates clock frequencies and Flash wait states.''

> It should be the PM0081 programming manual:
> http://www.st.com/internet/com/TECHNICAL_RESOURCES/TECHNICAL_LITERATURE/PROGRAMMING_MANUAL/DM00023388.pdf

PM0081 does indeed contain details on the ART, and I said that already in my first post (although I neglected to add the link). But as far as this particular table is concerned, while it's true that PM0081 contains it, I'd say the reference is the same table in RM0090 (Table 4 in the current revision 3). Note that it has changed between revisions 1 and 2 of the RM, and PM0081 appears to contain the oldest version.

Btw, I'd say the right place for this table would be the datasheet anyway, as it's apparently manufacturing/testing-related and -dependent. There *is* a related table (Table 12) in the datasheet, so I'd say everything related should be in one place. But then I would have many more suggestions, corrections and typo fixes for the manuals/datasheet, and it appears that ST does not care anyway.

> @jan: ''It would be nice to see the code for that benchmark''

> On slide 41, it is said ''The test program is counting the number of iterations executed by CPU in a fixed period of time.''

Iterations of *what*?

What's the problem with publishing the real stuff, i.e. the software used for benchmarking, along with the marketing fluff?

JW