STM32F4: 64 kB SRAM inaccessible to all peripherals?

infoinfo989 · ‎2011-11-02

Posted on November 03, 2011 at 03:37

We've been looking at the new STM32F4x because we could really use some additional internal memory, and the F4x family advertises up to 192 kB. Sounds great. So today I attended our local STM32F4x roadshow presentation, and in the process got quite the shock. The F4 block diagram showed this juicy big 64 kB SRAM hanging off the processor data bus, and the presenter said this RAM (named the "CCM data RAM") is inaccessible to the DMA controllers or peripherals. He said it can only be accessed by read and write instructions from the processor.

What?? You're kidding right?

Apparently he wasn't kidding, and the F4x datasheet has this to say on the subject:

The 64-Kbyte CCM (core coupled memory) data RAM is not part of the bus matrix (see Figure 1: System architecture). It can be accessed only through the CPU.

I'm sure that ST did this for a reason, and perhaps some people will be dancing in the streets with joy. But for us, where our application involves DMA'ing large amounts of data between various peripherals, that 64 kB of SRAM might as well be located on the moon.

Before we throw in the towel on the F4, I thought I'd throw out a crazy question to this board and ask if anyone knows any different to the above? Is there some way to get data into and out of that block of SRAM without needing to execute load and store software operations for every single word? How do we get data into and out of that big chunk of SRAM in an efficient manner? I'm just hoping there's some trick or piece of information I'm missing here, something that perhaps the presenter today wasn't aware of.

Many thanks.

John F. · ‎2011-11-03

Posted on November 03, 2011 at 09:10

Reference Manual RM0090

http://www.st.com/internet/com/TECHNICAL_RESOURCES/TECHNICAL_LITERATURE/REFERENCE_MANUAL/DM00031020.pdf

confirms: "The 64-Kbyte CCM (core coupled memory) data RAM is not part of the bus matrix (see Figure 1: System architecture). It can be accessed only through the CPU."

On the plus side, it is zero wait state access.

Tesla DeLorean · ‎2011-11-03

Posted on November 03, 2011 at 14:22

The use of TCM (Tightly Coupled Memory) is quite frequent in ARM designs. I haven't looked at the M4, but this is quite common on ARM9 designs where the cache/MMU are optional, and it basically provides a very fast, low-latency memory. At 168 MHz, I would imagine an AHB interaction with the memory would be quite invasive/expensive. It might be one of those performance/size trade-offs made. Different memory regions, with divergent performance, are also quite common. Think of Amiga chip-ram, or video memory. That said, I do understand your angst.

The STM32F4 seminar is still a couple of weeks out here, so I should probably look over the documents. One question I had for my ST reps the other day was if the part had multiple flash planes, because that's a huge issue for the F1's which grind to a halt when writing flash while executing from a different region.

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

Tesla DeLorean · ‎2011-11-03

Posted on November 03, 2011 at 15:40

You should be able to use the 112K and 16K SRAMs for peripheral data transactions. The 64K CCM would be appropriate for the stack, and management structures, along with application data or code.
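A minimal sketch of how firmware might guard that split, assuming the CCM range of 0x10000000 to 0x1000FFFF listed in the data sheet. The helper name `buffer_outside_ccm` and the idea of a run-time check are illustrative only and are not part of any ST library. It only says that a buffer does not touch CCM; it does not prove the buffer sits in DMA-capable SRAM.

```c
#include <stdbool.h>
#include <stdint.h>

/* CCM data RAM range per the data sheet: 0x10000000..0x1000FFFF (64 KB). */
#define CCM_BASE 0x10000000u
#define CCM_SIZE 0x00010000u

/*
 * Hypothetical helper: returns true when the buffer [addr, addr + len)
 * does not overlap the CCM range, i.e. it is not in the one RAM block
 * that the DMA controllers cannot reach. A zero-length buffer whose
 * address lies inside CCM is still reported as false.
 */
static bool buffer_outside_ccm(uint32_t addr, uint32_t len)
{
    uint64_t start = addr;
    uint64_t end = start + len;              /* exclusive end, 64-bit so it cannot wrap */
    uint64_t ccm_start = CCM_BASE;
    uint64_t ccm_end = ccm_start + CCM_SIZE; /* exclusive end of CCM */

    return end <= ccm_start || start >= ccm_end;
}
```

A driver could call this before programming a DMA stream and fall back to a CPU copy when it returns false.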

infoinfo989 · ‎2011-11-03

Posted on November 03, 2011 at 15:54

For our application, about 4k would be sufficient for the things you mention. Our app is all about moving large chunks of data, so we use the RAM (basically all that we can get) for buffering blocks of data between peripheral ports.

rmteo · ‎2011-11-03

Posted on November 03, 2011 at 16:39

Why not use a $10, 400MHz ARM9 with as much RAM as you want?

Tesla DeLorean · ‎2011-11-03

Posted on November 03, 2011 at 18:47

Yes, there are probably better solutions for a data shovelling exercise.

Note however that the F4 is 40% faster than the ATMEL M4 (SAM4S), which does not have TCM/CCM, or the DMA limitations that brings, at least as I scan the documentation. It would be interesting to evaluate if ST's partitioning affords superior processor performance if all DMA traffic is externalized and does not continually stall the core.

Frank: Is this transient data just passing through, or are you actively processing it? And if you can process it in real time, is there such a pressing need to hold/buffer so much of it?

infoinfo989 · ‎2011-11-03

Posted on November 03, 2011 at 20:42

Teo: we'd essentially been doing exactly that on a previous product (actually using an Analog Devices "Blackfin" DSP). The problem with that kind of solution is the cost. Having an external memory costs real money, in component cost as well as assembly cost, PCB real estate, etc. It also means higher power consumption and system EMI levels. Hence we're moving towards a more integrated solution for this version.

Clive: Most of the data is simply being moved. Not much processing happening. It's a camera type of application, so we're shuttling around a bunch of image and audio data, with some buffers for when the ports get busy and we have to briefly wait.

Also, just a note from your previous post. You mentioned 168 MHz, which reminded me of something. Another thing we learned at the seminar was that although the F4x will run at 1.8V, like the F2x does, it cannot run at 168 MHz at 1.8V. When powered at 1.8V, the F4x has a max speed of 128 MHz, giving it essentially the same performance as the existing F2x (120 MHz). To get full speed from this device requires a minimum of 2.4V.

Danish1 · ‎2011-11-05

Posted on November 05, 2011 at 18:22

My reading of the data sheet is that you also can't bit-band the CCM memory.

Incidentally I don't see any mention in the _reference manual_ of where the CCM is in the memory-map. It is only in the data sheet that it is listed as being from 0x10000000 to 0x1000FFFF.

- Danish

Tesla DeLorean · ‎2011-11-05

Posted on November 06, 2011 at 02:38

http://infocenter.arm.com/help/topic/com.arm.doc.dui0553a/DUI0553A_cortex_m4_dgug.pdf

The M4 documentation basically indicates that the bit-banding only supports the SRAM and peripherals at 0x20000000 and 0x40000000 respectively.

It does not appear to support it for the 0x10000000 CCM region, which suggests the purpose is high performance code execution and FPU access. Remember that the flash at 168 MHz still needs 5 wait states, i.e. 6 clock cycles of about 5.95 ns each, suggesting the flash itself is still ~35 ns; hopefully ART can mask it. Fujitsu has intrinsically faster flash.
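For reference, a rough sketch of the bit-band alias calculation from the ARM documentation (alias = alias base + byte offset x 32 + bit number x 4), using the standard Cortex-M4 regions: 1 MB at 0x20000000 aliased at 0x22000000, and 1 MB at 0x40000000 aliased at 0x42000000. The function name `bitband_alias` and the convention of returning 0 for "no alias" are made up for illustration.

```c
#include <stdint.h>

/* Cortex-M4 bit-band regions (1 MB each) and their alias bases. */
#define SRAM_BB_BASE    0x20000000u
#define SRAM_BB_ALIAS   0x22000000u
#define PERIPH_BB_BASE  0x40000000u
#define PERIPH_BB_ALIAS 0x42000000u
#define BB_REGION_SIZE  0x00100000u

/*
 * Hypothetical helper: returns the alias word address for bit `bit`
 * (0..7) of the byte at `addr`, or 0 if the address is not inside a
 * bit-band region. The CCM at 0x10000000 falls in the "0" case.
 */
static uint32_t bitband_alias(uint32_t addr, unsigned bit)
{
    if (bit > 7u) {
        return 0u;
    }
    if (addr >= SRAM_BB_BASE && addr - SRAM_BB_BASE < BB_REGION_SIZE) {
        return SRAM_BB_ALIAS + (addr - SRAM_BB_BASE) * 32u + bit * 4u;
    }
    if (addr >= PERIPH_BB_BASE && addr - PERIPH_BB_BASE < BB_REGION_SIZE) {
        return PERIPH_BB_ALIAS + (addr - PERIPH_BB_BASE) * 32u + bit * 4u;
    }
    return 0u; /* no bit-band alias, e.g. CCM or flash */
}
```

With this, bit 2 of the byte at 0x20000300 maps to 0x22006008, while any address in 0x10000000..0x1000FFFF yields 0.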

I'll note that the Freescale K60N512 more cleverly places their TCM region at 0x1FFF0000..0x1FFFFFFF, so that it wraps cleanly into the 0x20000000 SRAM providing a contiguous region. Instruction fetches from the lower portion have zero wait states, the upper takes one wait state, data accesses for both are zero.

NXP does not appear to have any TCM.

It will be interesting to see how various M4 implementations fare with FPU/DSP algorithms.
