How can CCM help to improve complex computation performance (such as FFT, image decoding)

cyqdtc
Associate II
Posted on June 10, 2014 at 04:55

The scenario is as follows:

1. Data is transferred by DMA from a camera or from an ADC via SPI to SRAM

2. Computation should be performed based on the incoming data.

3. The computation includes multiplication and division with look-up tables, etc.

How could CCM help improve speed in this process?

1. Should data be moved from SRAM to CCM before performing the computation, given that SRAM will still be accessed by DMA from time to time?

2. If data arrays are stored in CCM, the Cortex-M4 no longer needs the AHB bus to read data from SRAM. However, CCM is connected to the Cortex-M4 only through the D-bus. Will the computation instructions still be loaded from flash? Can the Cortex-M4 buffer instructions in some way?

3. Is it a good idea to store the look-up tables used during computation in CCM?

4. Anything else? Comments are really appreciated.

Thank you very much.

jpeacock2399
Associate II
Posted on June 10, 2014 at 15:52

Look at the bus matrix, in particular how to avoid contention with DMA and how to take advantage of the D cache.  One approach is to DMA into the cached SRAM bank while running the program from flash and stack/heap from CCM.  This allows the DMA to run in parallel with code, no contention.  Effectively the DMA transfers are free, in that there's no impact on instruction fetches.
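
For illustration, a minimal GCC sketch of that split (the ".ccmram" section name is an assumption and needs a matching entry in your linker script; note that the DMA controllers cannot reach CCM, so only CPU-private data belongs there):

    #include <stdint.h>

    static uint16_t dma_buf[2048];                                  /* SRAM: DMA target */
    static uint8_t  work[4096] __attribute__((section(".ccmram"))); /* CCM: CPU-only scratch */

    /* The stack and heap also go to CCM via the linker script, so the
       core runs from flash and CCM while the DMA owns the SRAM bank. */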

There's no benefit in copying data from SRAM to CCM after a DMA.  Instead, do the DSP processing by fetching the X operand from the DMA buffer in SRAM and the Y operand from CCM, and store the result back in CCM.  This minimizes DMA contention in the DSP loop.  If you copy the data to CCM first you lose the overlapped access, you have bus contention with the ongoing DMA, and you waste the copy time.
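
A hedged sketch of that overlapped-access loop (names are illustrative, and the ".ccmram" placement again assumes linker-script support):

    #include <stdint.h>

    #define N 1024
    extern volatile uint16_t x_buf[N];                              /* SRAM: DMA writes X here */
    static uint16_t y_lut[N]  __attribute__((section(".ccmram")));  /* CCM: Y operand table */
    static uint32_t result[N] __attribute__((section(".ccmram")));  /* CCM: results */

    void process_block(void)
    {
        /* X arrives over the system bus; Y and the stores use the core's
           D-bus to CCM, so the loop does not fight the ongoing DMA. */
        for (int i = 0; i < N; i++)
            result[i] = (uint32_t)x_buf[i] * y_lut[i];
    }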

Depending on the DMA transfer size you might benefit from placing the DMA buffer in the second, uncached SRAM bank.  You do lose the benefit of caching on sequential access, but you might make it up by placing tables in the cached SRAM for overlapped access.

Make sure your DMA buffer is word aligned, starts on a 1K boundary, and ideally has a transfer size in whole words.  That gives you the maximum benefit of the DMA FIFO: 4-word bursts for a potential 75% reduction in bus contention.
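
Declaring such a buffer in GCC might look like this (the size is a placeholder):

    /* Word elements, 1KB-aligned start, whole-word total size, so that
       4-word FIFO bursts never cross a 1KB address boundary. */
    static uint32_t dma_buf[256] __attribute__((aligned(1024)));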

Depending on DMA contention you can mix table placement between flash (for constants), CCM and cached SRAM.
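
For example, a mixed placement could be expressed like this (the CCM section name is again an assumption):

    static const uint16_t gamma_lut[256] = { 0 /* ... */ };              /* const: stays in flash */
    static uint16_t recip_lut[1024] __attribute__((section(".ccmram"))); /* CCM, via the D-bus */
    static uint16_t window_lut[512];                                     /* default: cached SRAM */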

Now if this is a homework assignment your instructor will immediately spot what's wrong with the above statements....

   Jack Peacock
cyqdtc
Associate II
Posted on June 11, 2014 at 14:14

Dear Jack,

Thanks for your clarification; here is my plan:

Stack/heap: CCM

Tables: CCM; operations run between the tables (CCM) and the incoming data (SRAM)

Results: CCM; the results will be used in later processing

Questions:

DMA: the FIFO supports bursts of 4 words and also 8 half-words. Is word alignment really necessary?

Data cache: SRAM is zero-wait-state and not cached, I think. Do you mean the flash cache? There seems to be an 8x128-bit data cache.

Haha, thanks for your kind reminder, I can feel the warm care! But don't worry, I am doing my own project. And is there really a school teaching such practical stuff? I would love to apply to one.

With thanks and regards,

Richie

jpeacock2399
Associate II
Posted on June 11, 2014 at 15:51

There is a flash prefetch, which has to do with how internal flash is laid out, but there are also both an instruction (I) cache and a data (D) cache.  The D cache applies to the first SRAM bank but not to the second, 16KB bank.  The advantage of this is that if you are doing low-to-medium-speed DMA and only looking at the incoming data once, you can preserve the D cache.  If you have a small, tight DSP-style loop with one operand from the SRAM2 DMA buffer and the other from an SRAM1 table, the data cache will speed up the SRAM1 access time.  Yeah, it's nanoseconds, but they do add up.

The DMA FIFO is best used to reduce bus contention between the CPU and the DMA units.  Ideally you want 32-bit transfers aligned on word boundaries, but you can still benefit from half-word alignment.  The difference is that you get 4 half-words into the FIFO instead of 4 words, so twice the possible contention.
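
As a configuration sketch, assuming the F4 Standard Peripheral Library of that era and the ADC1-to-DMA2 stream 0 mapping (check the reference manual for your own peripheral's stream and channel):

    #include "stm32f4xx.h"

    static uint16_t adc_buf[512] __attribute__((aligned(1024)));

    void dma_burst_setup(void)
    {
        DMA_InitTypeDef init;

        RCC_AHB1PeriphClockCmd(RCC_AHB1Periph_DMA2, ENABLE);
        DMA_StructInit(&init);
        init.DMA_Channel            = DMA_Channel_0;                   /* ADC1 on DMA2 stream 0 */
        init.DMA_PeripheralBaseAddr = (uint32_t)&ADC1->DR;
        init.DMA_Memory0BaseAddr    = (uint32_t)adc_buf;
        init.DMA_DIR                = DMA_DIR_PeripheralToMemory;
        init.DMA_BufferSize         = 512;                             /* NDTR, in half-words */
        init.DMA_MemoryInc          = DMA_MemoryInc_Enable;
        init.DMA_PeripheralDataSize = DMA_PeripheralDataSize_HalfWord;
        init.DMA_MemoryDataSize     = DMA_MemoryDataSize_Word;         /* FIFO packs 2 samples/word */
        init.DMA_Mode               = DMA_Mode_Circular;
        init.DMA_FIFOMode           = DMA_FIFOMode_Enable;
        init.DMA_FIFOThreshold      = DMA_FIFOThreshold_Full;
        init.DMA_MemoryBurst        = DMA_MemoryBurst_INC4;            /* 4-word memory bursts */
        init.DMA_PeripheralBurst    = DMA_PeripheralBurst_Single;
        DMA_Init(DMA2_Stream0, &init);
        DMA_Cmd(DMA2_Stream0, ENABLE);
    }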

Contention may seem like a minor issue until you have a lot of background DMA running alongside the CPU.  Minimizing contention makes everything go faster.  Like cache, it's nanoseconds, but they still add up.

  Jack Peacock