
M7 and H7 cache (sharable performance)

Franzi.Edo
Senior
Posted on November 27, 2017 at 11:15

Dear All,

I have some questions concerning the caches of the Cortex-M7 and H7 microcontrollers.

I use an external SDRAM with both the data and instruction caches enabled. If I define the region as 'shareable' in the MPU, system performance drops by a factor of 3, just as if the cache were not enabled. Does anyone have experience with this? Here is my MPU initialization.

Shareable ... but bad performance

static void _MPU_Configuration(void) {

      MPU->CTRL = 0x00000000;     // Disable the MPU

// Attributes for the SDRAM area (0xD0000000)
      MPU->RNR  = 0x00000000;     // Region 0
      MPU->RBAR = 0xD0000000;     // Base address
      MPU->RASR = (0<<28)         // XN: 0 executable
                | (3<<24)         // AP: 011 read-write (full access)
                | (0<<19)         // TEX: 000 normal
                | (1<<18)         // S: 1 shareable
                | (1<<17)         // C: 1 cacheable
                | (0<<16)         // B: 0 non-bufferable (with TEX=000, C=1: write-through)
                | (0<<8)          // SRD: 0, all sub-regions enabled
                | (22<<1)         // SIZE: 22, i.e. 2^(22+1) bytes = 8 MB
                | (1<<0);         // Region enabled

      MPU->CTRL = (1<<2)          // PRIVDEFENA: default map as background region (privileged)
                | (1<<1)          // HFNMIENA: MPU stays enabled in HardFault and NMI handlers
                | (1<<0);         // ENABLE: MPU enabled

      MEMO_SYNC_BARRIER;
      DATA_SYNC_BARRIER;
      INST_SYNC_BARRIER;

// Enable branch prediction
// Normally not necessary (always on)
      SCB->CCR |= (1<<18);
      DATA_SYNC_BARRIER;
}

Non-shareable ... and great performance

static void _MPU_Configuration(void) {

      MPU->CTRL = 0x00000000;     // Disable the MPU

// Attributes for the SDRAM area (0xD0000000)
      MPU->RNR  = 0x00000000;     // Region 0
      MPU->RBAR = 0xD0000000;     // Base address
      MPU->RASR = (0<<28)         // XN: 0 executable
                | (3<<24)         // AP: 011 read-write (full access)
                | (0<<19)         // TEX: 000 normal
                | (0<<18)         // S: 0 non-shareable
                | (1<<17)         // C: 1 cacheable
                | (0<<16)         // B: 0 non-bufferable (with TEX=000, C=1: write-through)
                | (0<<8)          // SRD: 0, all sub-regions enabled
                | (22<<1)         // SIZE: 22, i.e. 2^(22+1) bytes = 8 MB
                | (1<<0);         // Region enabled

      MPU->CTRL = (1<<2)          // PRIVDEFENA: default map as background region (privileged)
                | (1<<1)          // HFNMIENA: MPU stays enabled in HardFault and NMI handlers
                | (1<<0);         // ENABLE: MPU enabled

      MEMO_SYNC_BARRIER;
      DATA_SYNC_BARRIER;
      INST_SYNC_BARRIER;

// Enable branch prediction
// Normally not necessary (always on)
      SCB->CCR |= (1<<18);
      DATA_SYNC_BARRIER;
}
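
To make the difference between the two configurations explicit, here is a small host-side sketch that builds the same RASR value from named fields. The macro and function names (RASR_*_POS, mpu_size_field, sdram_rasr) are invented for illustration and are not CMSIS names; the bit positions are the ones used in the listings above.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions of the MPU_RASR fields used in the listings above. */
#define RASR_XN_POS   28u
#define RASR_AP_POS   24u
#define RASR_TEX_POS  19u
#define RASR_S_POS    18u
#define RASR_C_POS    17u
#define RASR_B_POS    16u
#define RASR_SRD_POS   8u
#define RASR_SIZE_POS  1u
#define RASR_ENABLE    1u

/* SIZE field for a power-of-two region: region size = 2^(SIZE+1) bytes. */
static uint32_t mpu_size_field(uint32_t bytes)
{
    /* Regions must be a power of two and at least 32 bytes. */
    assert(bytes >= 32u && (bytes & (bytes - 1u)) == 0u);
    uint32_t exponent = 0u;
    while ((bytes >> exponent) != 1u) {
        exponent++;
    }
    return exponent - 1u;
}

/* RASR for the 8 MB SDRAM region; only the S bit varies. */
static uint32_t sdram_rasr(uint32_t shareable)
{
    return (0u << RASR_XN_POS)                 /* executable            */
         | (3u << RASR_AP_POS)                 /* full access           */
         | (0u << RASR_TEX_POS)                /* TEX = 000             */
         | ((shareable & 1u) << RASR_S_POS)    /* S bit under test      */
         | (1u << RASR_C_POS)                  /* cacheable             */
         | (0u << RASR_B_POS)                  /* non-bufferable        */
         | (0u << RASR_SRD_POS)                /* all sub-regions on    */
         | (mpu_size_field(8u * 1024u * 1024u) << RASR_SIZE_POS)
         | RASR_ENABLE;
}
```

With these definitions, sdram_rasr(1) and sdram_rasr(0) differ only in bit 18, so the measured factor of 3 comes from the S bit alone.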

Thank you for your advice,

Best regards

   Edo.

#m7-and-h7-cache-(sharable-or-not-)
11 REPLIES
Nesrine M_O
Lead II
Posted on November 27, 2017 at 13:05

Hi Franzi.Edo,

I recommend having a look at this application note:

http://www.st.com/content/ccc/resource/technical/document/application_note/group0/0d/b5/e7/b7/47/0c/4a/ae/DM00306681/files/DM003066pdf/jcr:content/translations/en.DM003066pdf

The application note is provided with this embedded software package:

http://www.st.com/content/st_com/en/products/embedded-software/mcus-embedded-software/stm32-embedded-software/stm32cube-embedded-software-expansion/x-cube-perf-h7.html

The package includes the stm32h7x3_cpu_perf project, which demonstrates the performance of CPU memory accesses in different configurations, with code execution and data storage in different memory locations, using the L1 cache.

-Nesrine-

Posted on November 27, 2017 at 13:32

From the Cortex-M7 TRM:

"By default, only Normal, Non-shareable memory regions can be cached in the RAMs. Caching only takes place if the appropriate cache is enabled and the memory type is cacheable. Shared cacheable memory regions can be cached if CACR.SIWT is set to 1."
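
As a rough sketch of what setting that bit could look like: SIWT is bit 0 of CACR in the TRM. The helper below takes a pointer to the register so that it can be exercised on a host; the names CACR_SIWT and cacr_enable_shared_write_through are made up for this example. When it is safe to change CACR relative to enabling the data cache should be checked in the TRM; setting it before the D-cache is enabled is the cautious assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CACR.SIWT: shared cacheable regions are treated as write-through. */
#define CACR_SIWT (1u << 0)

/* Set SIWT without disturbing the other CACR bits.
 * 'cacr' points at the register (or at a stand-in variable on a host). */
static void cacr_enable_shared_write_through(volatile uint32_t *cacr)
{
    assert(cacr != NULL);
    *cacr |= CACR_SIWT;
}
```

On a target built with the CMSIS Cortex-M7 core header, the call would presumably be cacr_enable_shared_write_through(&SCB->CACR); followed by a DSB and an ISB.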

JW

Posted on November 27, 2017 at 16:21

Thank you Jan. That clarifies the situation.

Best regards

  Edo

Posted on November 28, 2017 at 01:28

What is the result? Is it a fix, or is there a problem?

Sharing the cache? Why would you need to do that? Do you share with one or more DMAs or programs? Are you discussing the security side? Are you trying to offer APIs but hide the code?

Sorry that I don't understand, but I would like to.

I have made a 208-pin PCB where the H7 seems to be a pin-for-pin, function-for-function drop-in replacement.

Posted on November 29, 2017 at 17:14

Hi Nick,

I'm not sure it is a problem. I just want to understand why setting the S (shareable) bit in the MPU for my SDRAM area costs me a factor of 3 in performance. The behavior is the same on the M7 and the H7. According to the ARM document 'DDI0403D arm architecture v7m ...', it is of course allowed to define an external SDRAM as cacheable and shareable. In my case I have a multi-master system (CPU + DMAs) sharing the memory. The effect of this bit is not clearly described; I simply measure this huge difference in system performance and would like to understand it.

Edo

Posted on November 29, 2017 at 17:54

Cache coherency across multiple actors (i.e. other CPUs, FPGA, external DMA, etc.) is VERY difficult and requires high gate counts. It is also usually the source of long lists of corner-case errata.

ARM has generally avoided these issues by going to the memory in these cases, thus exposing the worst-case performance of those memories. External memory is slow, and SDRAM has significant recycling latency.

Also, if things aren't actually 'memories', in-order completion and write-combining become issues.

Basically, if you are DMAing into memory and there are no coherency methods, you're stuck eating the performance of the memory you're doing it on. Use internal memory if you want/expect speed.

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..
Franzi.Edo
Senior
Posted on November 29, 2017 at 19:20

Hi Clive,

Thank you for your comment. The gate count is not enough to explain such bad performance. In my tests, I set the Shareable bit but use only one master, the CPU (the DMA is disabled). The performance of the cached SDRAM without the Shareable bit is great (10-15% slower than the internal SRAM).

The real effect of this bit on the memory bandwidth is still not clear to me.

Anyway, thanks

  Edo

Posted on November 29, 2017 at 20:33

On F4 parts, which don't cache SDRAM, execution speed is 6x slower. Try doing an XCHG to memory on your x86 box for some whiplash-inducing stalls.

Perhaps there are some 'ARM Certified Engineers' who can better explain the ramifications, or expected performance, of 'shareable'.

If using DMA there are going to be coherency issues. You really want to use internal single-cycle memory for those tasks, and save the cache for computationally important data on slower memory.
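
For anyone who does keep DMA buffers in cached memory anyway, the usual software approach is explicit cache maintenance on whole cache lines. The sketch below only computes the line-aligned range covering a buffer, assuming the 32-byte L1 data cache line of the Cortex-M7; cache_range_t and cache_align_range are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u   /* Cortex-M7 L1 data cache line, in bytes */

typedef struct {
    uintptr_t start;    /* line-aligned start address        */
    size_t    length;   /* length, a multiple of a line size */
} cache_range_t;

/* Expand [addr, addr + len) so that it covers whole cache lines. */
static cache_range_t cache_align_range(uintptr_t addr, size_t len)
{
    assert(len > 0u);
    uintptr_t mask  = (uintptr_t)DCACHE_LINE_SIZE - 1u;
    uintptr_t start = addr & ~mask;                 /* round down */
    uintptr_t end   = (addr + len + mask) & ~mask;  /* round up   */
    cache_range_t range = { start, (size_t)(end - start) };
    return range;
}
```

On a target, the resulting range would be handed to the CMSIS routines SCB_CleanDCache_by_Addr (before a DMA reads the buffer) or SCB_InvalidateDCache_by_Addr (after a DMA has written it). A buffer that shares a cache line with unrelated data can still be corrupted by an invalidate, so aligning and padding the buffer itself to 32 bytes is the safer design.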

Posted on November 29, 2017 at 23:02

Don't expect any more sophistication than what the quote from the TRM above says: if you set the memory type to Shareable in the MPU, it's *not* cached (unless the CACR.SIWT bit mentioned above is set; and even then caching is only write-through). Full stop.

If you want to read about all the ramifications of the Shareable tag, read that TRM and the ARMv7-M ARM (Architecture Reference Manual - quite a funny name); but as far as caching is concerned, this is all.
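
The rule can be written down as a tiny decision function. This is only a model of the behavior quoted from the TRM, for the data cache and assuming the cache itself is enabled; the type and function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    POLICY_NONCACHED,
    POLICY_WRITE_THROUGH,
    POLICY_WRITE_BACK
} cache_policy_t;

/* Effective data-cache policy of a Normal memory region.
 * mpu_policy: policy requested through the TEX/C/B bits
 * shareable : S bit of the region
 * siwt      : state of CACR.SIWT                                  */
static cache_policy_t effective_policy(cache_policy_t mpu_policy,
                                       bool shareable, bool siwt)
{
    assert(mpu_policy <= POLICY_WRITE_BACK);
    if (mpu_policy == POLICY_NONCACHED) {
        return POLICY_NONCACHED;      /* never cached                 */
    }
    if (!shareable) {
        return mpu_policy;            /* cached as requested          */
    }
    /* Shareable: not cached by default, write-through if SIWT = 1.   */
    return siwt ? POLICY_WRITE_THROUGH : POLICY_NONCACHED;
}
```

For the configuration in the original post (cacheable, shareable, SIWT left at 0) this returns POLICY_NONCACHED, which matches the measured behavior of "just like if the cache is not activated".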

"In my tests, I set the bit Shareable but I use only one master, the CPU (the DMA is disabled)."

The cache is inside the processor, but the processor has no idea of whatever is beyond its bus interfaces (AXIM, TCM). It does not know how the memories are connected. It does not know what else may be connected to the memories (through a conflict-resolving bus), potentially changing its content 'while the processor is not looking' (read: resulting in cache incoherence).

For the processor *every* memory is external, even those on-chip; and every addressed resource is a memory, even if its access has side effects (as in peripheral registers). This is exactly why the MPU is there: to assign areas which are 'special'.

Real performance does not come by mägic, but by thorough understanding of the system at hand... and learning, learning, learning...

JW