How to use the memory protection unit to manage cache coherency on high-performance MCUs

Ruchit N
ST Employee

Summary

This article examines cache coherency issues specific to STM32 high-performance microcontrollers, describes the reasons behind their occurrence, and outlines how the memory protection unit (MPU) serves as the primary tool for configuring memory attributes to mitigate these problems. Practical implementation examples are also provided.

Introduction

High-performance STM32 MCUs, which include the STM32F7, STM32H7, and STM32N6 series, offer substantial performance gains, largely attributed to their integrated level 1 (L1) ICACHE and DCACHE. These caches minimize memory latency by storing frequently used instructions and data close to the CPU. However, this performance enhancement introduces a critical challenge known as cache coherency, especially in systems where multiple bus masters, such as the CPU and DMA controllers, access the same memory regions.

Figure 1. STM32H7Rx/7Sx system architecture, AN6062

1. Understanding cache and the coherency problem

  • Instruction cache (ICACHE): Stores recently fetched instructions. Speeds up code execution, especially loops.
  • Data cache (DCACHE): Stores recently accessed data. Accelerates data reads and writes. STM32 typically uses a write-back policy by default for cached SRAM, meaning writes might only update the cache initially, not the underlying RAM immediately.

Cache coherency issues arise when the view of memory differs between masters.

  • CPU-write/DMA-read scenario:
    1. The CPU writes data to a buffer in cacheable SRAM (for example, preparing data for UART transmission).
    2. With a write-back policy, the data might only be in the DCACHE, not yet written to SRAM.
    3. The DMA controller reads directly from SRAM to transmit the data.
    4. As a result, the DMA transmits stale data from SRAM because the up-to-date data is still held in the CPU’s DCACHE.
  • DMA-write/CPU-read scenario:
    1. The DMA controller writes incoming data directly into a buffer in SRAM (for example, ADC conversion results).
    2. The CPU might have previously read from this buffer, and the old data could still be resident (cached) in its DCACHE.
    3. The CPU reads the buffer to process the new data.
    4. As a result, the CPU reads the stale data from its DCACHE, unaware that the DMA has updated the underlying SRAM.

Failure to manage coherency leads to data corruption, unpredictable behavior, and notoriously difficult-to-debug errors.
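
The two scenarios above can be reproduced on a host PC with a toy model. The sketch below is not STM32 code: it assumes a single 32-byte line and a write-back, write-allocate policy, and every name in it (model_t, cpu_write, dma_read, and so on) is hypothetical and chosen for illustration only.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define LINE_SIZE 32u /* assumed D-cache line size */

/* Toy model: one SRAM line and one cache line that may shadow it. */
typedef struct {
    uint8_t sram[LINE_SIZE];  /* what the DMA controller sees */
    uint8_t cache[LINE_SIZE]; /* what the CPU sees while 'valid' is set */
    bool valid;               /* the cache holds a copy of the line */
    bool dirty;               /* the cached copy is newer than SRAM */
} model_t;

/* Fill the cache line from SRAM on a miss. */
static void line_fill(model_t *m) {
    if (!m->valid) {
        memcpy(m->cache, m->sram, LINE_SIZE);
        m->valid = true;
        m->dirty = false;
    }
}

/* CPU write with write-back, write-allocate: only the cache is updated. */
static void cpu_write(model_t *m, uint32_t off, uint8_t value) {
    line_fill(m);
    m->cache[off] = value;
    m->dirty = true;
}

/* CPU read: served from the cache once the line is resident. */
static uint8_t cpu_read(model_t *m, uint32_t off) {
    line_fill(m);
    return m->cache[off];
}

/* DMA accesses bypass the cache and touch SRAM only. */
static void dma_write(model_t *m, uint32_t off, uint8_t value) { m->sram[off] = value; }
static uint8_t dma_read(const model_t *m, uint32_t off) { return m->sram[off]; }

/* Clean: write a dirty line back to SRAM. */
static void cache_clean(model_t *m) {
    if (m->valid && m->dirty) {
        memcpy(m->sram, m->cache, LINE_SIZE);
        m->dirty = false;
    }
}

/* Invalidate: discard the cached copy so that the next read refills it. */
static void cache_invalidate(model_t *m) {
    m->valid = false;
    m->dirty = false;
}
```

In this model, dma_read() returns the old value after cpu_write() until cache_clean() is called, and cpu_read() returns the old value after dma_write() until cache_invalidate() is called. These are the same two operations that the real cache maintenance functions perform on address ranges.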

2. Role of the memory protection unit (MPU)

The MPU is a crucial component of the high-performance core. While its primary role is memory protection (preventing unauthorized access between memory regions), it is also the mechanism used to define memory attributes for different regions of the memory map. These attributes dictate how the memory system, including the caches, interacts with that region.

Key MPU attributes for cache coherency management:

  • Cacheability (C bit): Determines if a region can be cached.
  • Bufferability (B bit): Determines if writes can be buffered (allowing the CPU to continue execution before the write fully completes to memory).
  • Type extension (TEX bits): Combined with C and B bits, defines the detailed memory type and cache policy (strongly ordered, device, normal non-cacheable, write-through, write-back).
  • Shareability (S bit): Indicates if a region might be accessed by multiple masters. Setting a region as shareable often influences cache behavior, sometimes disabling caching entirely for that region, depending on the specific memory type configuration.

By configuring MPU regions covering shared memory areas (like DMA buffers), developers can dictate the caching behavior and enforce coherency.
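
As a reading aid for the TEX, C, and B combinations, the following host-side sketch decodes the most common encodings. It assumes the Armv7-M MPU attribute encoding used by the Cortex-M7 (STM32F7 and STM32H7), and should be checked against the core programming manual before relying on it. The enum and function names are hypothetical.

```c
#include <stdint.h>

/* Memory types selected by TEX/C/B (assumed Armv7-M encoding). */
typedef enum {
    MEM_STRONGLY_ORDERED,
    MEM_DEVICE_SHAREABLE,
    MEM_NORMAL_WT_NO_WA,      /* write-through, no write-allocate */
    MEM_NORMAL_WB_NO_WA,      /* write-back, no write-allocate */
    MEM_NORMAL_NON_CACHEABLE,
    MEM_NORMAL_WB_WA,         /* write-back, write- and read-allocate */
    MEM_DEVICE_NON_SHAREABLE,
    MEM_OTHER                 /* reserved, implementation defined, or TEX = 0b1xx */
} mem_type_t;

/* Decode the attribute bits of one MPU region into a memory type. */
static mem_type_t decode_mem_type(uint8_t tex, uint8_t c, uint8_t b) {
    /* Pack the bits as TEX[2:0]:C:B to get one lookup key. */
    uint8_t key = (uint8_t)(((tex & 7u) << 2) | ((c & 1u) << 1) | (b & 1u));

    switch (key) {
    case 0x00: return MEM_STRONGLY_ORDERED;     /* TEX=000 C=0 B=0 */
    case 0x01: return MEM_DEVICE_SHAREABLE;     /* TEX=000 C=0 B=1 */
    case 0x02: return MEM_NORMAL_WT_NO_WA;      /* TEX=000 C=1 B=0 */
    case 0x03: return MEM_NORMAL_WB_NO_WA;      /* TEX=000 C=1 B=1 */
    case 0x04: return MEM_NORMAL_NON_CACHEABLE; /* TEX=001 C=0 B=0 */
    case 0x07: return MEM_NORMAL_WB_WA;         /* TEX=001 C=1 B=1 */
    case 0x08: return MEM_DEVICE_NON_SHAREABLE; /* TEX=010 C=0 B=0 */
    default:   return MEM_OTHER;
    }
}
```

For example, decode_mem_type(0, 0, 1) yields the shareable device type, and decode_mem_type(1, 1, 1) yields normal write-back, write-allocate memory.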

3. Strategies for ensuring coherency using MPU

The primary strategies are described below, with examples based on the STM32H7. The examples assume that txBuffer and rxBuffer are located within a specific SRAM region (for example, starting at 0x30000000) and are aligned to 32 bytes.
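
One possible way to declare such buffers is sketched below. It assumes a C11 compiler and a 32-byte cache line. PAYLOAD_SIZE and CACHE_ALIGN_UP are hypothetical names, and placement at a fixed address such as 0x30000000 would additionally need a toolchain-specific section attribute and a matching linker script entry, which are not shown here.

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 32u /* assumed D-cache line size */

/* Round n up to the next multiple of the cache line size. */
#define CACHE_ALIGN_UP(n) (((n) + (CACHE_LINE_SIZE - 1u)) & ~(CACHE_LINE_SIZE - 1u))

#define PAYLOAD_SIZE 100u /* hypothetical payload size in bytes */

/* Each buffer starts on a line boundary and occupies whole lines only,
 * so no unrelated variable can share a cache line with it. */
static _Alignas(32) uint8_t txBuffer[CACHE_ALIGN_UP(PAYLOAD_SIZE)];
static _Alignas(32) uint8_t rxBuffer[CACHE_ALIGN_UP(PAYLOAD_SIZE)];

_Static_assert(sizeof(txBuffer) % CACHE_LINE_SIZE == 0u, "txBuffer must cover whole cache lines");
_Static_assert(sizeof(rxBuffer) % CACHE_LINE_SIZE == 0u, "rxBuffer must cover whole cache lines");
```

With a 100-byte payload, each array is padded to 128 bytes. The padding matters mainly for the cacheable strategies, where maintenance operations act on whole lines.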

3.1 Strategy 1: Non-cacheable/device memory (via MPU)

This is the simplest and safest approach: it prevents the CPU from caching the shared region altogether.

Configure the MPU region containing the DMA buffers as non-cacheable (for example, normal non-cacheable or, often preferred for peripherals/shared buffers, device memory). All CPU accesses bypass the DCACHE, ensuring that the CPU and DMA always see the same data in SRAM.

For the MPU configuration, device memory (TEX=0b000, C=0, B=1) provides stricter ordering and is often suitable for peripheral or shared buffers. The bufferable attribute allows writes to complete faster from the CPU’s perspective.

#include "stm32h7xx_hal.h" // Include appropriate HAL header

// Assume txBuffer/rxBuffer are within 1 KB starting at 0x30000000
#define SHARED_MEM_BASE ((uint32_t)0x30000000) // Example base address
#define SHARED_MEM_SIZE MPU_REGION_SIZE_1KB   // Example size

void MPU_Config_Device_Bufferable(void) {
  MPU_Region_InitTypeDef MPU_InitStruct = {0};

  /* Disable MPU before configuration */
  HAL_MPU_Disable();

  /* Configure the MPU region as Device Bufferable */
  MPU_InitStruct.Enable           = MPU_REGION_ENABLE;
  MPU_InitStruct.Number           = MPU_REGION_NUMBER0; // Use an available region number
  MPU_InitStruct.BaseAddress      = SHARED_MEM_BASE;
  MPU_InitStruct.Size             = SHARED_MEM_SIZE;
  MPU_InitStruct.SubRegionDisable = 0x00; // No subregions disabled
  MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS; // Full R/W access
  MPU_InitStruct.DisableExec      = MPU_INSTRUCTION_ACCESS_ENABLE; // Or DISABLE if no code executes here

  /* Memory Attributes: Device Bufferable */
  MPU_InitStruct.TypeExtField     = MPU_TEX_LEVEL0;           // TEX[2:0] = 000
  MPU_InitStruct.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE; // C = 0
  MPU_InitStruct.IsBufferable     = MPU_ACCESS_BUFFERABLE;    // B = 1
  MPU_InitStruct.IsShareable      = MPU_ACCESS_SHAREABLE;     // S = 1 (Recommended for Device)

  HAL_MPU_ConfigRegion(&MPU_InitStruct);

  /* Enable the MPU */
  HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT); // Or MPU_HFNMI_PRIVDEF
}

Pros:

  • Simplest to implement and understand.
  • Ensures coherency for the specified region.
  • Robust and least error prone.

Cons:

  • Degraded CPU performance when accessing the shared buffer, as caching benefits are lost for this region. This can become a bottleneck if the CPU frequently processes data in these buffers.

3.2 Strategy 2: Write-through cache policy via MPU configuration

This strategy is a compromise in which CPU writes update both the cache and main memory.

To do this, an MPU region is configured as cacheable with a write-through policy (TEX=0b000, C=1, B=0, S=0). This solves the CPU-write/DMA-read problem without software cleaning. However, the DMA-write/CPU-read problem remains and requires software invalidation.

MPU configuration for the write-through policy:

#include "stm32h7xx_hal.h"

#define SHARED_MEM_BASE ((uint32_t)0x30000000)
#define SHARED_MEM_SIZE MPU_REGION_SIZE_1KB

void MPU_Config_Cacheable_WT(void) {
  MPU_Region_InitTypeDef MPU_InitStruct = {0};

  HAL_MPU_Disable();

  /* Configure the MPU region as Cacheable Write-Through */
  MPU_InitStruct.Enable           = MPU_REGION_ENABLE;
  MPU_InitStruct.Number           = MPU_REGION_NUMBER1; // Use a different region number
  MPU_InitStruct.BaseAddress      = SHARED_MEM_BASE;
  MPU_InitStruct.Size             = SHARED_MEM_SIZE;
  MPU_InitStruct.SubRegionDisable = 0x00;
  MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
  MPU_InitStruct.DisableExec      = MPU_INSTRUCTION_ACCESS_ENABLE;

  /* Memory Attributes: Normal, Write-Through, No Write-Allocate */
  MPU_InitStruct.TypeExtField     = MPU_TEX_LEVEL0;           // TEX[2:0] = 000
  MPU_InitStruct.IsCacheable      = MPU_ACCESS_CACHEABLE;     // C = 1
  MPU_InitStruct.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;// B = 0
  MPU_InitStruct.IsShareable      = MPU_ACCESS_NOT_SHAREABLE; // S = 0

  HAL_MPU_ConfigRegion(&MPU_InitStruct);

  HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
}

Software maintenance:

  • SCB_CleanDCache_by_Addr(): Not required before the DMA reads CPU-written data.
  • SCB_InvalidateDCache_by_Addr(): Required before the CPU reads DMA-written data.
// In DMA RX Complete Callback:
// Invalidate D-Cache for the rxBuffer BEFORE CPU reads it
SCB_InvalidateDCache_by_Addr((uint32_t*)rxBuffer, BUFFER_SIZE);
// Now CPU can read rxBuffer

Pros:

  • Simplifies the CPU-Write/DMA-Read scenario.
  • CPU reads from the shared buffer still benefit from caching.

Cons:

  • Write performance is generally lower than write-back.
  • Requires software cache invalidation for the DMA-write/CPU-read case. Adds complexity compared to non-cacheable.

3.3 Strategy 3: Cacheable memory with software cache maintenance

This strategy aims for maximum CPU performance by allowing full caching (write-back, write-allocate; TEX=0b001, C=1, B=1, S=0), but it requires careful software management.

  • Mechanism: Configure the MPU region as cacheable (typically write-back, write-allocate for best performance). Software must explicitly use CMSIS functions SCB_CleanDCache_by_Addr() and SCB_InvalidateDCache_by_Addr() at the correct times.
  • MPU configuration (write-back):
#include "stm32h7xx_hal.h"

#define SHARED_MEM_BASE ((uint32_t)0x30000000)
#define SHARED_MEM_SIZE MPU_REGION_SIZE_1KB

void MPU_Config_Cacheable_WB(void) {
  MPU_Region_InitTypeDef MPU_InitStruct = {0};

  HAL_MPU_Disable();

  /* Configure the MPU region as Cacheable Write-Back */
  MPU_InitStruct.Enable           = MPU_REGION_ENABLE;
  MPU_InitStruct.Number           = MPU_REGION_NUMBER2; // Use another region number
  MPU_InitStruct.BaseAddress      = SHARED_MEM_BASE;
  MPU_InitStruct.Size             = SHARED_MEM_SIZE;
  MPU_InitStruct.SubRegionDisable = 0x00;
  MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
  MPU_InitStruct.DisableExec      = MPU_INSTRUCTION_ACCESS_ENABLE;

  /* Memory Attributes: Normal, Write-Back, Write-Allocate */
  MPU_InitStruct.TypeExtField     = MPU_TEX_LEVEL1;           // TEX[2:0] = 001
  MPU_InitStruct.IsCacheable      = MPU_ACCESS_CACHEABLE;     // C = 1
  MPU_InitStruct.IsBufferable     = MPU_ACCESS_BUFFERABLE;    // B = 1
  MPU_InitStruct.IsShareable      = MPU_ACCESS_NOT_SHAREABLE; // S = 0

  HAL_MPU_ConfigRegion(&MPU_InitStruct);

  HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);

  // Note: CMSIS SCB_EnableDCache() already invalidates the whole D-Cache
  // before enabling it. Avoid calling SCB_InvalidateDCache() while the cache
  // is enabled and may hold dirty lines, as their contents would be lost.
}

Software maintenance:

  • SCB_CleanDCache_by_Addr(): Required before the DMA reads CPU-written data.
  • SCB_InvalidateDCache_by_Addr(): Required before the CPU reads DMA-written data.
#define BUFFER_SIZE 128
// Assume txBuffer/rxBuffer declared and aligned(32)

// Before starting DMA TX:
// CPU writes to txBuffer...
SCB_CleanDCache_by_Addr((uint32_t*)txBuffer, BUFFER_SIZE);
// Start DMA TX...

// In DMA RX Complete Callback:
// DMA finishes writing to rxBuffer...
SCB_InvalidateDCache_by_Addr((uint32_t*)rxBuffer, BUFFER_SIZE);
// CPU reads rxBuffer...
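
The maintenance rules of the three strategies can be condensed into one small lookup, shown here as a host-side sketch. The enum and function names are hypothetical. The function only restates the rules given in this article and does not call any CMSIS or HAL function.

```c
/* Cache policy of the MPU region that holds the shared buffer. */
typedef enum {
    POLICY_NON_CACHEABLE, /* Strategy 1 */
    POLICY_WRITE_THROUGH, /* Strategy 2 */
    POLICY_WRITE_BACK     /* Strategy 3 */
} cache_policy_t;

/* Direction of the data exchange between CPU and DMA. */
typedef enum {
    CPU_WRITES_DMA_READS, /* for example, a TX buffer */
    DMA_WRITES_CPU_READS  /* for example, an RX buffer */
} transfer_dir_t;

/* Software maintenance required around the transfer. */
typedef enum {
    MAINT_NONE,
    MAINT_CLEAN_BEFORE_DMA_START,    /* SCB_CleanDCache_by_Addr() */
    MAINT_INVALIDATE_BEFORE_CPU_READ /* SCB_InvalidateDCache_by_Addr() */
} maintenance_t;

/* Return the maintenance step needed for a policy and transfer direction. */
static maintenance_t required_maintenance(cache_policy_t policy, transfer_dir_t dir) {
    if (policy == POLICY_NON_CACHEABLE) {
        return MAINT_NONE; /* CPU accesses never go through the D-cache */
    }
    if (dir == DMA_WRITES_CPU_READS) {
        return MAINT_INVALIDATE_BEFORE_CPU_READ; /* both cacheable policies */
    }
    /* CPU writes, DMA reads: only write-back can leave data in the cache. */
    return (policy == POLICY_WRITE_BACK) ? MAINT_CLEAN_BEFORE_DMA_START : MAINT_NONE;
}
```

For example, required_maintenance(POLICY_WRITE_THROUGH, CPU_WRITES_DMA_READS) returns MAINT_NONE, whereas the same direction with POLICY_WRITE_BACK returns MAINT_CLEAN_BEFORE_DMA_START.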

Pros:

  • Potentially the highest performance if the CPU frequently accesses the shared buffer outside of DMA transfers.

Cons:

  • The most complex and error-prone strategy.
  • Requires precise placement and sizing of Clean/Invalidate calls.
  • Cache maintenance operations introduce CPU overhead.
  • Requires careful consideration of buffer alignment relative to the 32-byte cache line size.
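
To illustrate the alignment concern, the sketch below computes which cache lines a buffer actually touches. It is host-side arithmetic only and assumes a 32-byte line. cache_range_t, cache_range(), and cache_range_is_exclusive() are hypothetical helpers, not CMSIS functions.

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 32u /* assumed D-cache line size */

/* Line-aligned address range that a maintenance operation acts on. */
typedef struct {
    uintptr_t start; /* rounded down to a line boundary */
    size_t    size;  /* multiple of CACHE_LINE_SIZE */
} cache_range_t;

/* Expand [addr, addr + len) to whole cache lines. */
static cache_range_t cache_range(uintptr_t addr, size_t len) {
    const uintptr_t mask = (uintptr_t)(CACHE_LINE_SIZE - 1u);
    cache_range_t r;

    r.start = addr & ~mask;
    /* Round the end address up to the next line boundary. */
    r.size = (size_t)(((addr + len + mask) & ~mask) - r.start);
    return r;
}

/* Nonzero when the buffer shares no cache line with other data. */
static int cache_range_is_exclusive(uintptr_t addr, size_t len) {
    return (addr % CACHE_LINE_SIZE == 0u) && (len % CACHE_LINE_SIZE == 0u);
}
```

A 40-byte buffer at 0x30000010 touches two lines (0x30000000 to 0x3000003F), so invalidating it would also discard any neighboring variables that live in those lines. A 128-byte buffer at 0x30000000 owns its four lines exclusively, which is the layout the examples in this article assume.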

4. Practical implementation considerations

  • MPU initialization: Configure the MPU very early in the boot process, typically within SystemInit() or a dedicated function called immediately after, before caches are enabled and before peripherals using DMA are fully initialized. STM32CubeMX can generate the initial MPU setup code.
  • Region alignment & size: On Cortex-M7 devices (Armv7-M MPU), MPU region base addresses must be aligned to their size, and sizes must be powers of 2 (minimum 32 bytes). Ensure your shared buffers fit within appropriately configured MPU regions. Consider placing shared buffers in dedicated sections using linker scripts for easier MPU management.
  • Memory types: Be aware of different SRAM regions on your specific STM32H7 (DTCM, ITCM, AXI SRAM, SRAM 1/2/3/4). Tightly Coupled Memory (TCM) is generally not cached or accessed by DMA. AXI SRAM and other SRAMs are the usual candidates for shared buffers needing MPU configuration for coherency. Consult the device reference manual for memory maps and DMA accessibility.
  • Debugging:
    • Temporarily disable the DCACHE globally (SCB_DisableDCache()) to check if the issue disappears - this strongly points to a coherency problem.
    • Use the debugger to inspect memory content directly in SRAM versus potentially cached values (if the debugger supports cache views).
    • Set breakpoints before/after DMA transfers and cache maintenance calls to verify buffer contents and operation execution.
    • Ensure Clean/Invalidate calls target the correct address range and size, covering the entire modified or accessed area.
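
The region alignment and size rule can be checked with a few lines of host-side C before a configuration is committed to firmware. This sketch assumes the Armv7-M MPU rules (power-of-2 size of at least 32 bytes, base aligned to the size) and the usual SIZE field encoding, where the region size is 2^(SIZE+1) bytes. Both function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* True when size is a power of two between 32 bytes and 4 GB
 * and base is a multiple of size. */
static bool mpu_region_is_valid(uint32_t base, uint64_t size) {
    if (size < 32u || size > (1ull << 32)) {
        return false;
    }
    if ((size & (size - 1u)) != 0u) {
        return false; /* not a power of two */
    }
    return (base & (uint32_t)(size - 1u)) == 0u;
}

/* SIZE field for a valid region size: smallest n with 2^(n+1) >= size. */
static uint8_t mpu_size_field(uint64_t size) {
    uint8_t n = 0;
    while ((1ull << (n + 1u)) < size) {
        n++;
    }
    return n;
}
```

For the 1 KB region at 0x30000000 used in the examples, mpu_region_is_valid() returns true and mpu_size_field() returns 9, which is expected to correspond to MPU_REGION_SIZE_1KB in the HAL header (worth confirming for the HAL version in use). A base of 0x30000200 with the same size would be rejected.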

Conclusion

The L1 caches on the STM32 are vital for performance but necessitate careful management of cache coherency, particularly when DMA controllers share memory with the CPU. The MPU is the essential tool for defining memory region attributes and implementing a coherency strategy.

The choice of configuration for the shared region depends on the application’s specific requirements, data access patterns, and the developer’s tolerance for complexity. Each configuration has its pros and cons. For example, configuring the shared region as non-cacheable is simple and safe, but CPU access to it is potentially slower. In contrast, a fully cacheable region with software maintenance offers the highest potential performance but is complex to manage.

Regardless of the chosen strategy, meticulous MPU configuration and thorough testing are essential for building robust and reliable STM32 high-performance applications.

Version history
Last update: 2025-12-15 2:25 AM