[BUG] Missing compiler and CPU memory barriers

Piranha · ‎2020-02-01

NOTE! This is a bug report for all STM32 MCUs with Ethernet peripheral, but it has a highly valuable information for all software development on/for any platform.

Compiler barriers

The compiler is not required to keep the code order of non-volatile variables, even relative to volatile variables. This is well described in an article "Nine ways to break your systems code using volatile" section "5. Expecting volatile to enforce ordering with non-volatile accesses".

In a descriptor structure definition only Status member of the descriptor is qualified as volatile (__IO). When code sets ControlBufferSize, the compiler is not required to keep the code order of assignments and in compiled code setting OWN bit can be placed before writing ControlBufferSize.

CPU memory barriers

Even, when instructions are compiled in the intended order, the CPU is still not required to execute those in the compiled order. The Cortex-M7 processor can re-order memory transactions for efficiency, or perform speculative reads. Though at the moment of writing it is the only ARM Cortex-M core, which does it, but that can (and most likely will) change in a future. The solution is either to use DMB instruction or to configure descriptor memory as a device memory type with MPU. DSB instruction and strongly-ordered memory type also works, but are unnecessarily more restrictive. Also note that DTCM is always threated as normal memory type, regardless of MPU configuration.

In addition to the previous example, ignorance of this introduces even more bugs. When checking-restarting DMA operation, if descriptor memory is not configured as a device memory type or is located in DTCM, instruction reordering can also cause DMASR to be read before OWN bit has been actually written.

Solution

ARM has introduced a __COMPILER_BARRIER() macro, but that is currently unavailable in ST's shipped code, because ST is slow on updating even the most basic CMSIS-Core header files. However those header files have __DMB() and __DSB() macros (even in ST's currently shipped versions), which besides the respective CPU memory barrier instruction also include a compiler barrier. The performance impact of CPU memory barrier instruction without actual barrier effect is negligible - just one clock tick.

Therefore to make code correct for:

All compilers at all optimization levels
All Cortex-M cores, including Cortex-M7
All memory types, including DTCM
Memory configurations with or without MPU

The universal fix is to put __DMB() macro just before:

Setting OWN bit.
Reading other descriptor words after checking OWN bit.
Checking-resuming DMA operation.

Nick van IJzendoorn · ‎2021-01-23

Interesting read I was about to implement ethernet on the Cortex-M7 now. Nasty bug that is, could take some time to discover.

Did you also file a official bug report with ST in there ticketing system? I think more people here in the community have problems with ethernet on the Cortex-M7 because of it. :grinning_face: https://community.st.com/s/onlinesupport

Another thing i'm missing is D-Cache management if D-Cache is enabled in ETH_Transmit (SCB_CleanDCache_by_Addr on output buffer) and in ETH_Receive (SCB_InvalidateDCache_by_Addr on input buffer) ~~or is D-Cache management automatically done by the DMA controller of the MAC~~? To my knowledge this could result in corrupted packets being received or transmitted.

Edit: No D-Cache management is not automatically managed by the DMA controller. See posts:

http://lwip.100.n7.nabble.com/Problem-running-lwip-on-cortex-M7-with-D-cache-enabled-td31302.html

https://community.st.com/s/question/0D50X00009Xkgff/stm32f746-ethernet-dma-transmit-problem

https://www.freertos.org/FreeRTOS_Support_Forum_Archive/January_2017/freertos_FreeRTOS_TCP_on_STM32F7_03836162j.html

Nick van IJzendoorn · ‎2021-01-23

I think the best and easiest approach is something like this below. And maybe add __DMB() instructions / memory barriers around where Status is updated / read but I'll still have to validate that...

The only problem is that CubeMX will probably override all these files when you re-generate your project. So you must copy the files and exclude the original file from the builds.

I think this should work though. I tested it and it seems to work fine.

In ethernetif.c make the variable allocation look like this:

/* Private variables ---------------------------------------------------------*/
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
  #pragma data_alignment=4
#endif
__ALIGN_BEGIN ETH_DMADescTypeDef  DMARxDscrTab[ETH_RXBUFNB] __ALIGN_END __attribute__((section(".none_cached")));/* Ethernet Rx MA Descriptor */
 
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
  #pragma data_alignment=4
#endif
__ALIGN_BEGIN ETH_DMADescTypeDef  DMATxDscrTab[ETH_TXBUFNB] __ALIGN_END __attribute__((section(".none_cached")));/* Ethernet Tx DMA Descriptor */
 
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
  #pragma data_alignment=4
#endif
__ALIGN_BEGIN uint8_t Rx_Buff[ETH_RXBUFNB][ETH_RX_BUF_SIZE] __ALIGN_END __attribute__((section(".none_cached"))); /* Ethernet Receive Buffer */
 
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
  #pragma data_alignment=4
#endif
__ALIGN_BEGIN uint8_t Tx_Buff[ETH_TXBUFNB][ETH_TX_BUF_SIZE] __ALIGN_END __attribute__((section(".none_cached"))); /* Ethernet Transmit Buffer */

The the linker script add an extra memory region:

MEMORY
{
  RAM    (xrw)    : ORIGIN = 0x20000000,   LENGTH = 288K
  RAM_NC (rw)     : ORIGIN = 0x20048000,   LENGTH = 32K
  FLASH    (rx)    : ORIGIN = 0x8000000,   LENGTH = 512K
}

/* Uninitialized data section into "RAM_NC" Ram type memory */
  . = ALIGN(4);
  .bss_nc :
  {
    /* This is used by the startup in order to initialize the .bss section */
    _sbss_nc = .;         /* define a global symbol at bss start */
    __bss_nc_start__ = _sbss_nc;
    *(.none_cached)
    *(.none_cached*)
    *(COMMON)
 
    . = ALIGN(4);
    _ebss_nc = .;         /* define a global symbol at bss end */
    __bss_nc_end__ = _ebss_nc;
  } >RAM_NC

In startup_xxxxxxxxxxxxx.s add zero filling (this assembly before the bl SystemInit instruction):

/* Zero fill the none_cachce segment. */
  ldr  r2, =_sbss_nc
  b  LoopFillZerobssnc
FillZerobssnc:
  movs  r3, #0
  str  r3, [r2], #4
 
LoopFillZerobssnc:
  ldr  r3, = _ebss_nc
  cmp  r2, r3
  bcc  FillZerobssnc

In the same file after the .word _ebss

/* start address for the .none_cached section. defined in linker script */
.word  _sbss_nc
/* end address for the .none_cached section. defined in linker script */
.word  _ebss_nc

And in main.c before initializing all the peripherals:

// configure the MPU to exclude the section .none_cached from D-cache
 
  extern uint32_t _sbss_nc;
 
  HAL_MPU_Disable();
 
  // configure the MPU to not apply caching the the .none_cached section
  MPU_Region_InitTypeDef MPU_InitStruct;
  MPU_InitStruct.Enable = MPU_REGION_ENABLE;
  MPU_InitStruct.BaseAddress = (uint32_t) &_sbss_nc;
  MPU_InitStruct.Size = MPU_REGION_SIZE_32KB;
  MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
  MPU_InitStruct.IsBufferable = MPU_ACCESS_NOT_BUFFERABLE;
  MPU_InitStruct.IsCacheable = MPU_ACCESS_NOT_CACHEABLE;
  MPU_InitStruct.IsShareable = MPU_ACCESS_NOT_SHAREABLE;
  MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL2;
  MPU_InitStruct.Number = MPU_REGION_NUMBER1;
  MPU_InitStruct.SubRegionDisable = 0x00;
  MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;
  HAL_MPU_ConfigRegion(&MPU_InitStruct);
 
  HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);

Since the low_level_input and low_level_output copy data from and to the pbuf's which are still cached it should still be quite fast I think.

HWurs.1 · ‎2021-02-12

@Nick van IJzendoorn thanks. I'm still in the early stages of losing time on the STM32 ethernet problem, but I'll try to add to the above.

Observation: My STM32F767ZITx_FLASH.ld and startup_stm32f767xx.s files seem to be left alone when regenerating code from CubeMX 4.27 - 6.1

Hint: ethernetif.c variable attributes can be made persistent when sticking them to extern declarations.

/* Private variables ---------------------------------------------------------*/
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
  #pragma data_alignment=4   
#endif
__ALIGN_BEGIN ETH_DMADescTypeDef  DMARxDscrTab[ETH_RXBUFNB] __ALIGN_END;/* Ethernet Rx MA Descriptor */
 
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
  #pragma data_alignment=4   
#endif
__ALIGN_BEGIN ETH_DMADescTypeDef  DMATxDscrTab[ETH_TXBUFNB] __ALIGN_END;/* Ethernet Tx DMA Descriptor */
 
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
  #pragma data_alignment=4   
#endif
__ALIGN_BEGIN uint8_t Rx_Buff[ETH_RXBUFNB][ETH_RX_BUF_SIZE] __ALIGN_END; /* Ethernet Receive Buffer */
 
#if defined ( __ICCARM__ ) /*!< IAR Compiler */
  #pragma data_alignment=4   
#endif
__ALIGN_BEGIN uint8_t Tx_Buff[ETH_TXBUFNB][ETH_TX_BUF_SIZE] __ALIGN_END; /* Ethernet Transmit Buffer */
 
/* USER CODE BEGIN 2 */
extern ETH_DMADescTypeDef DMARxDscrTab[ETH_RXBUFNB] __attribute__((section(".none_cached")));
extern ETH_DMADescTypeDef DMATxDscrTab[ETH_TXBUFNB] __attribute__((section(".none_cached")));
extern uint8_t Rx_Buff[ETH_RXBUFNB][ETH_RX_BUF_SIZE] __attribute__((section(".none_cached*")));
extern uint8_t Tx_Buff[ETH_TXBUFNB][ETH_TX_BUF_SIZE] __attribute__((section(".none_cached*")));
/* USER CODE END 2 */

Question here would be whether the buffers shouldn't be at __attribute__((section(".none_cached*"))) for more definitive ordering?

.map result:

*(.none_cached)
 .none_cached   0x20078000      0x100 CMakeFiles/BSICS-SVR.elf.dir/Src/ethernetif.c.obj
                0x20078000                DMARxDscrTab
                0x20078080                DMATxDscrTab
 *(.none_cached*)
 .none_cached*  0x20078100     0x2fa0 CMakeFiles/BSICS-SVR.elf.dir/Src/ethernetif.c.obj
                0x20078100                Rx_Buff
                0x200798d0                Tx_Buff

In STM32F767ZITx_FLASH.ld and with RAM truncated by 32kB, I presume _estack (at 512kB it used to be 0x20080000) needs to be moved to the new end of RAM, which is now the start of RAM_NC.

/* Highest address of the user mode stack */
_estack = 0x20078000;    /* end of RAM ( = 0x20080000 - RAM_NC LENGTH ) */
 
/* Generate a link error if heap and stack don't fit into RAM */
_Min_Heap_Size = 0x200;  /* required amount of heap  */
_Min_Stack_Size = 0x400; /* required amount of stack */
 
/* Specify the memory areas */
MEMORY
{
RAM (xrw)       : ORIGIN = 0x20000000, LENGTH =  480K /* 512K - 32K */
RAM_NC (rw)     : ORIGIN = 0x20078000, LENGTH =   32K /* ORIGIN = 0x20000000 + 480 * 1024 */
FLASH (rx)      : ORIGIN = 0x00200000, LENGTH = 2048K
}

The webserver example shows an MPU configuration which covers all RAM_NC with MPU_REGION_NUMBER0, but overlays the first 256 Byte in RAM_NC with MPU_REGION_NUMBER1~~, and I cannot quite follow why~~. The 256 Byte region changes isBufferable to MPU_ACCESS_BUFFERABLE.

This appears to coincide with 2* sizeof(ETH_DMADescTypeDef) - is that a given or mere coincidence and subject to change with settings?

For completeness, here's MPU_Config() in main.c (it uses MPU_ACCESS_SHAREABLE just like the webserver example, does that matter?)

MPU_Region_InitTypeDef MPU_InitStruct = {0};
 
  /* Disables the MPU */
  HAL_MPU_Disable();
  /** Initializes and configures the Region and the memory to be protected
  */
  MPU_InitStruct.Enable = MPU_REGION_ENABLE;
  MPU_InitStruct.Number = MPU_REGION_NUMBER0;
  MPU_InitStruct.BaseAddress = (uint32_t) &_sbss_nc;
  MPU_InitStruct.Size = MPU_REGION_SIZE_32KB;
  MPU_InitStruct.SubRegionDisable = 0x0;
  MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL1;
  MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
  MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;
  MPU_InitStruct.IsShareable = MPU_ACCESS_SHAREABLE;
  MPU_InitStruct.IsCacheable = MPU_ACCESS_NOT_CACHEABLE;
  MPU_InitStruct.IsBufferable = MPU_ACCESS_NOT_BUFFERABLE;
 
  HAL_MPU_ConfigRegion(&MPU_InitStruct);
  /** Initializes and configures the Region and the memory to be protected
  */
  MPU_InitStruct.Enable = MPU_REGION_ENABLE;
  MPU_InitStruct.Number = MPU_REGION_NUMBER1;
  MPU_InitStruct.BaseAddress = (uint32_t) &_sbss_nc;
  MPU_InitStruct.Size = MPU_REGION_SIZE_256B;
  MPU_InitStruct.SubRegionDisable = 0x0;
  MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL0;
  MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
  MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;
  MPU_InitStruct.IsShareable = MPU_ACCESS_SHAREABLE;
  MPU_InitStruct.IsCacheable = MPU_ACCESS_NOT_CACHEABLE;
  MPU_InitStruct.IsBufferable = MPU_ACCESS_BUFFERABLE;
 
  HAL_MPU_ConfigRegion(&MPU_InitStruct);
  /* Enables the MPU */
  HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);

Nick van IJzendoorn · ‎2021-04-01

Sorry missed the email that a follow-up post was made...

I made the changes persistent by copying the file and postpend the name with _nc (thus creating ethernetif_nc.c) and excluded the original from the build process. Didn't know you could add the attributes to the extern declarations, neat trick.

I think the ordering doesn't matter (for the extra MPU configuration it is necessary though to make sure the DMA descriptors stay in the right place.) because if everything is ok everything will stay in bounds of the declared space. Why would you want to explicitly order them?

Also they will be added to the section in the order of declaration so since they are already in the order you enforced in the C file they will be ordered exactly the same (ps. I changed the names of the sections afterwards to be more uniform).

.nc_bss         0x0000000020048000     0x30b0
                0x0000000020048000                _sbss_nc = .
                0x0000000020048000                __bss_nc_start__ = _sbss_nc
 *(.nc_bss)
 .nc_bss        0x0000000020048000     0x30a0 LWIP/Target/ethernetif_nc.o
                0x0000000020048000                DMATxDscrTab
                0x0000000020048080                DMARxDscrTab
                0x0000000020048100                Tx_Buff
                0x00000000200498d0                Rx_Buff
 .nc_bss        0x000000002004b0a0       0x10 Source/analog/analog.o
 *(.nc_bss*)
 *(COMMON)
                0x000000002004b0b0                . = ALIGN (0x4)
                0x000000002004b0b0                _ebss_nc = .
                0x000000002004b0b0                __bss_nc_end__ = _ebss_nc

About the stackpointer, yes indeed it's wise to relocate the initial location of the stackpointer to the RAM region to leverage D-Cache. I myself moved the stackpointer to the DTCM region later on to take advantage of the 0-waitstate memory and may increase performance by not using D-Cache lines for the stack. This leaves more cache line's available for the bss and data region.

The MPU configuration is a little faster I think. I belief shareable informs the MPU that both the DMA controller and the processor may access the region simultaneously. If shareable is not set I think this is prohibited and they need to wait for each other when accessing the region.

I don't know why the made the DMA descriptors bufferable though, my guess is that this is wrong. Because if you make the descriptors bufferable you can get a race condition between updating the descriptors and setting the OWN bit again. I think at leased. But maybe I'm wrong.

More information about the inner workings of the MPU can be found here: https://www.st.com/resource/en/application_note/dm00272912-managing-memory-protection-unit-in-stm32-mcus-stmicroelectronics.pdf (AN4838) Maybe there is a better answer in there than mine. :D

jacksong · ‎2021-06-09

Several years ago, I also suffered some H7 ethernet issue, hope this post can also help some developers.

pls see: https://community.st.com/s/feed/0D50X0000B29EGQSQ2

HWurs.1 · ‎2021-06-22

I spent a few more days collecting information on the MPU. As this was my first in-depth look at MPU use, I put together a hopefully comprehensive introduction:

https://github.com/MisterHW/BSICS-SVR/blob/master/doc/readme_mpu.md

My starting point was an "STM32 MPU tips" online course - as there was a lot more information in the broken text-to-speech track, I fed it into otter.io and then manually fixed the script (pdf available).

Looking back at my post above, I'm noticing

the lack of background region to prevent speculative access to external memory
0x20078000 does not coincide with STM32F767 SRAM2 -> 0x2007C000
if one were to be serious about memory layout and use of SRAM2, STM32F7 examples should use SRAM2 and MPU_REGION_SIZE_16KB. All STM32F7 controllers have 16 kB SRAM2, so this is probably left-over confusion from STM32H7 code.
using 12448 Bytes SRAM2 for ethernet buffers + descriptors would leave 3936 Bytes for stack in SRAM2. Stack size in .ld is 0x400 for main() and ISRs. FreeRTOS allocates thread stacks from heap. Hence, 1 kB should be enough for the stack in SRAM2. Moving _estack to the end of SRAM1 still seems appropriate.

There are still many things left which I don't understand, so any feedback is welcome.

For example, I'm wondering what "inner and outer policies" are when there is only L1 cache.

I also couldn't make sense of the TEX field, there doesn't seem to be a pattern behind its bits. Most documentation refers to the table of allowed MPU configurations and recommends selecting a known good (TEX, C, B) tuple. Feels like optimized black-box behavior at this point.

> why they made the DMA descriptors bufferable

suspicion: here B=0: Strongly Ordered, B=1: Device memory. Only the latter allows unaligned access, removing an unnecessary restriction.

> it's wise to relocate the initial location of the stackpointer to the RAM region to leverage D-Cache. I myself moved the stackpointer to the DTCM region later on to take advantage of the 0-waitstate memory and may increase performance by not using D-Cache lines for the stack.

Neat, will consider. Luckily that's where the stack question separates from ethernet DMA memory, so I'll let future me figure it out :)

ps. I've also added (NOLOAD). After understanding what

Warn : no flash bank found for address 0x2007C000

wanted to tell me (and realizing that there is a 522 MB .bin file on account of 0x0 - 0x20080000 being covered). I considered this hint.

/* Uninitialized data section into "RAM_NC" Ram type memory */
  . = ALIGN(4);
  .bss_nc (NOLOAD) :
  {
    /* This is used by the startup in order to initialize the .bss_nc section */
    _sbss_nc = .;         /* define a global symbol at bss_nc start */
    __bss_nc_start__ = _sbss_nc;
    *(.nc_bss)
    *(.nc_bss*)
    *(COMMON)
 
    . = ALIGN(4);
    _ebss_nc = .;         /* define a global symbol at bss_nc end */
    __bss_nc_end__ = _ebss_nc;
  } >RAM_NC

LCE · ‎2021-11-12

Thank you so much for this post!

I had lwIP with http and PTP running on 2 STM32F767 Nucleo boards.

After fiddling with the PTP servo settings I got synchronization quite reliably around and below 100ns.

Then I found out I was working in compiler's DEBUG mode (with STM32CubeIDE), switched to release mode, worked with that a few days - but only with one Nucleo connected.

Then I flashed and connected the 2nd Nucleo with the relase version - and BAM! No more PTP sync.

First I thought it might be my PTP servo / timing stuff, put it back to default settings - no improvement whatsoever.

Then I checked the ethernet stuff and found out that in release mode:

in low_level_output() (calling HAL_ETH_TransmitFrame() (both modified for HW PTP) ) in lwIP's ethernetif.c
the checking of the transmit timestamp status (DmaTxDesc->Status & ETH_DMATXDESC_TTSS) had a VERY low success rate (timeout), but not in debug mode

Then I found some "__DMB()" in my code, which I put there after having found this post quite in the beginning of this project, so I researched again, found this post again, understood a little more about it, and followed the above instructions and inserted some more __DMB's.

I had already put DMB before / after most DMA / OWN uses, but not the __DMB() before resuming DMA transfers - and that did it!

Holy smokes, am I happy that this is working now in release mode.

THANKS!

kens0o · ‎2022-07-07

Hi,

regarding the new H7 firmware 1.10, I couldn't find code which did not use the __DMB() macro before setting the OWN bit. Also I couldn't find other descriptor words modified after checking the OWN bit.

Seems to be fixed, right?

But I can't figure out where the DMA transfers are resumed. I tried inserting the __DMB() macro in HAL_ETH_ErrorCallback() without success. Can you please give more detailed instructions for noobs like me?

Thanks!

Piranha · ‎2022-07-20

Yes, the missing memory barrier issue is fixed in the reworked drivers. I've updated my Ethernet/lwIP issue list accordingly.

I'll answer the second question in your topic:

https://community.st.com/s/question/0D53W00001dwZkUSAU/problems-with-touchgfx-generated-project-and-lwipethernet