2026-02-27 11:41 PM
Hello everyone,
If you've ever tried running Azure RTOS with an HTTP server, an FTP server, multiple TCP sockets, and a handful of threads on a standard NUCLEO-144 board, you know exactly where this is going.
You hit the memory wall.
1 MB of internal RAM sounds like a lot — until ThreadX, NetXDuo, and FileX each want their own byte pools, packet pools, and thread stacks. Add an HTTP server for a web interface, an FTP server for remote file access, and you're fighting for every kilobyte before your actual application even starts. And storage? The NUCLEO boards don't have any. If you want to serve a web dashboard or datalog sensor readings, you're out of luck.
Then there's the documentation gap. ST provides excellent examples for HTTP servers with NetXDuo. FTP gets reasonable coverage. But WebSockets? Real-time bidirectional communication between a browser and your STM32? Good luck finding an ST example for that. And it's the one thing that turns a static web page into a live control interface.
This post covers how I solved both problems — the memory wall and the WebSocket gap — with a custom memory expansion shield and a from-scratch RFC 6455 WebSocket server running on ThreadX and NetXDuo. After some time tinkering, I'd like to let @AS5, @mbarg.1, and @Paragon10 know about my WebSocket protocol solution — your posts and questions on these topics helped shape its direction. The full source code is available on GitHub, and this writeup explains the key design decisions, the implementation details, and the gotchas I ran into along the way.
Rather than keep working around the NUCLEO-144 limitations, I designed a purpose-built memory expansion shield that plugs directly into the Morpho headers.
[INSERT IMAGE: pic_01.png — PCB front and back]
The NUCLEO-MEM carries:
- 8 MB of APS6408L PSRAM on OCTOSPI, used for the RTOS byte pools and packet pools
- eMMC storage on SDMMC, holding a FAT filesystem (FileX) for web content and datalogging
In the photos you'll also see an SSD1306 OLED display (128×32, I2C) — that's a separate module plugged into the NUCLEO-MEM headers, not part of the shield itself. It shows network status, IP address, and runtime information.
The NUCLEO-MEM requires an MCU with an OCTOSPI peripheral for the PSRAM interface. It has been validated on the NUCLEO-H723ZG (STM32H723, Cortex-M7, 550 MHz). MCUs without OCTOSPI — like the STM32H753 with its single QUADSPI limited to flash memory — can't use the PSRAM portion. For those boards, I have an earlier EMMCxxG-MV28 shield that carries only the eMMC storage — no PSRAM, but it gives any MCU with an SDMMC peripheral access to 8 GB of FAT-based file storage for datalogging, web hosting, or automated data offload via FTP.
Designed in Altium Designer, four-layer PCB, all 0402 passives, impedance-controlled memory traces.
The firmware runs three network servers simultaneously on ThreadX, each with a specific role:
HTTP delivers the user interface. FTP maintains it. WebSocket makes it live.
When a user opens the dashboard in a browser, the HTTP server delivers the HTML, CSS, and JavaScript from eMMC. The JavaScript immediately opens a WebSocket connection back to the board on port 8080. From that point on, the browser and the MCU are talking directly — telemetry flows from the board to the browser (configurable rate, ~1 Hz in the demo), and commands flow from the browser to the board instantly. No polling, no page refreshes.
This same WebSocket architecture scales from simple demos to real hardware testing. The screenshots above show the demo project (LED control + telemetry). The screenshot below shows the same platform driving a wireless power transfer test bench — a 24–40 V boost converter with a full H-bridge generating 900 VAC at 160 kHz for magnetic field power transfer.
The demo project in the GitHub repo is intentionally simple, so the WebSocket implementation is easy to follow. Your application replaces the LED commands and telemetry values with whatever your hardware actually does.
Here's where the memory wall becomes concrete. This is the linker script's MEMORY block for the NUCLEO-H723ZG:
MEMORY
{
FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 896K
DTCMRAM (xrw) : ORIGIN = 0x20000000, LENGTH = 128K
RAM (xrw) : ORIGIN = 0x24000A00, LENGTH = 325120
/* PSRAM — ThreadX heap */
Mem_TxPool (xrw) : ORIGIN = 0x90000000, LENGTH = 500K
Mem_FxPool (xrw) : ORIGIN = 0x9007D000, LENGTH = 500K
Mem_NxPool (xrw) : ORIGIN = 0x900FA000, LENGTH = 500K
Mem_FTP_Server (xrw) : ORIGIN = 0x90177000, LENGTH = 100K
...
}

The AXI SRAM (RAM) is only ~317 KB. The three RTOS byte pools alone need 1.5 MB:
#define TX_APP_MEM_POOL_SIZE 1024 * 500 // 500 KB — ThreadX threads & stacks
#define FX_APP_MEM_POOL_SIZE 1024 * 500 // 500 KB — FileX media buffers
#define NX_APP_MEM_POOL_SIZE 1024 * 500 // 500 KB — NetXDuo IP, ARP, server stacks

That's already more than the H723's total internal RAM. These pools are placed in external PSRAM via linker section attributes:
static UCHAR __attribute__((section(".TxPoolSection"))) tx_byte_pool_buffer[TX_APP_MEM_POOL_SIZE];
static UCHAR __attribute__((section(".FxPoolSection"))) fx_byte_pool_buffer[FX_APP_MEM_POOL_SIZE];
static UCHAR __attribute__((section(".NxPoolSection"))) nx_byte_pool_buffer[NX_APP_MEM_POOL_SIZE];Without external PSRAM, you simply cannot run HTTP + FTP + WebSocket simultaneously on a NUCLEO-144 board. The internal RAM runs out before your application even starts.
The natural first approach is to put everything in PSRAM — there's 8 MB, plenty of room. And that's what I did initially: all three server packet pools lived in PSRAM linker sections. The byte pools worked fine. The FTP server worked perfectly.
Then the HTTP server started misbehaving. Pages would load partially, CSS files arrived corrupted, requests would silently time out. Same memory region as FTP, same bus, completely different behavior.
The answer is in the NetXDuo Ethernet driver. When NetXDuo transmits a packet, the driver hands the packet buffer pointer directly to the Ethernet DMA hardware:
// nx_stm32_eth_driver.c — _nx_driver_hardware_packet_send()
Txbuffer[i].buffer = pktIdx->nx_packet_prepend_ptr; // this pointer goes to ETH DMA
HAL_ETH_Transmit_IT(&eth_handle, &TxPacketCfg); // DMA reads from that pointer

If that packet pool is in PSRAM (behind the OCTOSPI controller at 0x90000000), the Ethernet DMA must read through OCTOSPI. For sequential transfers like FTP, this works fine. But a browser opening a dashboard fires 10–15 concurrent requests — HTML, CSS, JS, images, fonts — generating bursts of small packets from different thread contexts, all competing for the OCTOSPI bus. Under that interleaved access pattern, packets get corrupted or the DMA stalls.
FTP works from PSRAM — sequential file transfers, clean access pattern.
HTTP fails from PSRAM — concurrent small packets, interleaved OCTOSPI access.
WebSocket has the same problem — frequent small JSON frames through the TCP stack.
The solution has two parts.
Part 1 — Move packet pools to AXI SRAM. The high-throughput packet pools go into AXI SRAM (.dma_buffer linker section) where the Ethernet DMA can access them directly. The FTP pool stays in PSRAM since its sequential access pattern works fine there:
// AXI SRAM — ETH DMA can access directly, no OCTOSPI latency
static uint8_t __attribute__((section(".dma_buffer"))) eth_packet_pool_buffer[NX_ETH_PACKET_POOL_SIZE];
static uint8_t __attribute__((section(".dma_buffer"))) nx_http_server_pool[HTTP_SRV_PACKET_POOL_SIZE];
static uint8_t __attribute__((section(".dma_buffer"))) nx_ws_server_pool[WS_SRV_PACKET_POOL_SIZE];
// PSRAM — FTP's sequential transfers work fine through OCTOSPI
static uint8_t __attribute__((section(".Nx_FTP_ServerPoolSection"))) nx_ftp_server_pool[FTP_SRV_PACKET_POOL_SIZE];Part 2 — Restrict HTTP to single-session. Moving the pools to AXI SRAM solves the corruption, but AXI SRAM is only ~317 KB. The default NetXDuo HTTP server configuration allows 2 concurrent sessions — meaning more TCP sockets, more packets in flight, and a larger pool needed in that precious AXI SRAM.
The fix: set NX_WEB_HTTP_SERVER_SESSION_MAX to 1 in nx_user.h:
/* nx_user.h — override to minimize AXI SRAM packet pool usage.
Browser requests are serialized over a single connection. */
#define NX_WEB_HTTP_SERVER_SESSION_MAX 1

This overrides the NetXDuo middleware default:
// nx_web_http_server.h — default
#ifndef NX_WEB_HTTP_SERVER_SESSION_MAX
#define NX_WEB_HTTP_SERVER_SESSION_MAX 2
#endif
#define NX_WEB_HTTP_SERVER_MAX_PENDING (NX_WEB_HTTP_SERVER_SESSION_MAX << 1)

Each session allocates a TCP socket and internal buffers, and the pending connection queue scales with it. Cutting from 2 to 1 significantly reduces the HTTP server's memory footprint. Modern browsers handle this gracefully — they pipeline requests over one connection.
With both changes applied, AXI SRAM dropped from ~94% to 46.78% — comfortable headroom instead of a knife's edge. External PSRAM gives you room for the byte pools and FTP packet pool, while AXI SRAM handles the DMA-critical paths.
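For completeness, those custom sections also need entries in the SECTIONS part of the linker script. A minimal sketch, assuming GNU ld; the section names must match the __attribute__((section(...))) strings, and the actual script in the repo may differ in detail:

/* RTOS byte pools go to their PSRAM regions (NOLOAD: nothing to copy at boot) */
.TxPoolSection (NOLOAD) : { *(.TxPoolSection) } >Mem_TxPool
.FxPoolSection (NOLOAD) : { *(.FxPoolSection) } >Mem_FxPool
.NxPoolSection (NOLOAD) : { *(.NxPoolSection) } >Mem_NxPool
/* DMA-visible packet pools stay in internal AXI SRAM */
.dma_buffer (NOLOAD) : { *(.dma_buffer) } >RAM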
If you haven't implemented WebSockets before, here's a brief primer. RFC 6455 upgrades an HTTP connection to a persistent, full-duplex TCP channel. Unlike HTTP request-response, a WebSocket connection stays open — both sides can send data at any time.
Every WebSocket connection starts with an HTTP upgrade request. The browser sends a standard HTTP GET with special headers:
GET / HTTP/1.1
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==

The server must:
1. Extract the Sec-WebSocket-Key value from the request
2. Append the fixed GUID 258EAFA5-E914-47DA-95CA-C5AB0DC85B11
3. SHA-1 hash the concatenation and base64-encode the digest
4. Return that string as Sec-WebSocket-Accept in a "101 Switching Protocols" response

If the accept key doesn't match what the browser expects, the connection is rejected. This is the part that trips up most embedded implementations — you need a working SHA-1 and base64 on your MCU.
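Handily, RFC 6455 includes a worked example you can test your implementation against. For the sample key above:

SHA-1 input:  dGhlIHNhbXBsZSBub25jZQ==258EAFA5-E914-47DA-95CA-C5AB0DC85B11
Accept key:   s3pPLMBiTxaQ9kYGzzhZRbK+xOo=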
After the handshake, communication switches to WebSocket frames. Each frame has a compact binary header:
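FIN (1 bit) | RSV (3 bits) | opcode (4 bits)     byte 0
MASK (1 bit) | payload length (7 bits)           byte 1
extended length                                  2 bytes if length field = 126, 8 bytes if 127
masking key                                      4 bytes, present only when MASK = 1
payload                                          the application data

(Field layout per RFC 6455; the receive code later in this post parses exactly these fields.)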
Client-to-server frames are always masked — the payload bytes are XOR'd with a rotating 4-byte key. Server-to-client frames are never masked. This asymmetry is part of the spec.
The WebSocket server runs as a dedicated ThreadX thread with its own TCP socket on port 8080, completely separate from the HTTP server. Here's how the pieces fit together.
All three servers are created in MX_NetXDuo_Init() — the NetXDuo byte pool provides the thread stacks, and each server gets its own packet pool:
// WebSocket Server (port 8080) — packet pool in AXI SRAM, stack in PSRAM byte pool
nx_packet_pool_create(&WS_ServerPacketPool, "WS Server Packet Pool",
WS_SERVER_PACKET_SIZE, nx_ws_server_pool,
WS_SRV_PACKET_POOL_SIZE);
tx_byte_allocate(byte_pool, (VOID **)&pointer, WS_SRV_STACK_SIZE, TX_NO_WAIT);
WS_Server_Create(&WebSocket_Server, &IpInstance,
&WS_ServerPacketPool, pointer);

The servers start in sequence after all prerequisites (ThreadX, FileX, eMMC) are ready, synchronized through ThreadX event flags:
// Wait for all subsystems
tx_event_flags_get(&TAGID_status_event_group, _EventFlags,
TX_AND, &tmp_actual_events, TX_WAIT_FOREVER);
// Start servers in order
nx_web_http_server_start(&HTTP_Server); // port 80
nx_ftp_server_start(&FTP_Server); // port 21
WS_Server_Start(&WebSocket_Server); // port 8080

The WebSocket server runs a single accept thread that listens for incoming TCP connections. When a client connects, it performs the handshake, enters a receive loop, and cleans up when the client disconnects:
void WS_Server_AcceptThread_Entry(ULONG thread_input)
{
    WS_Server_t *server = (WS_Server_t *)thread_input;
    UINT status;

    while (1) {
        // Block until a client connects
        status = nx_tcp_server_socket_accept(&server->listen_socket, TX_WAIT_FOREVER);
        if (status == NX_SUCCESS) {
            // Perform WebSocket handshake (HTTP upgrade)
            status = ws_perform_handshake(&server->listen_socket, server->packet_pool);
            if (status == NX_SUCCESS) {
                // Send initial state to the new client
                WS_SendLedStatus();
                // Receive loop — runs until client disconnects
                while (server->listen_socket.nx_tcp_socket_state == NX_TCP_ESTABLISHED) {
                    WS_ReceiveFrame(&server->listen_socket, server->packet_pool);
                    tx_thread_sleep(1); // yield to other threads
                }
            }
            // Client gone (or handshake failed) — clean up and relisten
            nx_tcp_server_socket_unaccept(&server->listen_socket);
            nx_tcp_server_socket_relisten(server->ip_instance,
                                          WS_SERVER_PORT,
                                          &server->listen_socket);
        }
    }
}

This is a single-client design — one WebSocket connection at a time. For a lab test bench or development dashboard, that's the right trade-off: simpler code, lower memory footprint, and no need to manage concurrent client state. The server structure supports up to WS_MAX_CLIENTS (4) slots for future expansion.
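For orientation, WS_Server_t looks roughly like this. A sketch inferred from the fields the accept thread dereferences; see ws_server.h in the repo for the actual definition:

typedef struct
{
    NX_IP          *ip_instance;   /* IP instance used for listen/relisten        */
    NX_PACKET_POOL *packet_pool;   /* AXI SRAM pool for handshake + frames        */
    NX_TCP_SOCKET   listen_socket; /* the socket accepted on WS_SERVER_PORT       */
    TX_THREAD       accept_thread; /* runs WS_Server_AcceptThread_Entry           */
    /* room reserved for WS_MAX_CLIENTS (4) client slots, per the note above      */
} WS_Server_t;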
The handshake function receives the raw HTTP upgrade request from the TCP socket, extracts the client's WebSocket key, computes the accept key, and sends back the 101 response:
static UINT ws_perform_handshake(NX_TCP_SOCKET *socket_ptr, NX_PACKET_POOL *pool_ptr)
{
    UINT status;
    ULONG len;
    NX_PACKET *request_packet, *response_packet;
    char *msg, *upgrade_ptr, *key_ptr;
    char client_key[32], accept_key[32], response_header[192];
    // (error-path packet releases omitted for brevity)

    // Receive and validate the HTTP upgrade request
    status = nx_tcp_socket_receive(socket_ptr, &request_packet, 5 * NX_IP_PERIODIC_RATE);
    if (status != NX_SUCCESS) return status;
    msg = (char *)request_packet->nx_packet_prepend_ptr;
    len = request_packet->nx_packet_length;
    // Verify it's a GET request with Upgrade: websocket
    if (ws_strnstr(msg, "GET ", len) == NULL) return NX_NOT_SUCCESSFUL;
upgrade_ptr = ws_strnstr(msg, "Upgrade:", len);
if (upgrade_ptr == NULL || ws_strnstr(upgrade_ptr, "websocket", 50) == NULL)
return NX_NOT_SUCCESSFUL;
// Extract Sec-WebSocket-Key
key_ptr = ws_strnstr(msg, "Sec-WebSocket-Key:", len);
// ... trim whitespace, copy key ...
// Compute accept key: SHA-1(client_key + GUID), then base64
ws_compute_accept_key(accept_key, client_key);
// Send HTTP 101 Switching Protocols
snprintf(response_header, sizeof(response_header),
"HTTP/1.1 101 Switching Protocols\r\n"
"Upgrade: websocket\r\n"
"Connection: Upgrade\r\n"
"Sec-WebSocket-Accept: %s\r\n\r\n",
accept_key);
// Allocate packet, append response, send
nx_packet_allocate(pool_ptr, &response_packet, NX_TCP_PACKET, TX_WAIT_FOREVER);
nx_packet_data_append(response_packet, response_header, strlen(response_header), ...);
nx_tcp_socket_send(socket_ptr, response_packet, TX_WAIT_FOREVER);
return NX_SUCCESS;
}

The SHA-1 computation (ws_compute_accept_key) and base64 encoding are implemented from scratch in ws_server.c. No external crypto library required — the SHA-1 implementation is a compact, self-contained function that operates on 512-bit blocks. The full implementation is in the GitHub repo; I won't reproduce the entire SHA-1 here, but the key point is that you concatenate the client's key with the WebSocket magic GUID (258EAFA5-E914-47DA-95CA-C5AB0DC85B11), hash it, and base64-encode the result.
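In outline, the computation looks like this. A sketch assuming helper names ws_sha1() and ws_base64_encode(); the actual names in ws_server.c may differ:

#define WS_MAGIC_GUID "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

static void ws_compute_accept_key(char *accept_key, const char *client_key)
{
    char concat[128];
    unsigned char digest[20];
    /* 1. Concatenate the client key with the fixed RFC 6455 GUID */
    snprintf(concat, sizeof(concat), "%s%s", client_key, WS_MAGIC_GUID);
    /* 2. SHA-1 over the concatenation (always a 20-byte digest) */
    ws_sha1((const unsigned char *)concat, strlen(concat), digest);
    /* 3. Base64-encode the digest: 28 characters plus NUL terminator */
    ws_base64_encode(digest, sizeof(digest), accept_key);
}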
When the browser sends a message (like an LED command), it arrives as a masked WebSocket frame. The server must parse the header and unmask the payload:
static UINT WS_ReceiveFrame(NX_TCP_SOCKET *socket_ptr, NX_PACKET_POOL *pool_ptr)
{
    NX_PACKET *packet_ptr;
    UINT status;
    unsigned char *data, mask[4];
    uint32_t header_len = 2; // minimum header: flags/opcode byte + length byte

    status = nx_tcp_socket_receive(socket_ptr, &packet_ptr, NX_NO_WAIT);
    if (status != NX_SUCCESS) return status;
    data = packet_ptr->nx_packet_prepend_ptr;

    // Parse frame header
    unsigned char opcode = data[0] & 0x0F;
    unsigned char masked = (data[1] & 0x80) != 0;
    uint64_t payload_len = data[1] & 0x7F;
    // Extended payload length (126 = 16-bit, 127 = 64-bit)
    if (payload_len == 126) {
        payload_len = (data[2] << 8) | data[3];
        header_len = 4;
    }
    // Extract masking key (4 bytes, always present on frames from a browser)
    if (masked) {
        memcpy(mask, &data[header_len], 4);
        header_len += 4;
    }
    // Unmask payload — XOR each byte with the rotating 4-byte mask
    unsigned char *payload = &data[header_len];
    if (masked) {
        for (uint32_t i = 0; i < payload_len; i++) {
            payload[i] ^= mask[i % 4];
        }
    }
    // Route by opcode
    if (opcode == WS_OPCODE_TEXT) {
        // JSON command — the repo copies the payload into a NUL-terminated
        // json_buffer before parsing; shortened here for readability
        WS_HandleJSONCommand((const char *)payload);
    }
    else if (opcode == WS_OPCODE_CLOSE) {
        nx_packet_release(packet_ptr);
        return NX_NOT_CONNECTED;
    }
    nx_packet_release(packet_ptr);
    return NX_SUCCESS;
}

Sending is simpler than receiving because server-to-client frames are never masked. The WS_Server_SendJSON function builds the WebSocket frame header and appends the JSON payload:
UINT WS_Server_SendJSON(WS_Server_t *server, const char *json_string)
{
    NX_PACKET *packet_ptr;
    unsigned char frame_header[4];
    uint32_t header_len;
    uint32_t json_len = strlen(json_string);

    // Build WebSocket TEXT frame header
    frame_header[0] = WS_FIN_FLAG | WS_OPCODE_TEXT; // 0x81 — final frame, text
    if (json_len < 126) {
        frame_header[1] = (unsigned char)json_len;
        header_len = 2;
    } else if (json_len < 65536) {
        frame_header[1] = 126;
        frame_header[2] = (json_len >> 8) & 0xFF;
        frame_header[3] = json_len & 0xFF;
        header_len = 4;
    } else {
        return NX_SIZE_ERROR; // 64-bit lengths never needed for these JSON frames
    }
    // Allocate packet, append header + JSON, send
    nx_packet_allocate(server->packet_pool, &packet_ptr, NX_TCP_PACKET, NX_NO_WAIT);
    nx_packet_data_append(packet_ptr, frame_header, header_len, server->packet_pool, NX_NO_WAIT);
    nx_packet_data_append(packet_ptr, (void *)json_string, json_len, server->packet_pool, NX_NO_WAIT);
    return nx_tcp_socket_send(&server->listen_socket, packet_ptr, NX_NO_WAIT);
}

Incoming JSON commands are parsed with the lightweight cJSON library. The command dispatcher checks the type field and routes to the appropriate handler:
void WS_HandleJSONCommand(const char *json_string)
{
cJSON *root = cJSON_Parse(json_string);
if (root == NULL) return; // reject malformed JSON
cJSON *type = cJSON_GetObjectItemCaseSensitive(root, "type");
if (cJSON_IsString(type) && strcmp(type->valuestring, "led-command") == 0) {
handle_led_command(root);
}
cJSON_Delete(root);
}
static void handle_led_command(cJSON *root)
{
cJSON *led = cJSON_GetObjectItemCaseSensitive(root, "led");
cJSON *state = cJSON_GetObjectItemCaseSensitive(root, "state");
if (!cJSON_IsString(led) || !cJSON_IsBool(state)) return; // validate before use
GPIO_PinState pin_state = cJSON_IsTrue(state) ? GPIO_PIN_SET : GPIO_PIN_RESET;
if (strcmp(led->valuestring, "yellow") == 0) {
HAL_GPIO_WritePin(LED_Yellow_GPIO_Port, LED_Yellow_Pin, pin_state);
}
// ... red LED handling ...
WS_SendLedStatus(); // confirm state back to client
}

The JSON message format is straightforward. Browser sends:
{"type": "led-command", "led": "yellow", "state": true}Server responds with the current state:
{"type": "led-status", "data": {"yellowLed": true, "redLed": false}}A dedicated telemetry thread reads sensor data and broadcasts it to the connected client. The update rate is configurable — set to ~1 Hz in the demo, but adjustable via TELEMETRY_UPDATE_RATE_MS in ws_telemetry.h:
void tx_telemetry_thread_entry(ULONG thread_input)
{
    ULONG tmp_actual_events;

    // Wait for HTTP server to be ready (event flag synchronization)
    tx_event_flags_get(&TAGID_status_event_group, TAGID_SE_HTTP_SERVER_OK,
                       TX_AND, &tmp_actual_events, TX_WAIT_FOREVER);
    while (1) {
        ULONG uptime = tx_time_get() / TX_TIMER_TICKS_PER_SECOND; // seconds since boot (assumed; repo may track uptime differently)
        float junction_temp = DTS_ReadTemperature();
// Build JSON with cJSON
cJSON *msg = cJSON_CreateObject();
cJSON *data = cJSON_CreateObject();
cJSON_AddNumberToObject(data, "junctionTemp", junction_temp);
cJSON_AddNumberToObject(data, "uptime", (double)uptime);
// ... random float and int for demo ...
cJSON_AddStringToObject(msg, "type", "telemetry");
cJSON_AddItemToObject(msg, "data", data);
char *json_string = cJSON_PrintUnformatted(msg);
WS_Server_SendJSON(&WebSocket_Server, json_string);
cJSON_free(json_string);
cJSON_Delete(msg);
tx_thread_sleep(TELEMETRY_UPDATE_TICKS); // ~1 second
}
}

The browser receives:
{"type": "telemetry", "data": {"junctionTemp": 42.0, "randomFloat": 73.21, "randomInt": 456, "uptime": 12345}}This is where the WebSocket advantage becomes obvious. With HTTP polling, you'd be hammering the server with GET requests at whatever rate you need updates. With WebSocket, the server pushes data when it's ready — no request overhead, no wasted bandwidth, sub-millisecond latency.
The demo project gives you a self-contained reference implementation: the HTTP server delivering the dashboard from eMMC, the FTP server for updating the site files, and the WebSocket server handling LED commands and ~1 Hz telemetry.
The web interface uses Bootstrap 5 with a dark theme, vanilla JavaScript, and connects to the WebSocket server automatically on page load. The WebSocket log panel shows every message in real time for debugging.
Replace the LED commands with your motor controller, sensor array, or power converter, and replace the telemetry values with your actual measurements. The WebSocket plumbing stays the same.
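As a sketch of that swap (handle_setpoint_command, Converter_SetVoltage, and the JSON fields below are hypothetical, not part of the repo):

/* Browser would send: {"type": "setpoint", "voltage": 24.5} */
static void handle_setpoint_command(cJSON *root)
{
    cJSON *voltage = cJSON_GetObjectItemCaseSensitive(root, "voltage");
    if (cJSON_IsNumber(voltage)) {
        Converter_SetVoltage((float)voltage->valuedouble); /* your hardware call here */
    }
}

/* ...plus one extra branch in WS_HandleJSONCommand(): */
else if (strcmp(type->valuestring, "setpoint") == 0) {
    handle_setpoint_command(root);
}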
The complete source code is available on GitHub:
https://github.com/intector/NUCLEO-MEM
License: MIT — use it however you need.
Once the board is running and connected to your network, open a browser to http://192.168.0.50 for the web dashboard. The WebSocket connection establishes automatically. You can also test the WebSocket directly from a browser console:
const ws = new WebSocket('ws://192.168.0.50:8080');
ws.onmessage = (e) => console.log(JSON.parse(e.data));
// send() only works once the connection has opened:
ws.onopen = () => ws.send(JSON.stringify({type: 'led-command', led: 'yellow', state: true}));
For reference, here's what the full system looks like in terms of memory allocation:
| Resource | Size | Location |
| ThreadX byte pool | 500 KB | PSRAM |
| FileX byte pool | 500 KB | PSRAM |
| NetXDuo byte pool | 500 KB | PSRAM |
| ETH packet pool | ~93 KB | AXI SRAM |
| HTTP server packet pool | 30 KB | AXI SRAM |
| WebSocket server packet pool | 20 KB | AXI SRAM |
| FTP server packet pool | 100 KB | PSRAM |
| AXI SRAM total | 148 KB / 317 KB (46.78%) | After single-session HTTP |
The three byte pools alone (1.5 MB) exceed the H723's total internal RAM. Without external PSRAM, this configuration is impossible. And even with PSRAM, the DMA-accessible packet pools still need AXI SRAM — which is why you end up carefully placing buffers across two memory regions. The single-session HTTP optimization (NX_WEB_HTTP_SERVER_SESSION_MAX = 1) is what brought AXI SRAM from a dangerous 94% down to a comfortable 47%.
I hope this helps someone struggling with the same problems. WebSocket support on STM32 is one of those things that feels like it should be straightforward but turns into a multi-week project when you realize there are no examples, no middleware, and no documentation.
The GitHub repo has the complete source code. If you have questions, suggestions, or find bugs — post them here or open an issue on GitHub. Contributions are welcome.
Thank you, and to everyone else out there:
"The secret is to keep banging the rocks together, guys."
Kai Jensen — Intector Inc.
2026-03-02 1:05 AM
Hello @Intector, and thank you for sharing.
I'll mark this post as the accepted solution to give it more visibility for community users.
2026-03-02 2:10 AM
Thank you for the detailed post, very informative.
Are there any advantages to going with eMMC instead of an SD card?
Best regards
2026-03-02 7:25 AM
For one thing, not having a socket can be a big advantage in many applications.
For other differences, try an internet search for "eMMC vs SD card", or similar;
eg, https://www.google.com/search?q=eMMC+vs+SD+Card
(other search engines are available)
2026-03-02 10:53 AM - edited 2026-03-02 10:58 AM
Thank you @Intector for a very interesting project.
Actually, the explanation of the HTTP vs FTP behavior seems a bit unclear. The ETH DMA bypasses the MCU's DCACHE. But does the OSPI controller synchronize accesses from the MCU and another master (ETH) in memory-mapped mode, or is it undefined behavior?
2026-03-03 12:25 AM
In this application the eMMC is used for storing the webserver files and for datalogging (on a FAT filesystem). If the webserver crashes, popping out an SD card and using a card reader to read back the logs is certainly easier than accessing the eMMC. That is why I asked @Intector about the memory type choice.
2026-03-26 4:30 PM
Hi @Kraal,
That's a fair point — an SD card is definitely more convenient for pulling logs manually. The choice of eMMC here was deliberate: it's soldered directly to the board, which eliminates the socket (a common failure point) and prevents accidental ejection. It's also a requirement in some industries where removable storage is not permitted for security or compliance reasons.
For a lab or development scenario where quick log access matters, an SD card is a perfectly reasonable alternative — FileX supports it just as well.
Best regards,
Kai Jensen
2026-03-26 5:02 PM
Thank you @Pavel A., that's an excellent question — it gets to the heart of what I ran into.
The ETH driver (nx_stm32_eth_driver.c) passes packet buffer pointers directly to the DMA TX descriptors:
Txbuffer[i].buffer = pktIdx->nx_packet_prepend_ptr;

So wherever a packet pool lives in memory, that's where the Ethernet MAC DMA reads from on transmit. I originally placed the HTTP, WebSocket, and FTP packet pools in PSRAM (APS6408L via OCTOSPI1, memory-mapped at 0x90000000). HTTP and WebSocket traffic caused corruption and DMA stalls. FTP continued to work. The fix was to move the latency-sensitive pools into AXI SRAM while keeping FTP in PSRAM:
| Pool | Location | MPU Config |
| ETH main (60 × 1536B) | AXI SRAM .dma_buffer | TEX=1, C=0, B=0 (non-cacheable Normal) |
| HTTP server (30 KB) | AXI SRAM .dma_buffer | same |
| WebSocket (20 KB) | AXI SRAM .dma_buffer | same |
| ETH DMA descriptors (1 KB) | AXI SRAM 0x24000000 | TEX=0, C=0, B=0 (Device/Strongly-Ordered) |
| FTP server (100 KB) | PSRAM 0x90177000 | TEX=1, C=1, B=1 (Write-Back) |
AXI SRAM (MPU Region 2) is configured as non-cacheable (TEX=1, C=0, B=0), so there is no cache coherency issue for the ETH/HTTP/WS pools — the CPU and ETH DMA see the same data without any cache maintenance. The DMA descriptors sit in the first 1 KB of AXI SRAM with an overlapping MPU region set to Device/Strongly-Ordered (TEX=0, C=0, B=0) to prevent write reordering.
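For reference, Region 2 from that table translates to roughly the following HAL configuration. A sketch assuming the CubeMX-style HAL_MPU_ConfigRegion() flow; the region size and the generated code in the repo may differ:

MPU_Region_InitTypeDef region = {0};

HAL_MPU_Disable();
/* Region 2: AXI SRAM as non-cacheable Normal memory (TEX=1, C=0, B=0) */
region.Enable           = MPU_REGION_ENABLE;
region.Number           = MPU_REGION_NUMBER2;
region.BaseAddress      = 0x24000000;
region.Size             = MPU_REGION_SIZE_512KB;       /* power-of-two size covering AXI SRAM */
region.AccessPermission = MPU_REGION_FULL_ACCESS;
region.TypeExtField     = MPU_TEX_LEVEL1;              /* TEX=1 */
region.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE;    /* C=0 */
region.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;   /* B=0 */
region.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;
region.DisableExec      = MPU_INSTRUCTION_ACCESS_ENABLE;
HAL_MPU_ConfigRegion(&region);
HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);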
However, the FTP pool at 0x90177000 falls under MPU Region 5 — which is cached Write-Back (TEX=1, C=1, B=1), not shareable. So for FTP, there is a real DCACHE coherency concern: the CPU writes packet data through the cache, but the ETH DMA reads through the bus and could see stale data if the cache lines have not been written back. In practice, FTP transfers large sequential blocks and the cache likely evicts naturally, but it is not architecturally guaranteed. We should arguably be calling SCB_CleanDCache_by_Addr() before transmit, or moving the FTP pool to the non-cached PSRAM region at 0x90400000 (MPU Region 6, TEX=1, C=0, B=0).
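If the pool stays in cached PSRAM, the defensive fix would look like this. A sketch of the idea only; SCB_CleanDCache_by_Addr() is standard CMSIS and wants 32-byte cache-line alignment:

/* Before handing a packet to HAL_ETH_Transmit_IT(): flush the CPU's
   cached copy of the payload so the ETH DMA reads current data. */
uint32_t start = (uint32_t)packet_ptr->nx_packet_prepend_ptr & ~31UL;          /* align down to cache line */
uint32_t bytes = packet_ptr->nx_packet_length
               + ((uint32_t)packet_ptr->nx_packet_prepend_ptr - start);
SCB_CleanDCache_by_Addr((uint32_t *)start, (int32_t)((bytes + 31UL) & ~31UL)); /* round up to cache line */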
Now to the harder part of your question — OCTOSPI arbitration. In memory-mapped mode, OCTOSPI1 sits behind the AXI bus matrix. The bus matrix arbitrates between masters (Cortex-M7, ETH DMA, etc.) using fixed priority with round-robin, so only one transaction reaches OCTOSPI at a time at the bus level. The OCTOSPI controller itself only processes one command/data phase at a time — it does not have multi-port capability.
So in theory, bus matrix serialization should make concurrent access safe — each master waits its turn. But the APS6408L PSRAM has its own constraints: 1 KB page boundaries (configured via ChipSelectBoundary = 10), refresh requirements (Refresh = 99), and the OCTOSPI operates at 50 MHz OPI DTR with DLYB-tuned quarter-period delay for data eye centering. If the bus matrix interleaves a CPU read mid-way through an ETH DMA burst, the OCTOSPI must deassert CS, handle the new access, then resume — and whether the PSRAM's internal state machine handles this gracefully under all timing conditions is not well documented.
What I observed empirically: under HTTP/WebSocket load (many small concurrent packets, frequent interleaved CPU + DMA access to OCTOSPI), data corrupted. Under FTP (large sequential transfers, less interleaving), it worked. This is consistent with either a timing/protocol issue when OCTOSPI switches between masters rapidly (CS deassertion, re-assertion, and DLYB-tuned sampling may not stay stable under rapid context switches), or the DCACHE coherency problem described above (FTP gets lucky with natural eviction, HTTP does not), or both compounding each other.
Bottom line: I would not recommend using cached PSRAM via OCTOSPI as a shared CPU/DMA buffer region. It may work under low contention, but it is fragile. My FTP pool still being in cached PSRAM is arguably a latent bug — it should either be moved to the non-cached PSRAM region (0x90400000) or to AXI SRAM if space permits.
Best regards,
Kai