Associate III
June 9, 2026
Solved

STLINK-V3MINIE - Feature: Pipelined Data Transfers

@ ST engineers: This is a feature request for the firmware of the STLINK-V3MINIE and related debuggers/programmers: pipelined data transfers over the existing USB High-Speed link.

In my pursuit to greatly improve the STLINK-V3MINIE's performance for large continuous reads from MCU memory, I discovered that there is huge room for improvement: SWD's full potential (~2038 KB/s = 24M [cycles/s] * 4 [bytes/transfer] / 46 [cycles/transfer]) is nowhere near being reached. (Note that USB shouldn't be the bottleneck, judging from the maximum clock speed that the USB IP of the STLINK-V3MINIE's STM32F723 can support.) I only achieve ~549 KB/s when reading from an STM32H7S3 MCU with ST's programming SW (v2.20.0) through an STLINK-V3MINIE (FW V3J17M10):

STM32_Programmer_CLI --connect port=SWD freq=24000 ap=1 mode=HOTPLUG -halt --read 0x24000000 0x60000 /tmp/dmp.bin
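The theoretical SWD ceiling quoted above can be reproduced with a few lines of Python. This is only a sketch: the per-phase cycle breakdown is an assumption (the post gives only the total of 46 cycles per transfer), and "KB" is taken to mean 1024 bytes, which is what makes the ~2038 figure come out.

```python
# Assumed breakdown of one SWD read transfer; only the total (46) is from the post.
SWD_READ_CYCLES = {
    "request": 8,
    "turnaround_1": 1,
    "ack": 3,
    "data": 32,
    "parity": 1,
    "turnaround_2": 1,
}


def swd_peak_throughput(swclk_hz: float, bytes_per_transfer: int = 4) -> float:
    """Peak payload throughput in bytes/s, assuming back-to-back transfers."""
    cycles = sum(SWD_READ_CYCLES.values())
    return swclk_hz * bytes_per_transfer / cycles


peak = swd_peak_throughput(24e6)
print(f"peak: {peak / 1024:.0f} KB/s (1 KB = 1024 bytes)")
print(f"time per 1 KB: {1024 / peak * 1e6:.0f} us")
```

The model ignores WAIT responses, idle cycles between transfers, and any AP setup traffic, so it is an upper bound rather than an expected value.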

A simplified sequence diagram of the communication used here is shown:

[Figure: Read memory region; current situation]

I measured the SWD clock line with an oscilloscope and can confirm that 24 MHz is reached, so the SWD transfer should only take ~491 us (= 1 KB / ~2038 [KB/s]).
But looking at the USB timings, the time from the JTAG_READMEM_32BIT command to the arrival of the 1024 bytes of data is ~1642 us (of which only ~40 us is spent sending the command). So it seems that most of the time (~1111 us = ~1642 us - ~40 us - ~491 us) is not spent usefully, considering that the USB data rate is much higher than the SWD data rate.

In order to improve performance, I tried dividing the buffer read into bigger chunks (6144 instead of 1024 bytes) and leaving out the status queries:

[Figure: Read memory region; use bigger chunks, leave out status]

This did improve performance a bit (~633 instead of ~549 KB/s), but a lot of time still seems to be wasted: the time from the JTAG_READMEM_32BIT command to the arrival of the 6144 bytes of data is ~9340 us (of which only ~40 us is spent sending the command), which is again much higher than the time the SWD communication should take: ~2944 us (= 6144 bytes / ~2038 [KB/s]).
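The time budgets for both chunk sizes can be tabulated side by side. This is a sketch using only the figures measured above; the function name `budget` and the split into "SWD" and "other" time are illustrative, and the SWD time is the theoretical one, not a measured one.

```python
PEAK_KB_PER_S = 24e6 * 4 / 46 / 1024  # theoretical SWD rate, ~2038 KB/s


def budget(chunk_bytes: int, total_us: float, cmd_us: float = 40.0):
    """Split a measured command-to-data time into theoretical SWD time and the rest."""
    swd_us = chunk_bytes / 1024 / PEAK_KB_PER_S * 1e6
    other_us = total_us - cmd_us - swd_us
    return swd_us, other_us, other_us / total_us


for chunk, total in ((1024, 1642.0), (6144, 9340.0)):
    swd_us, other_us, fraction = budget(chunk, total)
    print(f"{chunk:5d} bytes: SWD {swd_us:7.0f} us, "
          f"unexplained {other_us:7.0f} us ({fraction:.0%} of total)")
```

In both cases roughly two thirds of the elapsed time is not accounted for by the theoretical SWD transfer, so the larger chunk size mostly removes per-command overhead without changing that proportion.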

Therefore I wanted to go one step further and enable pipelined transfers: queue multiple concurrent read requests so that everything that previously took time besides SWD communication (i.e., preparation, USB communication, and finalization) can be done in parallel with SWD communication. I figured two concurrent read requests would be a good starting point. Here is what the communication would look like:

[Figure: Read memory region; pipelined transfers]
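The expected benefit of this scheme can be estimated with a toy two-stage pipeline model. It is a sketch only: it assumes that all non-SWD time can overlap with SWD activity, which is the premise of this request and not a verified property of the firmware. The function names are hypothetical; the 2944 us and 9340 us figures come from the measurements above.

```python
def sequential_time(n_chunks: int, swd_us: float, other_us: float) -> float:
    """Each chunk waits for the previous one to finish completely."""
    return n_chunks * (swd_us + other_us)


def pipelined_time(n_chunks: int, swd_us: float, other_us: float) -> float:
    """Two-stage pipeline: after the first chunk, the slower stage sets the pace."""
    return swd_us + other_us + (n_chunks - 1) * max(swd_us, other_us)


TOTAL_BYTES = 0x60000             # region size from the CLI command
CHUNK = 6144
N = TOTAL_BYTES // CHUNK          # 64 chunks
SWD_US = 2944.0                   # theoretical SWD time per chunk
OTHER_US = 9340.0 - SWD_US        # everything else, assumed fully overlappable

for name, fn in (("sequential", sequential_time), ("pipelined", pipelined_time)):
    t_us = fn(N, SWD_US, OTHER_US)
    print(f"{name}: {t_us / 1000:.1f} ms, {TOTAL_BYTES / 1024 / (t_us * 1e-6):.0f} KB/s")
```

In this model the larger of the two stage times bounds the throughput, so the gain depends entirely on how much of the non-SWD time is really independent of the SWD engine.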

When I tested this, reading the data timed out after I had sent two commands. This is likely because the STLINK-V3MINIE's firmware doesn't support two concurrent read requests. The firmware seems to be protected with an authentication header, so I wasn't able to easily patch it to add this feature. Therefore I would like to ask ST whether they could implement such a feature.

Thanks a lot!

S C (Best answer)
ST Employee
June 10, 2026

Hello,

Thank you for your effort and suggestion; I’m aware there is room for improvement in the maximum performance of the ST-Link at the SWD COM level. First, I would like to focus on the context of a read command on STM32: currently, the largest absolute gain would apply to a full flash read-back (4 MB max currently on STM32) after a programming phase. Transfers in a debug context are much smaller. Without going into the details, this might explain why the effort has not already been made. Anyway, I will share my analysis of the topic, as it differs from yours: your theoretical calculation gives about 1.92 µs for a 32-bit word, whereas if you measure SWCLK with an oscilloscope, you will see that a word actually takes around 4 µs and is followed by 1.75 µs of silence! So if we want to improve, the work lies here (at the SWD COM implementation level only). I am not even counting the WAIT state and ReadOk check, which also lengthen the whole frame a little compared to the theoretical calculation.
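The word timings quoted in this reply can be turned into an effective throughput for comparison with the theoretical figure. This is a rough sketch: the 4 µs value is stated as approximate, and WAIT responses and the ReadOk check are left out.

```python
WORD_BYTES = 4


def throughput_kb_s(word_us: float, gap_us: float = 0.0) -> float:
    """Payload throughput in KB/s (1 KB = 1024 bytes) for one word per (word + gap)."""
    return WORD_BYTES / ((word_us + gap_us) * 1e-6) / 1024


theoretical = throughput_kb_s(46 / 24.0)   # 46 cycles at 24 MHz, ~1.92 us per word
measured = throughput_kb_s(4.0, 1.75)      # ~4 us per word plus 1.75 us of silence

print(f"theoretical: {theoretical:.0f} KB/s")
print(f"from measured word timing: {measured:.0f} KB/s")
```

The second figure is in the same range as the ~633 KB/s observed with 6144-byte chunks, which is consistent with the SWD word timing, rather than USB, dominating the transfer time.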

As you said, USB is not the bottleneck. Currently, the USB transfer and the SWD transfer are done sequentially. Another improvement would be to parallelize both flows (as you suggest), but this would not give the results you currently expect: the USB side will at some point have to wait for data from the SWD side anyway, so the maximum gain we can expect is the USB transfer duration, which is already optimal with ST-Link for 6144-byte transfers, as you saw. I measured 380 µs with a USB analyzer for such a transfer (note that this is also far from 480 Mbit/s ...). The other latency you can see at the USB level is the wait for the end of the SWD transaction.
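The upper bound on the gain from overlapping USB and SWD can be estimated as follows. This sketch pairs the 380 µs USB figure from this reply with the ~9340 µs command-to-data time reported in the original post for a 6144-byte read; combining the two measurements is an assumption, since they were taken on different setups.

```python
def overlap_gain(total_us: float, usb_us: float) -> float:
    """Relative throughput gain if the USB transfer time were hidden completely."""
    return total_us / (total_us - usb_us) - 1.0


gain = overlap_gain(total_us=9340.0, usb_us=380.0)
print(f"best-case gain from hiding the USB transfer: {gain:.1%}")
```

Under these assumptions, hiding the USB transfer entirely is worth only a few percent, far less than what reducing the per-word SWD time could bring.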

So, in conclusion, I would say that your suggestion is not the solution for greater performance, and with ST-Link I’m afraid you have to cope with the performance of the official firmware. Depending on the context, the overall performance of a tool might also be improved by reducing the SWD traffic (sometimes that is possible), once you have identified SWD as the bottleneck. Improving the SWD performance of the ST-Link is one of our background tasks, subject to priority assignment ...

Best regards

Eliasvan (Author)
Associate III
June 11, 2026

Dear S C,

Thanks a lot for your reply!

"Transfers in debug context are much smaller."

This does not hold for my use case (which I should have mentioned in my post): I want the debugger to continuously read from a ring buffer in the MCU’s internal RAM. RTOS events such as context switches and mutex utilization are written by the MCU to the ring buffer at a high rate (equivalent to ~1 MB/s). (And there are other constraints that prevent me from using ITM and SWO instead.)
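To illustrate what such continuous reading involves on the host side, here is a sketch of how the unread part of a ring buffer maps onto at most two contiguous memory reads. The buffer layout (byte offsets `rd` and `wr`, with `rd == wr` meaning empty), the function name, and the example size are all hypothetical; the post does not describe the actual layout.

```python
def pending_spans(base: int, size: int, rd: int, wr: int) -> list[tuple[int, int]]:
    """Return (address, length) reads covering the unread bytes of a ring buffer.

    rd and wr are byte offsets in [0, size); rd == wr is treated as empty.
    """
    if not (0 <= rd < size and 0 <= wr < size):
        raise ValueError("offsets must lie inside the buffer")
    if wr >= rd:
        spans = [(base + rd, wr - rd)]
    else:
        # The writer has wrapped: read up to the end, then from the start.
        spans = [(base + rd, size - rd), (base, wr)]
    return [(addr, length) for addr, length in spans if length > 0]


# Hypothetical 4 KB buffer at the RAM address used in the CLI command above.
for addr, length in pending_spans(0x24000000, 0x1000, rd=0xF00, wr=0x80):
    print(f"read {length} bytes at 0x{addr:08X}")
```

Each span would then be fetched with one or more memory-read commands, so the per-command overhead discussed in this thread applies to every span.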

"your theoretical computing would give something like 1,92µs for a 32bits word. While if you measured SWCLK with an oscilloscope, you could have seen that a word duration is rather around 4µs, and followed by 1,75µs of silence !"

Thanks for these details! I only measured the maximum frequency of SWCLK with the oscilloscope; I didn’t go to the effort of also connecting SWDIO and measuring the exact timing of 32-bit word transfers. The theoretical calculation nonetheless conforms to what the reference manual indicates, so the biggest room for improvement is indeed in the SWD communication, as you mentioned!

I would say that is good news because optimizing the SWD communication can be done purely in the ST-Link’s FW and doesn’t require changes in the way the Host interacts with the ST-Link.

"The improvement of SWD performance by ST-Link is one of the background tasks depending on the priority assignment ..."

I understand; I’m eagerly waiting for this to be implemented. ;-)