NUCLEO-U5A5ZJ-Q USB CDC ACM maximum speed using USBx

Kannan1
Associate III

Hi,

I am using the NUCLEO-U5A5ZJ-Q board to set up a USB VCP connection with a PC. I am currently using the CDC-ACM example project and getting a maximum speed of up to 4.4 MB/s over the connection after increasing the Tx FIFO size and the max packet size in the device-side example code.

 

  HAL_PCDEx_SetTxFiFo(&hpcd_USB_OTG_HS, 0, USBD_MAX_EP0_SIZE/4);
  HAL_PCDEx_SetTxFiFo(&hpcd_USB_OTG_HS, 1, 1920);
  HAL_PCDEx_SetTxFiFo(&hpcd_USB_OTG_HS, 2, USBD_CDCACM_EPINCMD_HS_MPS/4);

 

Since this example is a USB-UART bridge, I have removed the UART part and kept only the usbx_cdc_acm_write_thread_entry thread function enabled for sending data to the PC. I have written a Python script on the Windows side to receive the buffer.

I have attached the file ux_user.h for reference,

 

 

And I have increased the memory pool sizes of the device:

 

#define TX_APP_MEM_POOL_SIZE                     (1024*1024)

#define UX_DEVICE_APP_MEM_POOL_SIZE              (500*1024)

#define USBPD_DEVICE_APP_MEM_POOL_SIZE           (10000)

 

Also the stack size,

 

#define USBX_DEVICE_MEMORY_STACK_SIZE       100*1024

#define UX_DEVICE_APP_THREAD_STACK_SIZE   1024

 

I am passing a buffer size of 32768 bytes to the ux_device_class_cdc_acm_write function. To check the speed, I send 100 * 32768 bytes in a loop and receive them on the PC side.

Or do I need to use the non-blocking function ux_device_class_cdc_acm_write_with_callback to get maximum throughput? Anyway, I have tried that by disabling the macro UX_DEVICE_CLASS_CDC_ACM_TRANSMISSION_DISABLE and setting up the callback functions in the USBD_CDC_ACM_Activate function.

 

  /* Start Bulk transmission thread */
  UX_SLAVE_CLASS_CDC_ACM_CALLBACK_PARAMETER CDC_VCP_Callback;
  CDC_VCP_Callback.ux_device_class_cdc_acm_parameter_read_callback = &USBD_CDC_ACM_read_callback;
  CDC_VCP_Callback.ux_device_class_cdc_acm_parameter_write_callback = &USBD_CDC_ACM_write_callback;
  if (ux_device_class_cdc_acm_ioctl(cdc_acm, UX_SLAVE_CLASS_CDC_ACM_IOCTL_TRANSMISSION_START,
                                    &CDC_VCP_Callback) != UX_SUCCESS)
  {
    Error_Handler();
  }

 

Unfortunately the code somehow ends up in the hard fault handler. I haven't dug into it much, since I am not sure whether it would solve the speed issue. Is there an example project for this callback mode, if it provides better throughput?

Is there any configuration I can change in this example code to get at least 10 MB/s over the USB HS VCP class? Or is this a limitation of the USBX stack? Do I need to write a separate USB stack instead of using USBX to get more speed? Since USB HS supports up to 480 Mbps, I think I should get at least half of that, i.e. 240 Mbps (30 MB/s). What is the limiting factor for the speed in this application?
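
As a rough sanity check on the 30 MB/s expectation, here is a minimal sketch of the bus-level arithmetic. It assumes the bulk transaction limits from the USB 2.0 specification (up to 13 packets of 512 bytes per 125 us microframe at high speed, up to 19 packets of 64 bytes per 1 ms frame at full speed). The function names are hypothetical and are not part of USBX or the HAL.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second a bulk endpoint can carry if the host schedules
 * `packets_per_interval` packets of `packet_bytes` in every frame
 * or microframe. All names here are illustrative only. */
uint32_t bulk_bytes_per_second(uint32_t packet_bytes,
                               uint32_t packets_per_interval,
                               uint32_t intervals_per_second)
{
    assert(packet_bytes > 0u && packets_per_interval > 0u);
    return packet_bytes * packets_per_interval * intervals_per_second;
}

/* ASSUMPTION (USB 2.0 bulk limits): 13 x 512-byte packets per
 * 125 us microframe, i.e. 8000 microframes per second. */
uint32_t hs_bulk_ceiling(void)
{
    return bulk_bytes_per_second(512u, 13u, 8000u);
}

/* ASSUMPTION (USB 2.0 bulk limits): 19 x 64-byte packets per
 * 1 ms frame, i.e. 1000 frames per second. */
uint32_t fs_bulk_ceiling(void)
{
    return bulk_bytes_per_second(64u, 19u, 1000u);
}
```

Under these assumptions hs_bulk_ceiling() gives 53,248,000 bytes/s and fs_bulk_ceiling() gives 1,216,000 bytes/s. A 10 MB/s target is therefore well under the HS bus limit, which suggests that the 4.4 MB/s cap comes from the device firmware or the host side rather than from the wire.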

Please let me know the suggestions.

Thanks

9 REPLIES
Kannan1
Associate III

Update:

Here is the MX_USB_OTG_HS_PCD_Init function. DMA is disabled; do I need to enable DMA for better speed?

void MX_USB_OTG_HS_PCD_Init(void)
{

  /* USER CODE BEGIN USB_OTG_HS_Init 0 */

  /* USER CODE END USB_OTG_HS_Init 0 */

  /* USER CODE BEGIN USB_OTG_HS_Init 1 */

  /* USER CODE END USB_OTG_HS_Init 1 */
  hpcd_USB_OTG_HS.Instance = USB_OTG_HS;
  hpcd_USB_OTG_HS.Init.dev_endpoints = 9;
  hpcd_USB_OTG_HS.Init.speed = PCD_SPEED_HIGH;
  hpcd_USB_OTG_HS.Init.phy_itface = USB_OTG_HS_EMBEDDED_PHY;
  hpcd_USB_OTG_HS.Init.Sof_enable = DISABLE;
  hpcd_USB_OTG_HS.Init.low_power_enable = DISABLE;
  hpcd_USB_OTG_HS.Init.lpm_enable = DISABLE;
  hpcd_USB_OTG_HS.Init.use_dedicated_ep1 = DISABLE;
  hpcd_USB_OTG_HS.Init.vbus_sensing_enable = DISABLE;
  hpcd_USB_OTG_HS.Init.dma_enable = DISABLE;
  if (HAL_PCD_Init(&hpcd_USB_OTG_HS) != HAL_OK)
  {
    Error_Handler();
  }
  /* USER CODE BEGIN USB_OTG_HS_Init 2 */

  /* USER CODE END USB_OTG_HS_Init 2 */

}

 

Good question:
I had to measure my U5A5 VCP UART speed as well (yes, I am also using a U5A5 with VCP UART).
The U5A5 should use OTG USB HS (with integrated PHY), so we would "assume" the speed is now 480 Mbps.

But it is not true!

Did you know that an MCU running a VCP UART device depends on the host (in terms of USB timing)? The link (wire) speed (e.g. 480 Mbps) does not matter so much; it is more related to the question: "when will the host (PC) ask again for new characters (or when will it send new characters from host to MCU)?"

I understand VCP UART in this way:

  • MCU provides the Device - but the host controls the timing, e.g. when MCU would be able to send something, or with which period the host would send something to MCU
  • FS and HS can differ, e.g. HS can use USB "micro-frames" (and maybe send more often)

Usually I take VCP UART (as Device on MCU) this way:

  • the host (as master) will "query" the device (MCU), e.g. every 1 ms (in FS mode)
  • it will ask for new bytes to receive on host side (so that MCU is "allowed" to send) or it will send from host some bytes to MCU
  • the max. packet size (number of characters) is 64 for USB VCP

But all comes down to "understand USB", FS vs. HS (with faster transmission, more frequently, with "micro-frames").

You cannot "judge" the MCU: the USB speed is controlled by USB and esp. the host. The Device cannot send anything without being "requested/allowed" to do so. So, the possible throughput depends heavily on the host side.

My suggestion:

  • always send the maximum packet size, e.g. these 64 characters, on both sides (from host to MCU, from MCU to host)
  • now measure the "throughput" (in bits per second)
  • you cannot go faster than what you get by changing the MCU side: the Master (the PC as USB Host) sets the timing (when the MCU is asked to send something back to the host; if this "query" comes just every x ms, there is nothing you can do on the Device side)

I guess: with VCP UART you can reach a throughput of up to 4..8 Mbps, regardless of whether FS or HS is used.

FBL
ST Employee

Hello @Kannan1 

> Unfortunately the code goes to hardfault handler somehow.

Would you share the fault analysis? 

For USB OTG HS, enabling DMA can potentially improve the speed of data transfer. OTG_HS embeds an internal DMA with thresholding support and software selectable AHB burst type in DMA mode.


Just raising questions:

You configure too many large memory pools, e.g.:

#define TX_APP_MEM_POOL_SIZE                     (1024*1024)

#define UX_DEVICE_APP_MEM_POOL_SIZE              (500*1024)

This is already 1.5 MB, plus some other configs (malloc heap, stack...).

Are you sure you have that much memory?

You potentially get a hard fault because you try to use memory outside the available space ("jumping into the forest").
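
One way to check this concern is to add up the pools and compare them against the RAM budget at build time. The sketch below uses the three pool sizes quoted in the opening post. The SRAM figure is an assumption based on the "almost 2.5 MB" the original poster mentions later in the thread, and the helper names are hypothetical. It also assumes each pool is a separate static buffer, and it ignores stacks, heap and other data, so the real margin is smaller.

```c
#include <assert.h>
#include <stddef.h>

/* Pool sizes as quoted in the opening post. */
#define TX_APP_MEM_POOL_SIZE            (1024 * 1024)
#define UX_DEVICE_APP_MEM_POOL_SIZE     (500 * 1024)
#define USBPD_DEVICE_APP_MEM_POOL_SIZE  (10000)

/* ASSUMPTION: "almost 2.5 MB" of SRAM, as stated in the thread.
 * Check the datasheet and the linker script for the real value. */
#define ASSUMED_SRAM_BYTES              (2500u * 1024u)

/* Sum of the three statically allocated pools, in bytes. */
size_t total_pool_bytes(void)
{
    return (size_t)TX_APP_MEM_POOL_SIZE
         + (size_t)UX_DEVICE_APP_MEM_POOL_SIZE
         + (size_t)USBPD_DEVICE_APP_MEM_POOL_SIZE;
}

/* Returns 1 if the pools alone fit into `sram_bytes`, 0 otherwise. */
int pools_fit(size_t sram_bytes)
{
    return total_pool_bytes() <= sram_bytes;
}

/* Build-time guard: compilation fails if the pools outgrow the budget. */
_Static_assert(TX_APP_MEM_POOL_SIZE + UX_DEVICE_APP_MEM_POOL_SIZE
               + USBPD_DEVICE_APP_MEM_POOL_SIZE < ASSUMED_SRAM_BYTES,
               "memory pools exceed the assumed SRAM size");
```

With the quoted values the pools add up to 1,570,576 bytes, which fits the assumed budget but would not fit a part with 1 MB of RAM. The linker map file remains the authoritative answer.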

For ACM (USB VCP UART) it is really enough to test with 64-byte packets. As I understand:

  • the max. packet size for one ACM/VCP UART packet is 64 bytes - anyway!
  • see the EP configuration in the USB stack, e.g. for Enumeration/Descriptors
  • check the DMA config for USB: whether it would really be able to send more than 64 bytes per USB packet
  • but why send more than USB allows in one packet (64 bytes)?

Even in my ACM project the DMA is set to "disable":

void MX_USB_OTG_HS_PCD_Init(void)
{
  hpcd_USB_OTG_HS.Instance = USB_OTG_HS;
  hpcd_USB_OTG_HS.Init.dev_endpoints = 9;
  /*
   * ATTENTION: this must be FS speed in order to work on Android Ethernet tethering!
   * enumeration works but not the DHCP server to get an IP address. This setting solves the issue!
   */
  hpcd_USB_OTG_HS.Init.speed = PCD_SPEED_HIGH;		//PCD_SPEED_FULL;	//PCD_SPEED_HIGH;
  hpcd_USB_OTG_HS.Init.phy_itface = USB_OTG_HS_EMBEDDED_PHY;
  hpcd_USB_OTG_HS.Init.Sof_enable = DISABLE;
  hpcd_USB_OTG_HS.Init.low_power_enable = DISABLE;
  hpcd_USB_OTG_HS.Init.lpm_enable = DISABLE;
  hpcd_USB_OTG_HS.Init.use_dedicated_ep1 = DISABLE;
  hpcd_USB_OTG_HS.Init.vbus_sensing_enable = DISABLE;
  hpcd_USB_OTG_HS.Init.dma_enable = DISABLE;
  if (HAL_PCD_Init(&hpcd_USB_OTG_HS) != HAL_OK)
  {
    Error_Handler();
  }
}

I assume running USB with DMA needs much more than just setting it to ENABLE: a DMA engine/channel must be enabled, DMA interrupt handlers have to be there... Maybe this is what causes your hard fault handler to be called...

 

 

BTW: I tried: I have enabled DMA

void MX_USB_OTG_HS_PCD_Init(void)
{
  hpcd_USB_OTG_HS.Instance = USB_OTG_HS;
  hpcd_USB_OTG_HS.Init.dev_endpoints = 9;
  /*
   * ATTENTION: this must be FS speed in order to work on Android Ethernet tethering!
   * enumeration works but not the DHCP server to get an IP address. This setting solves the issue!
   */
  hpcd_USB_OTG_HS.Init.speed = PCD_SPEED_HIGH;		//PCD_SPEED_FULL;	//PCD_SPEED_HIGH;
  hpcd_USB_OTG_HS.Init.phy_itface = USB_OTG_HS_EMBEDDED_PHY;
  hpcd_USB_OTG_HS.Init.Sof_enable = DISABLE;
  hpcd_USB_OTG_HS.Init.low_power_enable = DISABLE;
  hpcd_USB_OTG_HS.Init.lpm_enable = DISABLE;
  hpcd_USB_OTG_HS.Init.use_dedicated_ep1 = DISABLE;
  hpcd_USB_OTG_HS.Init.vbus_sensing_enable = DISABLE;
  hpcd_USB_OTG_HS.Init.dma_enable = ENABLE;	//DISABLE;
  if (HAL_PCD_Init(&hpcd_USB_OTG_HS) != HAL_OK)
  {
    Error_Handler();
  }
}

It works as before (no issues for me - no difference).

I have tried with my ACM code (U5A5, AZURE RTOS). Even though I do not trust the tick count frequency, here is what I get:

Send 640 KB of data from MCU to host:

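	int i;
	unsigned int startTS, endTS;
	startTS = HAL_GetTick();	/* tick count before sending */
	for (i = 0; i < 10000; i++)	/* 10000 packets of 64 bytes = 640000 bytes */
		VCP_UART_Send((const uint8_t *)"1111111111222222222233333333334444444444555555555566666666667777", 64);
	endTS = HAL_GetTick();	/* tick count after sending */
	/* %u matches unsigned int and %d matches int; "%ul" would print the value followed by a literal 'l' */
	print_log(out, "\r\nstart: %u | end: %u | delta: %u | %d bytes\r\n", startTS, endTS, endTS - startTS, i * 64);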

It reports:

start: 763 | end: 1264 | delta: 501 | 640000 bytes

My AZURE RTOS HAL_GetTick() seems to be wrong:
It takes approx. 1 sec to send these 640,000 bytes (based on host terminal observation).
The elapsed tick count of 501 seems to actually correspond to 1000 ms.

This would result in a VCP UART (ACM) throughput of approx. 5,120,000 bps (5.1 Mbps).
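
The arithmetic behind this figure can be written as a small helper. This is only a sketch with a hypothetical function name. It shows both readings of the measurement: the roughly 1 s observed on the host terminal, and the 501 ticks reported by HAL_GetTick() if those were trusted as milliseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in bits per second for `bytes` transferred in
 * `elapsed_ms` milliseconds (integer division, rounds down). */
uint64_t throughput_bps(uint64_t bytes, uint32_t elapsed_ms)
{
    assert(elapsed_ms > 0u);
    return bytes * 8u * 1000u / elapsed_ms;
}
```

throughput_bps(640000, 1000) gives 5,120,000 bps, the figure quoted above. throughput_bps(640000, 501) gives 10,219,560 bps, so the result roughly doubles depending on which clock is trusted. Timing the transfer on the host side as well helps to decide between the two.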

Even though I have enabled HS, I was also expecting a higher throughput.
Anyway: it is still in the range I expected for ACM/VCP UART: 5...7 Mbps (even with USB configured as HS). It depends on the host, e.g. the host requests a new response from the MCU just every 1 ms, or with an even longer period when it is busy.

Maybe:
The host display of received bytes slows it down! The PC prints all the received bytes in TeraTerm. Maybe this slows down the speed (the host will not request data as fast when it cannot display it as fast).

Remark:
Measuring the USB VCP (ACM) throughput depends heavily on how the host interacts. You cannot judge the performance/speed of the MCU just by measuring what you can do on the MCU side: the MCU is a Device and the timing depends on the Host (PC), and the host can slow things down!

What you measure in the end is: "what the host is capable of processing in real time" (what it allows the MCU to send).

 

Kannan1
Associate III

@FBL Thanks for the reply. I have fixed the hard fault issue in my code, and now the ux_device_class_cdc_acm_write_with_callback API works successfully in non-blocking mode. But the speed has dropped to 3 MB/s, from 5 MB/s with the blocking-mode API.

I have also tried enabling the internal IP DMA in the configuration. Screenshots of the ioc configuration for DMA are attached below. The GPDMA configuration was already present in the example project.

[Screenshots: DMA enable, DMA Rx channel, DMA Tx channel]

After enabling DMA in the ioc, the write function _ux_dcd_stm32_transfer_request fails to send the data over USB and waits on a semaphore:

            /* We should wait for the semaphore to wake us up.  */
            status =  _ux_utility_semaphore_get(&transfer_request -> ux_slave_transfer_request_semaphore,
                                                (ULONG)transfer_request -> ux_slave_transfer_request_timeout);

So my question is: are any other configurations needed for the internal DMA to work? This is where I am stuck with the DMA configuration.

Next, in the working project without DMA and using the callback method, there is another issue: I suspect a delay that limits the USB speed. I will explain the issue.

After calling ux_device_class_cdc_acm_write_with_callback in usbx_cdc_acm_write_thread_entry, I wait in the thread for the transaction to finish and for the callback function to be executed from _ux_device_class_cdc_acm_bulkin_thread. Here is the code in usbx_cdc_acm_write_thread_entry.

while(total_bytes_to_send)
{
    if (usb_ready)
    {
		usb_ready = 0;

		if (buf_indx >= APP_TX_DATA_SIZE)
		{
			buf_indx = 0;
		}

		if (total_bytes_to_send > APP_TX_DATA_SIZE)
		{
			bytes_to_send = APP_TX_DATA_SIZE;
		}
		else
		{
			bytes_to_send = total_bytes_to_send;
		}

		if (ux_device_class_cdc_acm_write_with_callback(cdc_acm, (UCHAR *)(&UserTxBufferFS[buf_indx]),
			bytes_to_send) != UX_SUCCESS)
		{
			// Error condition
			total_bytes_to_send = 0;
		}
    }
	
    /* Sleep thread for 10ms */
    tx_thread_sleep(MS_TO_TICK(10));
}

Here, the usb_ready flag is set from the callback function after each write. The issue is the tx_thread_sleep(MS_TO_TICK(10)); call: without this 10 ms delay, the event sent from the application does not trigger the _ux_device_class_cdc_acm_bulkin_thread function to send the data.

           /* Wait until we have a event sent by the application. */
            status =  _ux_utility_event_flags_get(&cdc_acm -> ux_slave_class_cdc_acm_event_flags_group, UX_DEVICE_CLASS_CDC_ACM_WRITE_EVENT,
                                                                                            UX_OR_CLEAR, &actual_flags, UX_WAIT_FOREVER);

            /* Check the completion code. */
            if (status == UX_SUCCESS)
            {

So is this delay unavoidable? Maybe it causes the throughput issue with the callback method in my case. Please let me know your comments/suggestions. I have been trying for a while to increase the USB speed on the U5A5.
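
A quick estimate supports this suspicion. In the loop above at most one write can start per pass, and every pass sleeps for 10 ms, so the sleep alone sets an upper bound on throughput. The sketch below uses a hypothetical function name and assumes that APP_TX_DATA_SIZE equals the 32768-byte buffer mentioned in the opening post, which the thread does not confirm.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound in bytes per second when at most one write of
 * `bytes_per_write` starts per loop pass and every pass sleeps
 * for `sleep_ms` milliseconds (transfer time itself is ignored). */
uint32_t sleep_bound_bytes_per_second(uint32_t bytes_per_write,
                                      uint32_t sleep_ms)
{
    assert(sleep_ms > 0u);
    return (uint32_t)((uint64_t)bytes_per_write * 1000u / sleep_ms);
}
```

With the assumed 32768 bytes per write and a 10 ms sleep the bound is 3,276,800 bytes/s, which is close to the reported 3 MB/s. A 1 ms sleep would raise the bound to 32,768,000 bytes/s, so shortening or removing the wait looks worth trying before changing anything else.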

@tjaekel Could you please suggest an article or document on configuring the USB DMA, or describe your method? The RAM size is almost 2.5 MB, which is why I have used it so generously; maybe it could cause an issue later. Thanks for pointing it out.

I think my STM32U5A5 (according to the reference manual, RM) does not support DMA mode for a device.
I use DMA on a regular UART. But for USB DMA, even when I set:

  hpcd_USB_OTG_HS.Init.dma_enable = ENABLE;	//DISABLE;

there is no speed difference.

When I check the RM, I see this:

OTG_HS embeds an internal DMA with thresholding support and software selectable
AHB burst type in DMA mode. (page 3276)

73.15.4 DMA mode
The OTG host uses the AHB master interface to fetch the transmit packet data (AHB to
USB) and receive the data update (USB to AHB). The AHB master uses the programmed
DMA address (OTG_HCDMAx register in host mode and
OTG_DIEPDMAx/OTG_DOEPDMAx register in peripheral mode) to access the data
buffers. (page 3390)

So, as I understand:

  • USB has a dedicated (different) DMA (USB Host has its own DMA engine)
  • and this DMA can be enabled in USB Host mode (only) (all about DMA on USB is related to Host)

But USB VCP UART is a USB Device. So, I assume, USB does not use a DMA for VCP UART transfer (in both directions).
It has FIFOs and INTs, which should be still fast enough for a good throughput.

I tried to build the project with non-blocking write. When I comment out

"#define UX_DEVICE_CLASS_CDC_ACM_TRANSMISSION_DISABLE" in ux_user.h,

the code always fails at the _ux_utility_memory_free_block_best_get() function called by

" cdc_acm -> ux_slave_class_cdc_acm_bulkout_thread_stack =
_ux_utility_memory_allocate(UX_NO_ALIGN, UX_REGULAR_MEMORY, UX_THREAD_STACK_SIZE * 2);"

in function _ux_device_class_cdc_acm_initialize(), no matter how much I increase

#define TX_APP_MEM_POOL_SIZE (1024*1024)

#define UX_DEVICE_APP_MEM_POOL_SIZE (8192*8)

#define USBPD_DEVICE_APP_MEM_POOL_SIZE (5000*2)

How did you modify the original code to make non-blocking write work?

 

The tx_thread_sleep(MS_TO_TICK(10)) cannot be removed, since those threads run at the same priority with no time slice defined.

I guess you could call ux_device_class_cdc_acm_write_with_callback() from the callback function to avoid this.

What I have realized:
using AZURE RTOS and the USB stacks, my project has used "blocking" mode. But changing to the non-blocking API call ended up with a missing function.

There is a LIB file involved (linked to the project to resolve missing functions). I had no idea which LIB file provides the "non-blocking" implementation.

I assume there is much more to do than just modifying some macros: you might need different header files, and especially a different LIB file.

It is also a bit unclear whether a header file with macros that set memory sizes is just a "reference" for what the LIB code uses. Even if you change it, if the LIB was compiled with these settings, it will not have an effect (it just "tells" you with which parameters the linked LIB was built).

"Have fun" (I gave up converting from blocking to non-blocking, since it potentially needs a different LIB file to link).