2025-03-16 5:07 AM
I've been attempting to stream audio in via a USB microphone using the USB host functionality on my STM32F411. Unfortunately, it seems as though the audio class has been left unfinished and, although about 50% of the code is there, the other important 50%, that actually streams the input, is missing. I have spent a significant amount of time testing and googling (and frankly using a bit of ChatGPT because I don't have much time to spend digging around in the USB spec) to try and piece together what is missing.
I'm at a point where I am receiving data, but the data just seems to be entirely wrong. It doesn't seem garbled, but it tends to just repeat the same packet data over and over again, and sometimes it doesn't receive anything and the buffer stays filled with zeros.
This post was very helpful in laying the base, but I have since expanded on it a little bit. The first important bit I added was to set the microphone to use the highest frequency available, which in my case is 48 kHz. This is purely so I am 100% confident that the packet size I calculate is correct; otherwise I would just have to guess that the microphone is set to that frequency. Below is the code used to do this, which I have placed inside the `USBH_AUDIO_ClassRequest` function. It's not amazing but it works.
    case AUDIO_REQ_GET_FREQ:
        if (AUDIO_Handle->microphone.supported == 1U) {
            // pick the highest sample rate the microphone advertises
            uint32_t max_freq = 0;
            for (int i = 0; i < AUDIO_MAX_SAMFREQ_NBR; i++) {
                uint32_t freq = get_sample_rate(AUDIO_Handle->class_desc.as_desc[0].FormatTypeDesc, i);
                if (max_freq < freq) {
                    max_freq = freq;
                }
            }
            uint8_t bit_depth = get_bit_depth(AUDIO_Handle->class_desc.as_desc[0].FormatTypeDesc);
            uint8_t n_chan = get_n_channels(AUDIO_Handle->class_desc.as_desc[0].FormatTypeDesc);
            AUDIO_Handle->microphone.frequency = max_freq;
            // !!!!! frame_length is important as this is the size of packet
            AUDIO_Handle->microphone.frame_length = ((max_freq * bit_depth * n_chan) / 8000U);
            // setup single packet buffer
            if (AUDIO_Handle->microphone.buf) {
                USBH_free(AUDIO_Handle->microphone.buf);
            }
            uint16_t buf_len = AUDIO_Handle->microphone.frame_length;
            AUDIO_Handle->microphone.buf = (uint8_t *)USBH_malloc(buf_len);
            if (AUDIO_Handle->microphone.buf == NULL) {
                return USBH_FAIL;
            }
            memset(AUDIO_Handle->microphone.buf, 0, buf_len);
            AUDIO_Handle->req_state = AUDIO_REQ_SET_FREQ;
        }
        break;
    case AUDIO_REQ_SET_FREQ:
        if (AUDIO_Handle->microphone.supported == 1U) {
            if (AUDIO_Handle->freq_state == AUDIO_FREQ_SET_INFERFACE1) {
                // select alternate setting 0 (idle) first
                USBH_StatusTypeDef status = USBH_SetInterface(phost, AUDIO_INTERFACE_NUM, 0);
                if (status == USBH_OK) {
                    AUDIO_Handle->freq_state = AUDIO_FREQ_SET_INFERFACE2;
                } else if (status == USBH_NOT_SUPPORTED) {
                    status = USBH_FAIL;
                }
            } else if (AUDIO_Handle->freq_state == AUDIO_FREQ_SET_INFERFACE2) {
                // then switch to alternate setting 1 (streaming)
                USBH_StatusTypeDef status = USBH_SetInterface(phost, AUDIO_INTERFACE_NUM, 1);
                if (status == USBH_OK) {
                    AUDIO_Handle->freq_state = AUDIO_FREQ_URB_OUT;
                } else if (status == USBH_NOT_SUPPORTED) {
                    status = USBH_FAIL;
                }
            } else if (AUDIO_Handle->freq_state == AUDIO_FREQ_URB_OUT) {
                // the index found in the GET_FREQ case is out of scope here, so rebuild the
                // 3-byte little-endian tSamFreq value from the stored frequency instead
                // (assumes get_sample_rate() returned the plain rate in Hz)
                uint32_t freq = AUDIO_Handle->microphone.frequency;
                AUDIO_Handle->mem[0] = (uint8_t)(freq & 0xFFU);
                AUDIO_Handle->mem[1] = (uint8_t)((freq >> 8) & 0xFFU);
                AUDIO_Handle->mem[2] = (uint8_t)((freq >> 16) & 0xFFU);
                USBH_StatusTypeDef status = USBH_AUDIO_SetEndpointControls(phost, AUDIO_Handle->microphone.Ep, (uint8_t *)(AUDIO_Handle->mem));
                if (status == USBH_OK) {
                    AUDIO_Handle->freq_state = AUDIO_FREQ_SET_INFERFACE1;
                    AUDIO_Handle->req_state = AUDIO_REQ_IDLE;
                } else if (status == USBH_NOT_SUPPORTED) {
                    status = USBH_FAIL;
                }
            }
        }
        break;
This makes sure the microphone is set to 48 kHz, so the frame_length will be (48000 * 16 * 1) / 8000 = 96 bytes per packet, which is correct and within the 100-byte `wMaxPacketSize`.
The next important part is the replacement of the `USBH_AUDIO_SOFProcess` and `USBH_AUDIO_InputStream` functions, which are what handle the incoming data.
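As a quick sanity check of that arithmetic, here is a minimal standalone sketch. The helper names `decode_sam_freq` and `bytes_per_frame` are made up for illustration (they are not part of the ST library, and `decode_sam_freq` is only a guess at what a helper like `get_sample_rate` does). It assumes a full-speed device sending one isochronous packet per 1 ms frame, and the 3-byte little-endian `tSamFreq` layout from the USB Audio Class 1.0 format descriptor.

```c
#include <stdint.h>

/* Hypothetical helper: decode one 3-byte little-endian tSamFreq entry into Hz. */
static uint32_t decode_sam_freq(const uint8_t f[3]) {
    return (uint32_t)f[0] | ((uint32_t)f[1] << 8) | ((uint32_t)f[2] << 16);
}

/* Hypothetical helper: PCM bytes per 1 ms frame.
 * bits per second / 8 bits per byte / 1000 frames per second = / 8000.
 * Integer division truncates, so this is only exact when
 * rate_hz * bit_depth * n_chan is a multiple of 8000. */
static uint32_t bytes_per_frame(uint32_t rate_hz, uint32_t bit_depth, uint32_t n_chan) {
    return (rate_hz * bit_depth * n_chan) / 8000U;
}
```

With the 48 kHz entry `{0x80, 0xBB, 0x00}`, `decode_sam_freq` gives 48000, and `bytes_per_frame(48000, 16, 1)` gives 96. For 44100 Hz the division truncates to 88, so the result is no longer exact.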
// audio_flag is a file-scope flag declared elsewhere (not shown here)
static USBH_StatusTypeDef USBH_AUDIO_SOFProcess(USBH_HandleTypeDef *phost) {
    AUDIO_HandleTypeDef *AUDIO_Handle = (AUDIO_HandleTypeDef *)phost->pActiveClass->pData;
    if (audio_flag == 0) {
        audio_flag = 1; // set flag to indicate new data should be received
    }
    return USBH_OK;
}
static USBH_StatusTypeDef USBH_AUDIO_InputStream(USBH_HandleTypeDef *phost) {
    USBH_StatusTypeDef status = USBH_OK;
    AUDIO_HandleTypeDef *AUDIO_Handle;
    AUDIO_Handle = (AUDIO_HandleTypeDef *)phost->pActiveClass->pData;
    if (audio_flag == 1) {
        audio_flag = 0; // reset flag
        // request one packet of frame_length bytes on the microphone pipe
        USBH_StatusTypeDef status2 = USBH_IsocReceiveData(phost, AUDIO_Handle->microphone.buf,
                                                          AUDIO_Handle->microphone.frame_length,
                                                          AUDIO_Handle->microphone.Pipe);
        // print the buffer to get an idea of the data
        if (status2 == USBH_OK) {
            for (int i = 0; i < AUDIO_Handle->microphone.frame_length; i++) {
                printf("%d,", AUDIO_Handle->microphone.buf[i]);
            }
            printf("\n");
        }
    } else {
        status = USBH_BUSY;
    }
    return status;
}
Here is where something is wrong. The buffer never seems to be filled with any real PCM data, and it often just seems to repeat the same packet data over and over again, but I cannot for the life of me figure out why. I have tried flipping around the code, where I read the packet inside the SOF callback instead, and then process it later in the InputStream function, but it replicates the same behaviour.
One point of confusion I have run into: I have used USBPcap to look at the packets in Wireshark, and at least from what Wireshark is showing, each frame actually contains 10 packets, whereas in my case each frame is a single packet. However, according to the old ChatGPT, the `wMaxPacketSize` is supposed to show the maximum size of a frame, and if each frame contains ten 96-byte packets, then it should be 960 bytes, but it still shows as just 100 bytes, which is just 1 packet. This also aligns with the behaviour I see, since if I raise the buffer to 960 bytes it still only fills with a single packet's worth of bytes, not ten. So I'm a little lost as to who to believe: I can only receive one packet, but I know my code is broken, and if Wireshark is showing ten packets per frame, maybe that is correct?
I wish I could be a little more helpful in providing info but I have just been desperately trying anything and so it's hard to keep track of what I have tested.
If anyone has any idea how I can receive these PCM packets correctly that would be massively appreciated since I'm at a total loss!
2025-03-16 5:36 AM
About Audio Host, all I can say is this: I tried it with the STM "examples", which are more or less ONE example in several copies, and that one is only for connecting a headphone set and nothing else. So maybe you can use it to get a microphone signal in, but that's all, if it works at all. I didn't bother with that useless "Audio Host" and tried Azure RTOS instead, because there the Audio Host can enumerate a standard USB sound card and it really works. But then I gave up on that too, because for hi-end audio it seems useless: as far as I could tell, only at 48 kHz is a correct stream possible, never at 44.1 kHz, because a packet cannot hold 44.1 samples, only 44, so even this basic requirement cannot work on the USB bus. I just don't know how they do it with the hi-end audio DACs; maybe they connect in bulk mode, and the DAC has some buffer memory and does the correct timing itself.
2025-03-16 6:17 AM
I've based this off the example for audio already, but it's unfinished and doesn't work, so I'm attempting to fix it myself. I agree with your confusion about the 44.1 kHz signal. I'm not sure how that's handled either, since it doesn't divide to an integer size, but I'm not too concerned with that right now since I'm just sticking to 48 kHz.
Maybe I need to pivot to an RTOS and TinyUSB or something, as it doesn't seem like the STM USB stack will suit this.
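For what it's worth, here is a rough sketch of how a non-integer rate like 44.1 kHz can still fit into 1 ms frames. My understanding of USB Audio Class 1.0 is that packet sizes are allowed to vary by one sample, so the fractional part is carried over from frame to frame. The function name `samples_this_frame` and the accumulator approach are illustrative assumptions, not code from the ST stack and not verified against this microphone.

```c
#include <stdint.h>

/* Hypothetical helper: number of samples to carry in the next 1 ms frame.
 * *acc holds the leftover in units of 1/1000 sample and must start at 0. */
static uint32_t samples_this_frame(uint32_t rate_hz, uint32_t *acc) {
    *acc += rate_hz;             /* add one frame's worth (rate / 1000 samples, scaled by 1000) */
    uint32_t n = *acc / 1000U;   /* whole samples that fit into this frame */
    *acc %= 1000U;               /* keep the fractional remainder for the next frame */
    return n;
}
```

Called ten times with `rate_hz = 44100` and the accumulator starting at 0, this returns 44 nine times and then 45 once, which adds up to 441 samples per 10 ms.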
2025-03-16 7:29 AM - edited 2025-03-16 7:30 AM
Yes, I cannot say much about the STM "example", because when I saw that it cannot enumerate anything other than the headset for a basic test, I gave up on this useless thing.
But the 44.1 kHz problem seems to be a basic USB characteristic. I always wondered why Windows and Linux always resample to 48 kHz in the so-called "mixer"; now I know: only 48 kHz runs at its correct frequency when you connect a USB sound card etc.
And I would look at TinyUSB first, just to see whether it can do audio host and what it can really enumerate.
Azure is only the last option, because then you have to deal with a lot of other problems; just getting Azure running at all might take more than a few mouse clicks.
2025-03-16 8:46 AM
It is very unfortunate that STM never finished the audio example. They have at least half the code written already to enumerate the microphone, they just never wrote the code to stream the data in. It's weird.
I will have another look at TinyUSB and see what I can find. I had a brief look a while ago but it seemed fairly complicated. To be honest, I've also bought a Pi Pico to try instead, which is annoying because I really wanted to get it working with STM as I've always wanted to try them out but, oh well.
I will leave this open for now. Maybe someone out there will have the solution, though it seems unlikely as no one seems to have really tried this before.
2025-03-16 2:07 PM
Maybe this could help you, just to see what's missing in the STM example and copy it: