Aishwarya
Associate III
October 1, 2021
Question

Why does BLE Data Transmission block Arm® Cortex®-M4?

STM32WB controllers are dual-core processors, with an Arm® Cortex®-M4 core running at 64 MHz for the application and an Arm® Cortex®-M0+ core running at 32 MHz for the network (BLE) stack.

To transmit data over BLE, the aci_gatt_update_char_value() function is used, which blocks the Arm® Cortex®-M4 core.

Why does the BLE transmission function block the Arm® Cortex®-M4?

Is there an approach or method to transmit data via BLE without blocking the Arm® Cortex®-M4?

This topic has been closed for replies.

10 replies

Remi QUINTIN
Technical Moderator
October 5, 2021

How do you detect the M4 core is blocked?

This is not expected. The M4 and M0+ cores are two separate and independent cores. The only shared resources are the shared SRAM buffers used for IPCC and the flash memory. Shared resources are protected via semaphores.

Maybe the M4 core is waiting for an event from the M0+ core?

jro
Associate III
October 7, 2021

It's blocked because the supplied implementation of hci_cmd_resp_wait() in hci_tl.c has an INFINITE while loop (ignoring the timeout parameter!) which spins until hci_cmd_resp_release() is called from the interrupt-level code when CPU2 finishes its work.

Both hci_cmd_resp_ definitions are WEAK, so you can provide your own, and perform other functions "inside" hci_cmd_resp_wait(). This is non-trivial...
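
A minimal host-side sketch of that idea, assuming the two functions keep the signatures declared in hci_tl.h (void hci_cmd_resp_wait(uint32_t timeout) and void hci_cmd_resp_release(uint32_t flag)). The hook app_background_work() and the sim_ variables are hypothetical names used only to make the example runnable on a PC; on target, the release would come from the IPCC interrupt rather than from the fake counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Set by hci_cmd_resp_release() (interrupt context on target). */
static volatile bool cmd_resp_released;

/* Simulation only: counts background passes and fakes the CPU2 response. */
static uint32_t background_runs;
static uint32_t sim_release_after;

/* Application-provided replacement for the weak release function. */
void hci_cmd_resp_release(uint32_t flag)
{
    (void)flag;
    cmd_resp_released = true;
}

/* Hypothetical application hook: do anything here EXCEPT BLE stack calls. */
static void app_background_work(void)
{
    background_runs++;
    /* Simulation only: pretend the IPCC interrupt fires after N passes. */
    if (background_runs >= sim_release_after) {
        hci_cmd_resp_release(0);
    }
}

/* Application-provided replacement for the weak wait function. */
void hci_cmd_resp_wait(uint32_t timeout)
{
    (void)timeout; /* not honoured here, same as the default implementation */
    while (!cmd_resp_released) {
        app_background_work();
    }
    cmd_resp_released = false; /* consume the release for the next command */
}
```

The flag is cleared after the wait rather than before it, so a release that arrives before hci_cmd_resp_wait() is entered is not lost. Whatever runs inside the loop has to stay away from the BLE stack, since a stack call is already in progress.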

Aishwarya (Author)
Associate III
October 8, 2021

Thanks a lot @jro, that helped.

It is surprising to see that community members understand the environment and the problems that might arise better than ST employees themselves.

Are there any application notes or similar documents that elaborate on such system functions, and on how far the developer can modify them without "breaking" them? Due to the lack of official documentation, or even comments in the generated code, it is very challenging to clearly understand the control flow around and into such functions.

With an expression of gratitude...

ST Employee
October 8, 2021

Hello,

When the aci_gatt_update_char_value() is sent to transmit data, two things happen :

1/ The command is sent from CPU1 to CPU2 and is acknowledged by CPU2. The data is stored in our internal buffers.

2/ The data is sent over BLE when possible.

The first step may hold CPU1 until CPU2 acknowledges the command whereas the second step is of course not blocking CPU1 execution.

So CPU2 does not block CPU1 during execution, but some synchronization is required when sending a command.

Here is some background:

An application code should look like:

.....
aci_***();
...
...
aci_***();
... 
aci_***();

Obviously, in your application, you expect each aci_***() command to be completed on return before moving forward in the code and sending a new one. This is especially true when the command returns some status.

On a single core, there is no question.

For a dual core, the command is executed on the second core. However, the application shall be working in the same way so it still does not expect the aci_***() command to return before it has been acknowledged by CPU2.

When the weak functions hci_cmd_resp_wait() and hci_cmd_resp_release() are not implemented in the application, the default behavior is that the HCI Transport Layer is blocking until the CPU2 provides the acknowledge. The overhead of the IPCC communication between the two cores is a couple of tens of µs.

So, there should not be any significant performance issue in the application when using the default implementation.

For those applications running at 64 MHz where every ten µs matters, we implemented a mechanism to hold the current process and execute another one.

This is achieved with hci_cmd_resp_wait() and hci_cmd_resp_release(). You may check the BLE_HeartRateFreeRTOS project where OS semaphores are used.

When using an OS and the dedicated Semaphores, the current task is pending until CPU2 provides the acknowledgment and CPU1 may run any other OS tasks in the meantime.

When using a bare-metal implementation, such a feature is usually not available. However, the sequencer (which is a simple bare-metal packaging AND NOT a custom OS) provides similar functionality that is used in all our non-OS examples.

By default, there should be no reason to change the implementation, although this is possible.

The mechanism is described in AN5289.

In rev5, you may find:

  • an MSC flow description given at p136
  • an API description given at p138 - (note: I noticed a typo where shci_cmd_resp_wait() is described twice; the second description is related to shci_cmd_resp_release() - this will be fixed in some later release)

There is the exact same information for hci_cmd_resp_wait() at page 143 and 144.

You may also check the descriptions given in the header files hci_tl.h and shci_tl.h in the Cube Package Firmware.

Please, let us know which kind of information you are missing in the AN5289 so that we may improve the description.

Regards.

OCatt.1
Associate III
November 23, 2021

Sorry, what is the solution to the original issue of the transmission causing blocking?

I am presently having this exact issue however I'm unable to find a resolution.

ST Employee
November 23, 2021

Hello,

Could you please provide more details on your issue.

Basically, there is no blocking on the CPU1 due to BLE transmission on CPU2.

Regards.

OCatt.1
Associate III
November 24, 2021

Hi @Christophe Arnal,

For more details on my issue please read the below response to jro, as well as the post found [HERE].

Thank you.

jro
Associate III
November 24, 2021

Basically, there is blocking in the CPU1 stack - you've contradicted your previous post, @Christophe Arnal! You said "When the weak functions hci_cmd_resp_wait() and hci_cmd_resp_release() are not implemented in the application, the default behavior is that the HCI Transport Layer is blocking until the CPU2 provides the acknowledge". Which is true.

@OCatt.1, you'll have to implement your own copies of the above functions, such that when execution reaches your hci_cmd_resp_wait() your application loops doing anything except BLE stack calls (because it's already doing one...), and returns when your hci_cmd_resp_release() has been called (which is done under interrupt).

As I noted previously, this is non-trivial. I've done it by implementing my system in FreeRTOS, so the BLE stack has its own task, and no other task cares if it's blocked. This is complex and in any case I can't post the code because I did it for work purposes! I think the ST sequencer is intended to provide this sort of capability at a very basic level - you'll find it in Utilities/sequencer/stm32_seq.c and .h, several examples use it.

OCatt.1
Associate III
November 24, 2021

Hi @jro,

Thanks for the response and for the depth you've gone into with your answer, it's been very helpful.

In my personal case, within the hci_send_req function the while(local_cmd_status == HCI_TL_CmdBusy) stalls as the inner loop is never entered (no events raised) to change the local_cmd_status value. I originally thought the hci_cmd_resp_wait() function was causing the stall (never exited as no events raised) however realised that they were both part of the same problem.

Is there a reason that when changing the characteristic value this situation arises? Should I be receiving an event in this situation?

I am yet to attempt using a sequencer and will attempt it shortly, however I was planning on implementing a timeout on the hci_cmd_resp_wait (possibly that timeout which currently does nothing...) to release the wait and alter the local_cmd_status value simultaneously, however I'm unsure if this may cause issues elsewhere?

Any advice you could give me would be greatly appreciated.

jro
Associate III
December 2, 2021

Hi @OCatt.1,

Just taken a look at this, though not with a debugger, so this will be a bit theoretical ...

I only ever use aci_gatt_update_char_value() to change a Characteristic's value (there's also an aci_gatt_write_char_value() but I've no real idea what the difference is: seems to be connection-based rather than service-based, and maybe starts the process but doesn't complete it): both call hci_send_req(). The documentation says aci_gatt_write_char_value() generates an event, but not aci_gatt_update_char_value() - that just returns an error code. I notice that Christophe Arnal only mentioned aci_gatt_update_char_value() above.

As far as I can see hci_cmd_resp_wait() is poorly conceived. Although it nominally has a timeout, the ST-supplied implementation ignores it, and even if your implementation uses it, there's no official way to figure out if the wait completed or timed out, as the enum concerned only has WAIT and RELEASE values. In hci_send_req() the outermost while() loop spins until it gets a response, so the system can very readily get terminally wedged.
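
A rough host-side sketch of one way around the missing "timed out" value: keep the outcome in a separate application-owned variable. Everything here other than the two function signatures is an assumption: wait_outcome_t, last_wait_outcome and app_get_tick_ms() are hypothetical names, and the simulated clock simply advances by 1 ms per call (on target a millisecond tick such as HAL_GetTick() could play that role).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical application-side result, kept outside the stack's own enum. */
typedef enum { WAIT_PENDING, WAIT_RELEASED, WAIT_TIMED_OUT } wait_outcome_t;

static volatile bool cmd_resp_released;
static volatile wait_outcome_t last_wait_outcome = WAIT_PENDING;

/* Simulation only: a fake millisecond clock that advances on every read. */
static uint32_t sim_tick_ms;
static uint32_t app_get_tick_ms(void)
{
    return sim_tick_ms++;
}

void hci_cmd_resp_release(uint32_t flag)
{
    (void)flag;
    cmd_resp_released = true;
}

void hci_cmd_resp_wait(uint32_t timeout)
{
    const uint32_t start = app_get_tick_ms();

    last_wait_outcome = WAIT_PENDING;
    while (!cmd_resp_released) {
        /* Unsigned subtraction stays correct across tick wrap-around. */
        if ((uint32_t)(app_get_tick_ms() - start) >= timeout) {
            last_wait_outcome = WAIT_TIMED_OUT;
            return;
        }
    }
    cmd_resp_released = false; /* consume the release */
    last_wait_outcome = WAIT_RELEASED;
}
```

This only makes the timeout visible to application code, for example for logging or fault handling. Since the outer loop in hci_send_req() spins until it gets a response, it would presumably just call the wait function again after a timeout, so this alone does not un-wedge the stack.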

You've not said which function you're calling, but if it's not aci_gatt_update_char_value() you might want to give that a try...

OCatt.1
Associate III
December 3, 2021

Hi @jro,

Thank you for the response!

Apologies, I am using the aci_gatt_update_char_value() function in order to change the Characteristic's value.

I have managed to circumvent the issue by implementing a short timeout within the outermost while() loop such as to break it free, whilst also removing the hci_cmd_resp_wait() call. This doesn't seem like the intended/optimal use case, do you mind me asking how you went about resolving this issue for yourself? Was the implementation of a FreeRTOS you mentioned previously the solution?

My current project serves as a file/data transfer link between a PC app and an external device, with the STM32WB55 Nucleo as the BLE link. I transfer data from the app by updating the Characteristic value on the Nucleo, which then forwards the received data via UART to my device. The device responds with an Ack, and on receiving that serial data (HW_UART_Receive_IT) the Nucleo updates the Characteristic value via aci_gatt_update_char_value(), so that the PC app receives the callback for the Characteristic "ValueChanged" event and sends the next packet.

I mention this for context, and to ask about throughput. Currently this process is extremely slow: each packet carries 253 bytes and there is an interval of approximately 0.5 seconds between packets, which as you can imagine makes even small files take a very long time. Processing on both the app and the external device sides is very quick and shouldn't cause a delay anywhere near this size, which leaves the Bluetooth processing as the likely cause.

I have ensured 2M PHY is used and I am using Write without response/notify.

Any help you can provide on both of these matters would be incredibly appreciated.

Regards.

jro
Associate III
December 7, 2021

Hi @OCatt.1,

I don't recall actually having had an issue, apart from having to replace the weak definitions of [s]hci_cmd_resp_wait() and [s]hci_cmd_resp_release(). This really only helped me free up the CPU so other FreeRTOS tasks could run, I don't think it'll help your problem. As far as I can see, so long as you do a well-formed call which ends up in hci_send_req(), it should return very quickly, even if the result is an error.

When you say you have to break out of the outermost while() loop, is that because hci_cmd_resp_wait() is stalled, or because it gets released but there's then nothing useful in HciCmdEventQueue? I've forgotten the details, but I seem to recall there's a bunch of tricky setup required to get the IPCC interrupt to vector into the BLE handler code. IIRC there was a weak definition somewhere which allowed the code to compile correctly while totally failing to operate... Again, doesn't sound like your issue, as you're getting things to work, albeit slowly.

From what you said above it sounds as if you might be waiting for data to be acknowledged before sending more (not totally clear, though) - I think that would definitely cause throughput problems.

I'm getting about 128kB/s on a 1Mbit/s link. I have one task reading a Flash sector (512B), then copying it 128B at a time to my Characteristic's buffer, then sending a request to the BLE task (via a FreeRTOS queue) to update the value on CPU2 from that buffer. The BLE task does that by calling aci_gatt_update_char_value(), which takes however long and may return an error if CPU2 is "too busy" (I'm vague on the details of that!). If I get an error, I just retry "immediately", within the constraints of FreeRTOS's queueing and scheduling overhead.
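
The chunk-and-retry loop described above might look roughly like this. It is a host-side sketch, not my actual code: ble_update_value() is a hypothetical stand-in for the real aci_gatt_update_char_value() call, rigged to fail on every other call to imitate a "CPU2 busy" error, and MAX_RETRIES is an invented safety bound.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE  128u   /* notification payload size used in the post */
#define MAX_RETRIES 1000u  /* hypothetical bound so a dead link cannot hang us */

/* Simulation only: records what was "sent" and how many calls were made. */
static uint32_t sim_calls;
static uint8_t  sim_received[512];
static size_t   sim_received_len;

/* Hypothetical stand-in for the BLE update call; returns 0 on success. */
static int ble_update_value(const uint8_t *data, uint8_t len)
{
    sim_calls++;
    if ((sim_calls % 2u) == 1u) {
        return 1; /* simulated "busy" error on odd-numbered calls */
    }
    if (sim_received_len + len > sizeof sim_received) {
        return 1; /* simulated buffer full */
    }
    memcpy(&sim_received[sim_received_len], data, len);
    sim_received_len += len;
    return 0;
}

/* Sends len bytes in CHUNK_SIZE pieces, retrying each piece immediately on
 * error. Returns the number of bytes accepted. */
size_t send_sector(const uint8_t *sector, size_t len)
{
    size_t sent = 0;

    while (sent < len) {
        size_t n = len - sent;
        uint32_t tries = 0;

        if (n > CHUNK_SIZE) {
            n = CHUNK_SIZE;
        }
        while (ble_update_value(&sector[sent], (uint8_t)n) != 0) {
            if (++tries >= MAX_RETRIES) {
                return sent; /* give up; caller sees a short count */
            }
        }
        sent += n;
    }
    return sent;
}
```

With the every-other-call failure pattern, a 512-byte sector takes eight calls to deliver its four chunks. Under an RTOS, the inner retry loop is also where a task would naturally yield instead of spinning.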

My target client is an iOS app, so I'm limited to a 185-byte MTU, hence the convenient choice of 128-byte notification payloads. I've done a little debug using Windows: the throughput on that is OK, though it can take an age to connect for some reason.

OCatt.1
Associate III
December 7, 2021

Hi @jro,

"When you say you have to break out of the outermost while() loop, is that because hci_cmd_resp_wait() is stalled, or because it gets released but there's then nothing useful in HciCmdEventQueue?"

The situation I'm getting appears to be that there is nothing useful in HciCmdEventQueue, thus causing the while loop to go infinite.

"I seem to recall there's a bunch of tricky setup required to get the IPCC interrupt to vector into the BLE handler code."

Thank you for letting me know of this, I will give this area a look to see if it could allow me to at least not have to brute force out of the loop!

"From what you said above it sounds as if you might be waiting for data to be acknowledged before sending more (not totally clear, though) - I think that would definitely cause throughput problems."

I did have a feeling that this was the case, however I was unsure as to why it would be as Write without response was being used?

On the off chance it provides any information, when sniffing the Bluetooth packets during a transfer I see the following sequence:

Direction, Protocol, Length, Description, Time diff from previous packet

  • PC->Module, ATT, 256, Write Command, +0
  • Module->PC, HCI_EVT, 8, Number of completed packets, ~+50ms
  • Module->PC, HCI_ACL, 32, [Reassembled in last packet], ~+80ms (+3ms to +9ms between each of these, varying)

I believe this to be some data (or Ack) when the module changes the char value, it is repeated 9 times per sent packet, with the final item in the sequence being where it is reassembled.

  • Module->PC, ATT, 16, Handle Value Notification, ~+1ms (~+120ms from HCI_EVT packet)

Following this final packet, the next Write Command occurs (next packet sent), following an ~+45ms delay from the Handle Value Notification packet.

I am unsure where I ought to be looking for the delays, as there appears to be a significant delay in three places: in the packet from the app being received by the module; in the Ack being sent by the module to the app (serial comms between the module and the connected device take <5ms); and between the Ack packet and the following packet being sent. The app fires the next packet as soon as the "Value changed" event occurs (C# app using "WriteValueAsync(Buffer, GattWriteOption.WriteWithoutResponse);"). This makes it very hard to pinpoint a specific place to look.

I really appreciate the help you've provided so far, thank you a lot JRO!

jro
Associate III
December 7, 2021

Definitely worth chasing why HciCmdEventQueue has nothing (useful) in it - there is only one place I can find that hci_cmd_resp_release() is called, which is in TlEvtReceived() and there's something put in HciCmdEventQueue there!

Have you done an aci_gatt_exchange_config() after establishing the connection? Re-assembling 9 packets sounds as if the MTU is still at 23 bytes, so your data is getting fragmented for transmission. CFG_BLE_MAX_ATT_MTU is set in app_conf.h: mine's 156, which seems to be the stack default. You may still get fragmentation as 253 bytes don't fit into 156, and even if you set an MTU of >=253 bytes it may not be accepted; but it's worth checking.
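
To put rough numbers on the MTU point: an ATT notification carries a 3-byte header (1-byte opcode plus 2-byte attribute handle), so the usable payload is the ATT MTU minus 3. The helper below is a hypothetical illustration, not part of the ST stack; it only counts how many notifications an application would need if it split the data itself, and says nothing about how the stack or link layer fragments a single oversized transfer.

```c
#include <stdint.h>

#define ATT_NOTIFY_HEADER 3u /* 1-byte opcode + 2-byte attribute handle */

/* Number of notifications needed to carry data_len bytes at a given ATT MTU.
 * Returns 0 if the MTU leaves no room for payload. */
uint32_t notifications_needed(uint32_t data_len, uint32_t att_mtu)
{
    uint32_t payload;

    if (att_mtu <= ATT_NOTIFY_HEADER) {
        return 0;
    }
    payload = att_mtu - ATT_NOTIFY_HEADER;
    return (data_len + payload - 1u) / payload; /* ceiling division */
}
```

For the 253-byte packets in question this gives 13 notifications at the default MTU of 23, 2 at an MTU of 156, and 1 only once the MTU reaches 256.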

I don't think a long Connection Interval is relevant, but that might be worth playing with, just to prove it's not the blocker.

OCatt.1
Associate III
December 8, 2021

Hi @jro,

Thanks for that insight, it's actually led to me finding something, though not a solution.

I tested calling the aci_gatt_update_char_value() immediately when the packet sent by the App is received by the module (within STM_Event_Handler()) rather than sending through the serial and waiting for a serial response, and when this occurs the event works perfectly fine and the process functions correctly. This suggests that the issue is lying in the fact that aci_gatt_update_char_value() is called within the serial interrupt callback (UART_RxCpltCallback() called via HW_UART_Receive_IT()).

Could it be that for some reason when called within this callback, the event is blocked or cut off before occurring? If this is the case, why has every piece of example software provided by ST that I've seen using this sort of feature got this process occurring this way with the call occurring within this callback?

To try to overcome this, I had the callback set a flag, with an if statement checking this flag within the infinite while loop in main. However, this didn't work (the function was seemingly never called), and I'm unsure whether this while loop is active in the way it is in other projects. Would this be where the Sequencer is used instead?

I'm not sure on how to use this sequencer as of yet, assuming just a call to the UTIL_SEQ_SetTask()?

I've checked CFG_BLE_MAX_ATT_MTU, which is set to 300, possibly causing it not to be accepted? I'll check different values.

I call aci_gatt_exchange_config() within the HCI_LE_CONNECTION_COMPLETE_SUBEVT_CODE event in SVCCTL_App_Notification(), so I'd assume the MTU would update. I also tested (in the current state) altering BLE_DEFAULT_ATT_MTU within ble_bufsize.h, which had no effect on the fragmentation that occurred.

Having used CubeMX, I have FAST_CONN_ADV_INTERVAL to 400 for Min/Max, LP_CONN_ADV_INTERVAL to 600, MAX_CONN_EVENT_LENGTH is also at 0xFFFFFFFF.

jro
Associate III
December 9, 2021

I suspect calling aci_gatt_update_char_value() in an ISR is likely to cause problems - certainly not something I do, but then with FreeRTOS I don't need to... I assume your flag is marked volatile so the foreground code doesn't optimise away the check.
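
A minimal host-side sketch of the flag approach, with the BLE call moved out of the interrupt. All names are hypothetical: uart_rx_complete_isr() stands for whatever the UART receive-complete callback is, ble_update_value() stands in for aci_gatt_update_char_value(), and ble_service_pending_rx() is what the foreground would call on every pass.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RX_MAX 253u

/* Shared between "interrupt" and foreground, hence volatile on the flags. */
static uint8_t rx_copy[RX_MAX];
static volatile uint16_t rx_len;
static volatile bool rx_pending;

/* Simulation only: records calls to the fake BLE update. */
static uint32_t sim_updates;
static uint16_t sim_last_len;

/* Hypothetical stand-in for the BLE characteristic update call. */
static void ble_update_value(const uint8_t *data, uint16_t len)
{
    (void)data;
    sim_updates++;
    sim_last_len = len;
}

/* Interrupt context: copy the data and raise a flag, nothing else. */
void uart_rx_complete_isr(const uint8_t *data, uint16_t len)
{
    if (len > RX_MAX) {
        len = RX_MAX;
    }
    memcpy(rx_copy, data, len);
    rx_len = len;
    rx_pending = true;
}

/* Foreground context: perform the BLE call if the ISR left work behind.
 * Single buffer only, so a second reception during the update would
 * overwrite the data; a real design may need a queue. */
bool ble_service_pending_rx(void)
{
    if (!rx_pending) {
        return false;
    }
    rx_pending = false;
    ble_update_value(rx_copy, rx_len);
    return true;
}
```

On a sequencer-based project the equivalent would presumably be to register the foreground function as a sequencer task and have the callback only call UTIL_SEQ_SetTask(), but I have not verified that against the ST examples.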

TBH I've not really looked at the ST examples and use of the sequencer in any detail, so can't really comment on how they get their examples to work. I thought their sequencer ran entirely in the foreground, and just flagged some "tasks" as stalled if e.g. the BLE stack was processing a request; the interrupt then clears the flag when the request is completed. But I could be wrong.

Don't think the advertising interval is anything to do with the connection interval.

aci_gatt_exchange_config() returns the agreed MTU somewhere in its response structure, so it should be easy enough to check. There are points in the code where although a packet could be >255 bytes, the length parameter is a uint8_t - hopefully that isn't causing the fragmentation! Easy to set 150, anyway, which should be acceptable to most clients.
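
To illustrate the uint8_t concern in isolation (this is plain C behaviour, not a claim about any particular ST function): a length above 255 passed through a uint8_t parameter wraps modulo 256. The two helper names below are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* What a uint8_t length parameter actually receives for a given request. */
uint8_t as_u8_length(size_t requested)
{
    return (uint8_t)requested; /* conversion wraps modulo 256 */
}

/* Guard to apply before handing a length to a uint8_t parameter. */
bool fits_u8_length(size_t requested)
{
    return requested <= UINT8_MAX;
}
```

So a 300-byte request would arrive as 44, while 253 bytes still fits unchanged.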