2018-07-27 06:50 PM
Hi,
We are sending confirmed messages. Every once in a while we do not receive a confirmation from the gateway, which is fine. However, sometimes the expected callback in the ST LoRaWAN stack is not triggered. I expect the missing confirmation to trigger the MCPS_CONFIRMED case with (mcpsConfirm->Status == LORAMAC_EVENT_INFO_STATUS_RX2_TIMEOUT) in McpsConfirm() in lora.c. We need this to be triggered so that we can take appropriate action when an acknowledgement/confirmation is not received.
NbTrials is set to 2 and I am sure that it has tried twice, yet the RX2_TIMEOUT case is not triggered for several minutes.
Does an RX2 timeout for a confirmed message sometimes show up in a different callback? Any thoughts or suggestions?
Thank you.
2018-08-02 11:20 AM
Hi, has anyone else experienced this issue? Any suggestions or tips would be very helpful! Thank you.
2018-09-03 06:21 AM
Hello,
I am not sure if this is the same problem you are seeing, but there appears to be a bug in the LoRaWAN stack that I am using (LoRa software expansion for STM32Cube V1.1.5). It is triggered by a confirmed message request timing out and can leave the stack unusable until it is reset.
Although the problem was first encountered in our application project, I verified that it also occurs with the AT command line slave project shipped with the package, which makes it easy to reproduce.
The following sequence of steps reproduces the problem:
1. AT+NJM=1 to enable OTAA join mode.
2. AT+CFM=1 to enable confirmed messages.
3. AT+JOIN to join the network.
4. AT+NJS=? to query join status (repeat until joined).
5. Disable Gateway.
6. AT+SEND=1,1 to send message.
7. The send times out because no ACK is received.
8. Enable Gateway.
9. AT+JOIN to rejoin network.
10. AT+NJS=? to query join status (repeat until joined).
11. AT+SEND=2,2 to send message. Returns AT_ERROR.
After this, any AT+SEND or AT+JOIN commands will return AT_ERROR and will continue to do so until the stack is restarted.
This was found to be because the MAC state (LoRaMacState variable) remained at 0x0001 (LORAMAC_TX_RUNNING) instead of returning to 0x0000 (LORAMAC_IDLE), effectively locking the stack up as busy.
This in turn was traced to the AckTimeoutTimer.IsRunning flag being left at 1, which, in OnRadioRxDone(), prevents the AT+JOIN procedure from completing fully:
if( AckTimeoutTimer.IsRunning == false )
{
    // Procedure is completed when the AckTimeoutTimer is not running anymore
    LoRaMacFlags.Bits.MacDone = 1;
}
AckTimeoutTimer is not used by the join process, but it is used during confirmed sends. It was found that the failed confirmed send (step 6 above) left AckTimeoutTimer.IsRunning at 1 when the process completed. This was not an issue at the time, because OnRadioRxDone() was not called as nothing was received.
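The resulting lock-up can be modelled at MAC level roughly as follows. This is a heavily simplified, hypothetical sketch: SimMacRequest(), SimOnRadioRxDone(), SimProcessMacDone(), AckTimeoutIsRunning and MacDone are illustrative stand-ins (only the two state values come from the post), and the real LoRaMac.c does far more. It shows how a stale flag stops MacDone being set, so the state never returns to idle and the next request is rejected.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified model of the MAC-level symptom.
 * Only the two state values are taken from the post. */
#define LORAMAC_IDLE        0x0000u
#define LORAMAC_TX_RUNNING  0x0001u

static uint32_t LoRaMacState = LORAMAC_IDLE;
static bool AckTimeoutIsRunning = false; /* stands in for AckTimeoutTimer.IsRunning */
static bool MacDone = false;             /* stands in for LoRaMacFlags.Bits.MacDone */

typedef enum { SIM_OK, SIM_BUSY } SimStatus;

/* A join or send request is only accepted while the MAC is idle. */
SimStatus SimMacRequest(void)
{
    if (LoRaMacState != LORAMAC_IDLE) {
        return SIM_BUSY;                 /* surfaces as AT_ERROR */
    }
    LoRaMacState = LORAMAC_TX_RUNNING;
    return SIM_OK;
}

/* Receive path: the procedure is only marked done if the ACK timer looks idle. */
void SimOnRadioRxDone(void)
{
    if (AckTimeoutIsRunning == false) {
        MacDone = true;
    }
}

/* Completion processing: the MAC only returns to idle once MacDone is set. */
void SimProcessMacDone(void)
{
    if (MacDone) {
        MacDone = false;
        LoRaMacState = LORAMAC_IDLE;
    }
}
```

With AckTimeoutIsRunning left true by the failed confirmed send, a join request is accepted and its accept is received, but LoRaMacState stays at LORAMAC_TX_RUNNING and every later request returns busy.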
The issue looks to be with the timer implementation in timeServer.c.
When a timer is started, it is inserted into a list of active timers managed by the firmware, ordered by scheduled expiry (head of list = first to expire). Each timer has an .IsRunning flag, but rather than being set for every active timer, it appears that only the timer at the head of the list gets its flag set.
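The ordering and flag behaviour described above can be sketched as a sorted singly linked list. This is a hypothetical, simplified model, not the timeServer.c source: SimTimer, SimTimerStart() and the expiry field are illustrative names, and the hardware alarm programming is omitted. A new head takes the flag from the old head, while timers inserted further down the list are left with their flag untouched.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of the active-timer list (not the ST source). */
typedef struct SimTimer {
    uint32_t expiry;          /* absolute expiry tick */
    bool IsRunning;           /* in this model only the list head gets it set */
    struct SimTimer *Next;
} SimTimer;

static SimTimer *TimerListHead = NULL;

/* Insert obj so the list stays ordered by expiry (head = first to expire). */
void SimTimerStart(SimTimer *obj, uint32_t expiry)
{
    obj->expiry = expiry;
    obj->Next = NULL;

    if (TimerListHead == NULL || expiry < TimerListHead->expiry) {
        /* New head: clear the old head's flag and set the new head's flag. */
        if (TimerListHead != NULL) {
            TimerListHead->IsRunning = false;
        }
        obj->Next = TimerListHead;
        TimerListHead = obj;
        obj->IsRunning = true;
        return;
    }

    /* Walk to the last node that expires no later than the new timer. */
    SimTimer *cur = TimerListHead;
    while (cur->Next != NULL && cur->Next->expiry <= expiry) {
        cur = cur->Next;
    }
    obj->Next = cur->Next;
    cur->Next = obj;
    /* obj->IsRunning is not touched for a non-head insertion. */
}
```

Starting timers due at 100, 50 and 75 gives the order 50, 75, 100, with only the timer due at 50 flagged as running.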
When a timer expires, TimerIrqHandler() removes it from the list and calls its associated callback function. A timer can also be stopped before it expires by an explicit call to TimerStop(), which likewise removes it from the list.
Each timer callback function begins with a TimerStop() call for that timer, which implies that the timer has to be stopped manually (much like an interrupt flag has to be cleared by an ISR). Calling TimerStop() for a given timer searches the list of timers for it and, if found, removes it from the list and clears its .IsRunning flag. However, because the TimerIrqHandler() function that invoked the callback has already removed the timer from the list, TimerStop() effectively does nothing: the timer is no longer in the list. Crucially, removing the timer from the list in TimerIrqHandler() does not also clear its .IsRunning flag, which is why the AckTimeoutTimer’s flag remains set even though the timer has expired.
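The sequence above can be reproduced with a small model. This is a hypothetical sketch, not the ST code: SimTimer, SimTimerStop(), SimTimerIrqHandler() and AckTimeoutCallback() are illustrative stand-ins, and re-arming the hardware alarm is omitted. The expiry handler unlinks the head before running its callback, so the callback's stop call finds nothing and the flag is never cleared.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of the expiry path (not the ST source). */
typedef struct SimTimer {
    bool IsRunning;
    void (*Callback)(void);
    struct SimTimer *Next;
} SimTimer;

static SimTimer *TimerListHead = NULL;

/* Only a timer that is still found in the list is unlinked and has its flag cleared. */
void SimTimerStop(SimTimer *obj)
{
    SimTimer *prev = NULL;
    SimTimer *cur = TimerListHead;
    while (cur != NULL && cur != obj) {
        prev = cur;
        cur = cur->Next;
    }
    if (cur == NULL) {
        return;  /* not in the list: IsRunning is left as it was */
    }
    if (prev == NULL) {
        TimerListHead = cur->Next;
    } else {
        prev->Next = cur->Next;
    }
    cur->Next = NULL;
    cur->IsRunning = false;
}

/* Unlinks the expired head and runs its callback, but never clears IsRunning. */
void SimTimerIrqHandler(void)
{
    SimTimer *expired = TimerListHead;
    if (expired == NULL) {
        return;
    }
    TimerListHead = expired->Next;
    expired->Next = NULL;
    if (expired->Callback != NULL) {
        expired->Callback();
    }
}

/* Stand-in for AckTimeoutTimer; its callback starts with a stop call. */
SimTimer AckTimeoutTimer;

void AckTimeoutCallback(void)
{
    SimTimerStop(&AckTimeoutTimer);  /* no-op: the handler already unlinked it */
}
```

After a single expiry the list is empty, yet AckTimeoutTimer.IsRunning is still true.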
Although this appears fairly fundamental, the only place a timer’s .IsRunning flag is checked outside the timer functions themselves is OnRadioRxDone() in LoRaMac.c, which is why it has not been noticed more readily.
Three different fixes were tested and verified to overcome the issue:
1. Explicitly clear the timer’s .IsRunning flag in TimerStop(), regardless of whether the timer has already been removed from the list.
2. Make the existing TimerExists() function in timeServer.c global, and replace the check of AckTimeoutTimer.IsRunning in OnRadioRxDone() with a test of TimerExists(&AckTimeoutTimer).
3. Remove the deletion of the timer from the list in TimerIrqHandler().
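Fixes 1 and 2 can be illustrated on the same kind of simplified model. This is a hypothetical sketch, not a patch for timeServer.c: SimTimer, SimTimerStop(), SimTimerExists(), SimTimerIrqHandler() and AckTimeoutCallback() are illustrative names, and hardware alarm re-arming and handing the flag to a new head are omitted. SimTimerStop() clears the flag before searching the list (fix 1), and SimTimerExists() reports list membership independently of the flag (fix 2).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified timer model (not the ST timeServer.c source). */
typedef struct SimTimer {
    bool IsRunning;
    void (*Callback)(void);
    struct SimTimer *Next;
} SimTimer;

static SimTimer *TimerListHead = NULL;

/* Fix 2: true only if the timer is still linked into the active list. */
bool SimTimerExists(const SimTimer *obj)
{
    for (const SimTimer *cur = TimerListHead; cur != NULL; cur = cur->Next) {
        if (cur == obj) {
            return true;
        }
    }
    return false;
}

/* Fix 1: clear the flag first, whether or not the timer is still listed. */
void SimTimerStop(SimTimer *obj)
{
    obj->IsRunning = false;

    SimTimer *prev = NULL;
    SimTimer *cur = TimerListHead;
    while (cur != NULL && cur != obj) {
        prev = cur;
        cur = cur->Next;
    }
    if (cur == NULL) {
        return;  /* already unlinked, e.g. by the expiry handler */
    }
    if (prev == NULL) {
        TimerListHead = cur->Next;
    } else {
        prev->Next = cur->Next;
    }
    cur->Next = NULL;
}

/* Expiry path unchanged: unlink the head, then run its callback. */
void SimTimerIrqHandler(void)
{
    SimTimer *expired = TimerListHead;
    if (expired == NULL) {
        return;
    }
    TimerListHead = expired->Next;
    expired->Next = NULL;
    if (expired->Callback != NULL) {
        expired->Callback();
    }
}

/* Stand-in for AckTimeoutTimer; its callback starts with a stop call. */
SimTimer AckTimeoutTimer;

void AckTimeoutCallback(void)
{
    SimTimerStop(&AckTimeoutTimer);
}
```

With fix 1 the flag is cleared after expiry even though the handler unlinked the timer first. With fix 2, the OnRadioRxDone() check would test list membership via TimerExists(&AckTimeoutTimer) instead of the flag, so a stale flag no longer matters.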
Solution 2 carries the least risk because it addresses only this one specific instance. However, unlike the other two, it does not actually clear the AckTimeoutTimer.IsRunning flag, so I think solution 1 is the best: it clears the flag and is still low risk, since the flag should be cleared whenever the timer is stopped.
I have also recently posted this in the Q&A section of the web site in the hope that STM Technical Support will respond.