STM32 + Quectel BG95 Modem: System hangs after ~60 days - UART Idle Line setup never returns

GR88_gregni
Associate II

Background

We have a battery-powered device using:

  • MCU: STM32 (STM32L4 series)
  • Modem: Quectel BG95 (cellular)
  • Communication: UART with Idle Line Interrupt (HAL_UARTEx_ReceiveToIdle_IT)
  • Power mode: Device goes to sleep periodically, wakes up to communicate with modem

Architecture Pattern

Our communication pattern:

  1. Device wakes from low-power mode
  2. Call low-level AT command function that:
    • Enables UART RX Interrupt with Idle Line detection (HAL_UARTEx_ReceiveToIdle_IT)
    • Sends AT command
    • Waits for response with timeout
    • Disables interrupt (HAL_UART_Abort_IT)
  3. Process response
  4. Go back to sleep

This means we re-enable/disable UART Idle Line interrupt on every transaction (potentially hundreds of times per day).

UART + Ring Buffer Setup:

#define RING_BUFFER_SIZE 2048
#define ISR_BUFFER_SIZE  1024

typedef struct {
    lwrb_t rb;                              // lwrb ring buffer
    volatile bool rx_data_ready_flag;       // Flag set by ISR
    uint8_t rb_buffer[RING_BUFFER_SIZE];    // Ring buffer storage
    uint8_t isr_buffer[ISR_BUFFER_SIZE];    // Intermediate buffer for Idle Line ISR
} uart_rb_t;

uart_rb_t modem_rb;

 

ISR Callback:

void HAL_UARTEx_RxEventCallback(UART_HandleTypeDef *huart, uint16_t Size) {
    if (huart == &MODEM_UART) {
        // Save data from ISR buffer to ring buffer
        lwrb_write(&modem_rb.rb, modem_rb.isr_buffer, Size);
        modem_rb.rx_data_ready_flag = true;
        
        // Re-enable Idle Line interrupt
        int retries = 10;
        do {
            if (HAL_UARTEx_ReceiveToIdle_IT(&MODEM_UART, 
                                            modem_rb.isr_buffer,
                                            sizeof(modem_rb.isr_buffer)) == HAL_OK) {
                break;
            }
            retries--;
        } while (retries > 0);
    }
}

 

The Problem

After ~60 days of continuous operation, the device completely froze with no recovery.

Symptoms:

  • System printed the last log line: >>StartParse >>>>
  • Then complete silence - no further output
  • No HardFault triggered (we have handler with reset + logging - it was never called)
  • Device required power cycle

Last logs before hang:

<<EndParse <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>StartParse >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
"AT+QISTATE=1,1" --> [response received OK]

<<EndParse <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>StartParse >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
"AT+QISTATE=1,1" --> [response received OK]

<<EndParse <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>StartParse >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
"AT+QISTATE=1,1" --> [response received OK]

<<EndParse <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>StartParse >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[SYSTEM FROZE HERE - no further output]

Notice: The command string was not printed in the last call, suggesting the code hung before that point.

Code Structure (simplified)

Parent function:

uint32_t tick = HAL_GetTick();
do {
    return_code = modem_send_command_wait_parse_result(
        "AT+QISTATE=1,1", 
        "+QISTATE:", 
        /* parsing params */,
        300 /* timeout ms */
    );
    
    if (condition_met) break;
    HAL_Delay(100);
    
} while (HAL_GetTick() - tick < 15000);  // 15 second outer timeout

 

Low-level function structure:

int modem_send_command_wait_parse_result(..., int timeout, ...) {
    // Local buffers
    char formatted_command[1024] = {0};
    char buffer_final[2048] = {0};
    unsigned int total_bytes = 0;
    
    printf(">>StartParse >>>>\n");
    if (command_to_send != NULL) {
        printf("\"%s\" --> ", command_to_send);
    }
    
    // Enable UART RX Interrupt with Idle-line detection
    int retries = 10;
    do {
        if (HAL_UARTEx_ReceiveToIdle_IT(&MODEM_UART, 
                                        modem_rb.isr_buffer,
                                        sizeof(modem_rb.isr_buffer)) == HAL_OK) {
            break;
        }
        retries--;
    } while (retries > 0);
    
    if (retries == 0) {
        return_code = -1;
    }
    
    uint32_t tick = HAL_GetTick();
    
    // Main receive loop with timeout
    while (return_code > 0 && ((HAL_GetTick() - tick) < timeout)) {
        // Check flag and read from ring buffer
        if (modem_rb.rx_data_ready_flag) {
            modem_rb.rx_data_ready_flag = false;
            
            int bytes = lwrb_read(&modem_rb.rb, 
                                  &buffer_final[total_bytes],
                                  sizeof(buffer_final) - total_bytes);
            total_bytes += bytes;
        }
        
        // Parse response, check for expected strings, etc.
        // ...
    }
    
    // Cleanup
    HAL_UART_Abort_IT(&MODEM_UART);
    lwrb_reset(&modem_rb.rb);
    
    return return_code;
}

 

My Questions

  1. Is re-enabling UART Idle Line interrupt on every transaction a valid approach? Could repeatedly calling HAL_UARTEx_ReceiveToIdle_IT (with 1KB buffer) → HAL_UART_Abort_IT → HAL_UARTEx_ReceiveToIdle_IT (hundreds of times over 60 days) cause UART peripheral corruption?
  2. In the low-level function, the retry loop that enables Idle Line reception sometimes gets a return value other than HAL_OK (HAL_BUSY or HAL_ERROR). Is this expected behavior, or does it indicate underlying UART state problems that could accumulate over time?
  3. Race condition in flag handling? The rx_data_ready_flag is set in the ISR and cleared in the main loop without atomic operations. Could this cause issues?
// Main loop (non-atomic):
   if (modem_rb.rx_data_ready_flag) {        // Read
       modem_rb.rx_data_ready_flag = false;  // Write - ISR could interrupt here!
   }
  4. Could printf() cause the hang? We use printf extensively for debugging over a separate UART. Could a printf buffer overflow or blocking UART TX cause the system to freeze without triggering exceptions?
  5. HAL_GetTick() overflow handling: After about 49.7 days, HAL_GetTick() wraps around. Our timeout check is (HAL_GetTick() - tick) < timeout. Is this safe with overflow?
  6. Ring buffer overflow: If the 2KB ring buffer fills up and data is lost, could this cause the expected response string to never arrive, leading to a timeout? Though this should be caught by the timeout logic...
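On the flag question: as far as the simplified code shows, the flag is cleared before the ring buffer is read, so an interrupt landing between the two statements should at worst cause one extra empty read rather than lost data. If the shared read-then-write pattern is still a concern, one alternative is to give each side its own variable so that neither context ever writes what the other writes. The sketch below is only an illustration of that idea; rx_event_counter_t, rx_event_signal and rx_event_take are hypothetical names, not part of HAL or lwrb, and it assumes a single ISR producer and a 32-bit core where aligned 32-bit loads and stores are atomic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-producer / single-consumer event counter.
 * 'produced' is written only by the ISR, 'consumed' only by the main loop,
 * so there is no variable that both contexts write. */
typedef struct {
    volatile uint32_t produced;  /* incremented in the RX event callback */
    uint32_t consumed;           /* last value seen by the main loop */
} rx_event_counter_t;

/* Call from the ISR (e.g. inside the RX event callback). */
static void rx_event_signal(rx_event_counter_t *c)
{
    c->produced++;
}

/* Call from the main loop. Returns true if at least one event has been
 * signalled since the previous call that returned true. */
static bool rx_event_take(rx_event_counter_t *c)
{
    uint32_t snapshot = c->produced;  /* one aligned 32-bit load */
    if (snapshot == c->consumed) {
        return false;
    }
    c->consumed = snapshot;           /* events after the snapshot are kept */
    return true;
}
```

An event signalled after the snapshot is taken leaves produced different from consumed, so the next call to rx_event_take still reports it. Polling the ring buffer fill level directly instead of using any flag would be another option.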

Request for advice:

  • Are we using UART Idle Line correctly for this use case (repeated enable/disable)?
  • Is the flag handling race condition a real concern, or is it benign?
  • Any known issues with long-term UART peripheral usage on STM32?
  • Could the OpenLog dev board that we have connected to our device to collect logs onto an SD card be the issue?
9 REPLIES
TDK
Super User

You should be able to attach a debugger without resetting the device to examine its state.

There is nothing inherent to the device which stops working after X days. It has to be a bug in the code somewhere.

1 ms * 2^32 is 49.7 days. It's possible you have an issue with a timeout that uses SysTick, but the HAL functions handle this overflow correctly.
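To make the wraparound point concrete, here is a minimal sketch of the two common ways to write a tick timeout. The function names are made up for illustration. The first form, which matches the (HAL_GetTick() - tick) < timeout pattern in the original post, relies on unsigned 32-bit subtraction being modulo 2^32 and so survives a single wrap. The second form compares against an absolute deadline and misbehaves near the wrap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe: elapsed time is computed by unsigned subtraction, which is
 * modulo 2^32, so the result is correct even if 'now' has wrapped past 0
 * (as long as the real elapsed time is below 2^32 ms). */
static bool timeout_elapsed(uint32_t now, uint32_t start_tick, uint32_t timeout_ms)
{
    return (uint32_t)(now - start_tick) >= timeout_ms;
}

/* Shown for contrast only: an absolute deadline. If start_tick + timeout_ms
 * overflows, the deadline becomes a small number and the check fires
 * immediately. */
static bool timeout_elapsed_unsafe(uint32_t now, uint32_t start_tick, uint32_t timeout_ms)
{
    return now >= (uint32_t)(start_tick + timeout_ms);
}
```

With start_tick = 0xFFFFFF00 and a 300 ms timeout, the first function reports false at 0xFFFFFF10 and true at 0x00000050, while the second already reports true at 0xFFFFFF10. Note that in the posted code timeout is declared as int; the comparison still works for positive values because it is converted to unsigned, but declaring it uint32_t would avoid surprises.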

Andrew Neil
Super User

Is this just one isolated instance on one particular device, or are you seeing many such occurrences?

A complex system that works is invariably found to have evolved from a simple system that worked.
A complex system designed from scratch never works and cannot be patched up to make it work.
Andrew Neil
Super User

@GR88_gregni wrote:
  • Are we using UART Idle Line correctly for this use case (repeated enable/disable)?

So what, exactly, is the purpose of the Idle Line interrupt in your system?

You seem to be just doing AT commands - that can be (usually is?) done without Idle Line detection...


This is the first occurrence we've seen. We have 3 devices in the field for 2 months, and this is the only one that froze after 60 days. However, we're concerned it may be a latent bug that will affect others.

Andrew Neil
Super User

Did you resolve your Long Blocking Operations Dilemma?

Having very long blocking delays sounds risky...

 


We use Idle Line interrupt to detect when the BG95 modem has finished transmitting its response, since AT command responses are variable-length and we don't know in advance how many bytes to expect.

Yes I have resolved that.

Then please feed-back in that thread and mark the solution.


@GR88_gregni wrote:

AT command responses are variable-length and we don't know in advance how many bytes to expect.


But they have well-defined termination criteria.

The usual approach is to look for the termination.
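As a rough sketch of what "looking for the termination" could mean in code: scan the accumulated response for a final result code instead of relying on line idle. The names at_result_t and at_check_final are hypothetical, and the terminators checked here (OK, ERROR, +CME ERROR:) are assumed to be the usual verbose-mode final result codes; the modem's AT manual should be checked for the full set it can emit. The buffer is assumed to be NUL-terminated.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { AT_PENDING, AT_OK, AT_ERROR } at_result_t;

/* Inspect a NUL-terminated response accumulated so far and report whether
 * a final result code has arrived yet. */
static at_result_t at_check_final(const char *resp)
{
    if (strstr(resp, "\r\nOK\r\n") != NULL) {
        return AT_OK;
    }
    if (strstr(resp, "\r\nERROR\r\n") != NULL) {
        return AT_ERROR;
    }
    /* "+CME ERROR: <n>" is only complete once its closing CR LF is in. */
    const char *cme = strstr(resp, "\r\n+CME ERROR:");
    if (cme != NULL && strstr(cme + 2, "\r\n") != NULL) {
        return AT_ERROR;
    }
    return AT_PENDING;
}
```

Called after each chunk is appended to the receive buffer, this lets the wait loop exit as soon as the response is complete, with the timeout kept as a fallback. A plain substring search can be fooled if the payload itself contains one of these sequences, so a real parser would likely work line by line.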
