Reliably reading the CAN bus Rx FIFO overrun flag

Diez.R. · ‎2020-09-04

Hi all:

I have an STM32F407 and I am trying to implement a CAN bus software driver.

I want to know if there has been a Rx FIFO overrun, which would mean that the firmware is too slow and is not seeing all received messages. Each hardware Rx FIFO can only hold up to 3 messages.

I know that there is an address (CAN message ID) filter I could use, but an Rx overrun is always possible (if undesirable). I have implemented a software queue in interrupt context, but an Rx overrun could still happen (this is a busy board).

I thought that this would be an easy task: just enable the overrun interrupt flag, and check the flag like this:

extern "C" void CAN2_RX0_IRQHandler ( void )
{
  // The code below has been simplified for this example.

  // Drain every message currently pending in FIFO 0.
  while ( 0 != CAN_MessagePending( CAN2, CAN_FIFO0 ) )
  {
    CanRxMsg rxMsg;
    CAN_Receive( CAN2, CAN_FIFO0, &rxMsg );
    EnqueueRxMessage( &rxMsg );
  }

  // Record the overrun condition and clear the flag.
  if ( CAN2->RF0R & CAN_RF0R_FOVR0 )
  {
    s_rxFifoOverrun = true;
    CAN2->RF0R |= CAN_RF0R_FOVR0;
  }
}

Instead of manually reading the overrun flag, I could use CAN_GetFlagStatus( CANx, CAN_FLAG_FOV0 ) or CAN_GetITStatus( CANx, CAN_IT_FOV0 ). In fact, both of these routines seem to access the same flag, at least on my STM32F407.

Instead of manually clearing the overrun flag, I could use CAN_ClearFlag( CANx, CAN_FLAG_FOV0 ) or CAN_ClearITPendingBit( CANx, CAN_IT_FOV0 ). Again, both of these routines seem to access the same flag, at least on my STM32F407.

The trouble is that reading a message from the Rx FIFO clears the Rx overrun flag (CAN_RF0R_FOVR0). This happens when CAN_Receive() does the following:

CANx->RF0R |= CAN_RF0R_RFOM0;

I generated an overrun condition for test purposes with these steps:

1) Disable the "FIFO 0 message pending Interrupt" (CAN_IT_FMP0) and the "FIFO 0 overrun Interrupt" (CAN_IT_FOV0, aka FOVIE0).

2) Send myself 4 CAN bus messages.

3) Enable those 2 interrupts.

I looked at the latest reference manual (RM0090), and there is a diagram called "Figure 341. Receive FIFO states". This diagram seems to suggest that you need to release the FIFO message 4 times: once to get the overrun flag, and 3 times to get the 3 Rx messages. But that does not seem to be the case, because the code above receives only 3 messages. Besides, the bits FMP0[1:0] ("FIFO 0 message pending") can only hold a value between 0 and 3, which becomes clear if you look at the FMP value in the diagram.

So I thought: no problem, I can check the overrun bit first, and then read the rx message. Then check the overrun flag again, and read the next rx message.

But is there not a window of opportunity there, where the overrun flag can get lost? I mean the following:

1) My code checks the overrun flag, and it is not set at that moment.

2) The hardware receives a CAN bus message and sets the flag. But it is now too late.

3) My code reads the message with CAN_Receive(). Then the overrun flag is cleared (and lost).

Say that I manually read the message first, without using CAN_Receive(), then check the overrun flag, and finally drop the Rx message with "CANx->RF0R |= CAN_RF0R_RFOM0". The window of opportunity got smaller, but is still there, isn't it?

Can anybody help me with this matter? I searched around to no avail.

Thanks in advance,

rdiez

MattKefford · ‎2021-02-24

I see this is from a few months ago so do you still need help with this or have you figured out a solution?

What I would do is enable the overrun interrupt by adding CAN_IT_RX_FIFO0_OVERRUN to your flags when you call HAL_CAN_ActivateNotification().

When an overrun occurs it will call the error callback and you can check the error, receive CAN messages and clear the error. As soon as you read one message, the overrun flag will be cleared but you should read all three messages and empty the FIFO.

Take this with a pinch of salt as this isn't my actual code, just an example:

void HAL_CAN_ErrorCallback(CAN_HandleTypeDef *hcan)
{
   uint32_t error = hcan->ErrorCode;
 
   if(error & HAL_CAN_ERROR_RX_FOV0)
   {
        HAL_CAN_RxFifo0MsgPendingCallback(hcan); /* hcan is already a pointer */
        hcan->ErrorCode &= ~(HAL_CAN_ERROR_RX_FOV0);
   }
}

Alternatively, you could enable the CAN_IT_RX_FIFO0_FULL interrupt and then use HAL_CAN_RxFifo0FullCallback() to do the same thing.

Diez.R. · ‎2021-05-26

Your hint does not really seem to close the window of opportunity I described above.

MattKefford · ‎2021-05-26

I'm sorry you didn't find my suggestion helpful. Let's see if we can get it figured out then.

Yes, using the approach you suggested, that window of opportunity is possible. It's a race condition: you have a check and then an action, with a short but finite time in between during which the situation could change. You could use locks to get around this. I guess in this situation you'd stop the CAN peripheral, do your checking and reading, then start the CAN peripheral again. https://stackoverflow.com/questions/34510/what-is-a-race-condition

What I suggested is a different approach and is asynchronous. You wouldn't be manually poking the status register every now and then to see if the overrun flag is set, but assigning a callback so that as soon as the overrun occurs you can act on it immediately. Where is the window here, because there is no check and then action, just the action?

I guess it begs the question though, what do you want to do when there is an overrun? You've already lost data at that point so all you can do is read the messages you have received and get ready for new messages. With automatic re-transmission the message should be sent again anyway. In your approach/situation you can check if there is an overrun, then check to see how many messages are currently in the FIFO using HAL_CAN_GetRxFifoFillLevel() and then read that many messages. If it is three messages then you're going to read all three and clear the overrun flag anyway so what's the use in polling for the overrun flag first? It is just an unnecessary step really. Better to let the receive function just do receiving and have separate code for error handling.

By the way, I wouldn't advise to just clear the error flag as my example above showed. If there's an overrun you should use the callback to read the three FIFO messages into your buffer.

Diez.R. · ‎2021-05-26

OK, let's go step by step. Let's start with the easiest point:

> I guess it begs the question though, what do you want to do when there is an overrun?

> You've already lost data at that point so all you can do is read the messages

> you have received and get ready for new messages.

It is important that either the transmitter or the receiver knows that there has been an overrun, so that the user can be informed that something is not right. Maybe the device cannot cope with the load, in which case the operator needs to know it. Maybe the device is operating outside its specs. Or whatever. This may be a real-time industrial control application, and you must report any overruns, in case they are an important symptom.

In any case, the device cannot ignore or miss an overrun condition. It must know that it has lost data. A possible outcome could be to stop what it is doing, and trigger a "resync" operation.

At the moment, I am trying to reliably find out on the device/slave side whether there has been an overrun, because it is not clear to me yet whether it is really possible with the STM32 hardware.

> With automatic re-transmission the message should be sent again anyway.

> [...]

That implies that the sender will know that the message has not been received. I am not an expert in CAN bus or in STM32. Can you confirm that this is the case?

I have noticed that there is a configuration setting for overruns:

CAN_MCR register, bit RFLM: Receive FIFO Locked Mode.

0: Receive FIFO not locked on overrun. Once a receive FIFO is full the next incoming message will overwrite the previous one.

1: Receive FIFO locked against overrun. Once a receive FIFO is full the next incoming message will be discarded.

If you set it to '0', the STM32 CAN bus controller on the device/slave side will probably report the message as "received", by setting the ACK slot in the CAN frame to dominant.

But if you set it to '1', the controller could conceivably stop acknowledging the frame, because it is going to get dropped. Do you know whether that is the case?

I will be dealing with the other points and doing some more tests later.

Regards,

rdiez

Diez.R. · ‎2021-05-26

I just tested with the RFLM bit in the CAN_MCR register set to 1 (discard), and the CAN frame gets lost.

My Linux box is not retransmitting in this case, but it seems to retransmit if the cable is disconnected before transmission and reconnected afterwards.

So my guess is that the STM32 CAN bus controller always acknowledges the frame, even if RFLM is 1 and the frame is then dropped before the firmware can read it.

MattKefford · ‎2021-05-26

Well if you want to be aware of an overrun (or any error) then the way I suggested will work. You could toggle a pin and view it on a scope for a quick test or set a flag, send a serial message, set an error LED etc. to show that an overrun occurred.

I tested it in a similar way to you, by sending CAN messages to my board every 500 ms. I set a breakpoint in the callback, paused my code for a couple of seconds, then resumed running, and it hit my breakpoint. Looking into the SFRs I could see the FIFO count was 3 and the overrun bit was set.

What I did for peace of mind was add the HAL_CAN_ErrorCallback() function to my CAN driver module, put in a load of if statements checking all of the possible errors, and then enable all the interrupts so I could see which errors were occurring. These are the possible errors:

//HAL_CAN_ERROR_NONE
//HAL_CAN_ERROR_EWG
//HAL_CAN_ERROR_EPV
//HAL_CAN_ERROR_BOF
//HAL_CAN_ERROR_STF
//HAL_CAN_ERROR_FOR
//HAL_CAN_ERROR_ACK
//HAL_CAN_ERROR_BR
//HAL_CAN_ERROR_BD
//HAL_CAN_ERROR_CRC
//HAL_CAN_ERROR_RX_FOV0
//HAL_CAN_ERROR_RX_FOV1
//HAL_CAN_ERROR_TX_ALST0
//HAL_CAN_ERROR_TX_TERR0
//HAL_CAN_ERROR_TX_ALST1
//HAL_CAN_ERROR_TX_TERR1
//HAL_CAN_ERROR_TX_ALST2
//HAL_CAN_ERROR_TX_TERR2
//HAL_CAN_ERROR_TIMEOUT
//HAL_CAN_ERROR_NOT_INITIALIZED
//HAL_CAN_ERROR_NOT_READY
//HAL_CAN_ERROR_NOT_STARTED
//HAL_CAN_ERROR_PARAM

When I mentioned retransmission I was assuming you were going STM32 to STM32. Unfortunately, I can't speak for going Linux to STM32.

Diez.R. · ‎2021-05-26

I looked further, and I cannot configure the other error interrupt CAN2_SCE_IRQHandler (STATUS CHANGE/ERROR INTERRUPT) to trigger on overrun, because FOVR0 / FOVIE0 (FIFO overrun interrupt enable) can only trigger interrupt CAN2_RX0_IRQHandler (FIFO0 INTERRUPT), according to "Figure 348. Event flags and interrupt generation" in document RM0090.

Therefore, it looks like the current STM32 CAN bus hardware will not allow the firmware to reliably read the overrun error condition. There is always going to be a window of opportunity for an overrun to happen just before releasing the just-read FIFO entry. Such an overrun will then go unnoticed.

I hope I am wrong and someone can correct me.

MattKefford · ‎2021-05-26

Have a look in CAN2_RX0_IRQHandler() and see if you can see CAN_IT_RX_FIFO0_OVERRUN in there and if it is handled.

In my ST HAL code, the interrupts are handled by HAL_CAN_IRQHandler() which checks if the overrun has occurred, sets the errorcode and calls HAL_CAN_ErrorCallback(hcan). I would've thought there'd be something similar in your HAL code even though we are on different devices.

Diez.R. · ‎2021-05-26

Hi MKeff.1:

I am not using HAL, but the STM32 Standard Peripheral Library.

I wrote CAN2_RX0_IRQHandler() myself. It is not checking for the equivalent of CAN_IT_RX_FIFO0_OVERRUN because I know it has been set.

My CAN2_RX0_IRQHandler() is always checking for the overrun condition with if ( CAN2->RF0R & CAN_RF0R_FOVR0 ), and that is working fine.

This is how I verified this "lost overrun flag" problem:

1) I placed a breakpoint at CAN_Receive() inside the Standard Peripheral Library, at this line:

CANx->RF0R |= CAN_RF0R_RFOM0;

That is the statement at the end of that routine that drops the FIFO entry after having read its contents.

2) I sent a single CAN frame from my Linux box to my device, so that I hit the breakpoint.

3) I inspected the overrun flag in GDB like this:

// Equivalent to: print CAN2->RF0R & CAN_RF0R_FOVR0

print ((CAN_TypeDef *) (0x40000000 + 0x6800))->RF0R & 0x10

It was 0.

4) While the firmware was still stopped at the breakpoint, I sent 3 other CAN frames from my Linux box.

That is one too many.

5) I inspected the overrun flag in GDB again.

This time, it was 16. So the overrun flag had just been set by the hardware.

6) I stepped once over the line with the breakpoint.

7) I inspected the overrun flag in GDB again.

It was 0. Therefore, the overrun condition has been lost.

8) I let the firmware run, and hit the breakpoint twice again for the 2 remaining frames in the FIFO.

However, the overrun flag was no longer set, so it had been forever lost.

You can probably reproduce this sequence of events with HAL in a similar way.

Regards,

rdiez