2025-12-03 4:48 AM - last edited on 2025-12-03 8:27 AM by mƎALLEm
Hello,
We are facing an intermittent CAN-FD issue in the field and would appreciate guidance from the community.
Our system has two devices on the bus (no other device on the bus) using a request–response architecture. The master sends a request every 30 ms and the slave responds. This setup is deployed in hundreds of units running continuously (24 hours a day). Out of these, around 3 to 5 units per day show acknowledgement errors, which we track through the CAN protocol error counters.
The behaviour is unusual:
• The issue appears randomly on any unit.
• No bus-off condition is ever reported.
• Despite no bus-off, communication between the two nodes temporarily stops.
• Communication recovers automatically after a few seconds without any intervention.
Initially, we suspected a physical wiring problem. We re-checked all connectors and even secured them with glue. The bus has 120-ohm termination at both ends. However, the issue still appears randomly.
Below are the system details:
Microcontroller: STM32G0B1CBT6
Baudrate: 125 kbps
CAN bus length: ~100 cm
Termination: 120 ohms at both ends
FD-CAN Core Clock: 50 MHz
ClockDivider: 1
Bitrate Switching: Disabled
AutoRetransmission: Disabled
TransmitPause: Disabled
ProtocolException: Disabled
Nominal Bit Timing:
• Prescaler = 10
• SyncJumpWidth = 8
• TimeSeg1 = 31
• TimeSeg2 = 8
Data Bit Timing (BRS disabled, same as nominal):
• Prescaler = 10
• SyncJumpWidth = 8
• TimeSeg1 = 31
• TimeSeg2 = 8
Filters:
• StdFiltersNbr = 1
• ExtFiltersNbr = 0
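As a cross-check of the nominal bit timing listed above, here is a minimal sketch that recomputes the bit rate and sample point from the clock, prescaler and segment lengths. The struct and function names are hypothetical (they are not the HAL's own types), and the sketch assumes the usual rule that one bit consists of 1 sync quantum plus TimeSeg1 plus TimeSeg2.

```c
#include <assert.h>
#include <stdint.h>

/* Nominal bit timing as entered in the FDCAN init structure.
 * Field names are illustrative, not the HAL's own struct. */
typedef struct {
    uint32_t kernel_clock_hz; /* FDCAN kernel clock after ClockDivider */
    uint32_t prescaler;       /* time-quantum prescaler */
    uint32_t time_seg1;       /* propagation + phase segment 1, in tq */
    uint32_t time_seg2;       /* phase segment 2, in tq */
} can_bit_timing_t;

/* Total time quanta per bit: 1 sync tq + TimeSeg1 + TimeSeg2. */
static uint32_t can_tq_per_bit(const can_bit_timing_t *t)
{
    return 1u + t->time_seg1 + t->time_seg2;
}

/* Resulting bit rate in bit/s. */
static uint32_t can_bit_rate(const can_bit_timing_t *t)
{
    assert(t->prescaler > 0u);
    return t->kernel_clock_hz / (t->prescaler * can_tq_per_bit(t));
}

/* Sample point in tenths of a percent (800 means 80.0 %). */
static uint32_t can_sample_point_permille(const can_bit_timing_t *t)
{
    return (1000u * (1u + t->time_seg1)) / can_tq_per_bit(t);
}
```

With the values from this post (50 MHz, prescaler 10, TimeSeg1 31, TimeSeg2 8) this gives 40 time quanta per bit, 125000 bit/s and a sample point of 80 %, so the configuration itself is consistent with the stated 125 kbps.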
Physical wiring check
• Verified connector seating and cable condition
• Applied glue to prevent vibration-related disconnection
• Confirmed correct 120-ohm termination at both ends
Error counter monitoring
• ACK errors observed in protocol error counters
• No error-warning, error-passive, or bus-off states reported
Timing verification
• Checked nominal bit timing settings
• Ensured both nodes use identical configurations
• Bitrate switching is disabled on both sides
Bus recovery logic
• Bus-off recovery is implemented
• Never triggered during these events
Environmental factors
• Units run 24×7
• Errors occur randomly across different devices and locations
Master Device
↕ (approx. 100 cm cable)
Slave Device
Termination resistors (120 ohms) are present at both ends. No other nodes are connected.
Any insights or suggestions on what could cause intermittent ACK errors without bus-off would be greatly appreciated.
2025-12-03 5:20 AM - edited 2025-12-03 6:07 AM
Hello,
What is the clock source of the FDCAN: a crystal or an internal RC oscillator? For reliable CAN communication you should definitely use an external crystal.
Please also refer to these FDCAN-related knowledge base articles:
STM32 FDCAN running at 8 Mb/s on NUCLEO boards
How to use FDCAN to create a simple communication with a basic filter
FAQ: Fixing STM32 FDCAN communication disruptions - APB bus, kernel, and time quanta clocks
2025-12-03 6:06 AM
> Our system has two devices on the bus (no other device on the bus) using a request–response architecture.
I don't think this is an appropriate description for a CAN connection.
The (standardized) CAN frame contains an ACK slot at the end, and every node receiving the message without error acknowledges it.
This is built into the CAN peripheral IP; there is no core intervention required.
> • The issue appears randomly on any unit.
This sounds like a noise / EMI issue. See below.
> • No bus-off condition is ever reported.
> • Despite no bus-off, communication between the two nodes temporarily stops.
> • Communication recovers automatically after a few seconds without any intervention.
"Bus off" is only the last stage.
Most probably the sending node goes into "error passive" mode, i.e. a "receive only" mode.
The normal mode is re-enabled after a delay.
The default error limit for the "error passive" mode is 128 (the default for "bus off" is 255).
I would recommend checking the respective error counters in your code.
These are usually the byte fields "REC" and "TEC" in the CAN->ESR (error status register).
To track the issue down, I would instrument the code to check for this condition, and toggle e.g. a GPIO if such an error occurs. This can be used as a scope / logic analyser trigger, to record the bus signals at that time.
Of course a sufficient period leading up to the event needs to be recorded.
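A minimal sketch of such instrumentation is shown below. Note that on the FDCAN peripheral of the STM32G0 (as opposed to the older bxCAN with its ESR register) the counters sit in the ECR register and the status flags in PSR. The bit positions used here follow the M_CAN register layout and should be checked against the reference manual; the type and function names are hypothetical. The decoder takes raw register values, so it can be tested off-target; on hardware the caller would read FDCAN1->ECR and FDCAN1->PSR once per poll and pass in the copies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of the FDCAN error counter (ECR) and protocol status (PSR)
 * registers. Bit positions follow the M_CAN register layout (assumption:
 * verify against the reference manual of the actual device). */
typedef struct {
    uint8_t tec;           /* transmit error counter, ECR[7:0]  */
    uint8_t rec;           /* receive error counter,  ECR[14:8] */
    uint8_t last_error;    /* last error code (LEC),  PSR[2:0]  */
    bool    error_passive; /* PSR bit 5 (EP) */
    bool    error_warning; /* PSR bit 6 (EW) */
    bool    bus_off;       /* PSR bit 7 (BO) */
} can_error_state_t;

enum { CAN_LEC_ACK_ERROR = 3 }; /* LEC value 3 = acknowledge error */

/* Split raw ECR/PSR snapshots into their individual fields. */
static can_error_state_t can_decode_errors(uint32_t ecr, uint32_t psr)
{
    can_error_state_t s;
    s.tec           = (uint8_t)(ecr & 0xFFu);
    s.rec           = (uint8_t)((ecr >> 8) & 0x7Fu);
    s.last_error    = (uint8_t)(psr & 0x7u);
    s.error_passive = ((psr >> 5) & 1u) != 0u;
    s.error_warning = ((psr >> 6) & 1u) != 0u;
    s.bus_off       = ((psr >> 7) & 1u) != 0u;
    return s;
}

/* True when the snapshot is worth a scope trigger: an ACK error was the
 * last error seen, or the node is error passive or bus-off. */
static bool can_should_trigger(const can_error_state_t *s)
{
    return s->last_error == CAN_LEC_ACK_ERROR
        || s->error_passive
        || s->bus_off;
}
```

When can_should_trigger() returns true, the firmware would toggle a spare GPIO (for example with HAL_GPIO_TogglePin) and log the counters. Reading PSR is expected to reset the LEC field, so the register should be read only once per poll and decoded from that copy.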
As mentioned above, I suspect noise or EMI issues.
Perhaps some high-energy switching event nearby, or EMI interference on the PCB.
2025-12-03 6:22 AM - edited 2025-12-03 6:22 AM
One more vote for either:
- bad clock: either HSI / RC or crystal with too high tolerance
- EMI: what's the environment?
2025-12-03 6:30 AM
> One more vote for either:
> - bad clock: either HSI / RC or crystal with too high tolerance
Yes, that could cause it as well, if the configuration is "at the edge".
That said, CAN does not seem very sensitive to crystal variations. The ECUs of my company use mostly CAN, on heavy construction machinery. There are almost no complaints about CAN issues, despite some "average" priced crystals.
By the way:
For this kind of issue (timing) a logic analyser is fine.
For EMI/noise issues, a proper scope is mandatory.
2025-12-03 6:41 AM
Now I have seen that it's "only" running at 125 kbps and no BRS, that makes clock issues less probable.
I'd also check all EMI precautions, from PCB to PCB, so layout, grounding, case connections, cable connections, cable shields, ...
2025-12-03 6:49 AM - edited 2025-12-03 6:50 AM
Many of our customers have reported that CAN communication stops after a while when they use an RC-based clock. This happens with both CAN and FDCAN. So the first question we ask when a CAN communication issue is reported is: what clock source is used? 99% of these issues come from using the internal RC oscillator. We therefore recommend using a crystal or crystal oscillator with a suitable PPM rating, and I can confirm that this solved their issues.
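To put a number on "suitable PPM", here is a minimal sketch of the two classic CAN oscillator tolerance conditions, df <= SJW / (20 * NBT) and df <= min(PS1, PS2) / (2 * (13 * NBT - PS2)), where NBT is the number of time quanta per bit. The function name is hypothetical. Because TimeSeg1 lumps the propagation segment and phase segment 1 together, the split between them is an assumption the caller has to make.

```c
#include <assert.h>
#include <stdint.h>

/* Maximum oscillator tolerance per node, in ppm, from the two classic
 * CAN conditions:
 *   df <= SJW / (2 * 10 * NBT)
 *   df <= min(PS1, PS2) / (2 * (13 * NBT - PS2))
 * All arguments are in time quanta; nbt_tq is the total quanta per bit.
 * The result is rounded down to a whole ppm. */
static uint32_t can_max_osc_tolerance_ppm(uint32_t nbt_tq, uint32_t sjw_tq,
                                          uint32_t ps1_tq, uint32_t ps2_tq)
{
    assert(nbt_tq > 0u && 13u * nbt_tq > ps2_tq);

    uint32_t min_ps = (ps1_tq < ps2_tq) ? ps1_tq : ps2_tq;

    /* Limit set by resynchronisation within 10 bit times. */
    uint64_t resync = (1000000ull * sjw_tq) / (20ull * nbt_tq);

    /* Limit set by sampling correctly after an error flag. */
    uint64_t after_error = (1000000ull * min_ps)
                         / (2ull * (13ull * nbt_tq - ps2_tq));

    return (uint32_t)((resync < after_error) ? resync : after_error);
}
```

For the settings posted in this thread (40 time quanta per bit, SJW 8, TimeSeg2 8, and assuming at least 8 quanta of TimeSeg1 count as phase segment 1) the result is 7812 ppm, i.e. roughly 0.78 % per node. That figure should be compared with the RC oscillator accuracy over the full temperature and voltage range given in the datasheet. The limit depends only on the segment lengths in time quanta, not on the absolute bit rate.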
2025-12-03 6:49 AM
> Now I have seen that it's "only" running at 125 kbps and no BRS, that makes clock issues less probable
Yes, with only 1 m of cable, this is unlikely. I know networks with dozens of meters in length running at 250 kbit/s.
However, the use of the term "CAN-FD" by the OP is a bit confusing.
This standard would allow up to 8 MBit for certain parts inside the CAN frame.
2025-12-03 6:52 AM - edited 2025-12-03 6:54 AM
@Ozone wrote:
> > Now I have seen that it's "only" running at 125 kbps and no BRS, that makes clock issues less probable
> Yes, with only 1 m of cable, this is unlikely. I know networks with dozens of meters in length running at 250 kbit/s.
> However, the use of the term "CAN-FD" by the OP is a bit confusing.
> This standard would allow up to 8 MBit for certain parts inside the CAN frame.
I suppose he's using FDCAN in Classical mode: No BRS used.
2025-12-05 2:19 AM - last edited on 2025-12-05 2:29 AM by mƎALLEm
@mƎALLEm, @Ozone and @LCE, thank you for your responses.
We are continuing to debug a CAN (Classic CAN, 125 kbps) reliability issue and would like further guidance from the community. Below is the detailed information requested earlier, along with our latest observations.
1. Clock Source
The nodes are running from the internal RC oscillator. Unfortunately, the hardware design does not include a crystal oscillator, so we cannot switch to an external clock source.
2. Monitoring Error Counters
Based on your suggestions, we will update the firmware to continuously log the CAN error counters (TEC/REC) and protocol error status. We will share logs once available.
3. Nature of the Issue
We have seen that in some units the issue appears only for a few milliseconds, whereas in others it persists for several minutes. The behaviour is intermittent and varies across units deployed in the field.
4. Current Baudrate and Future Plan
We have reduced the baud rate to 12.5 kbps for testing.
In the next phase, we are planning to shift from continuous periodic communication (every 30 ms) to an event-based communication model. We would appreciate your thoughts on whether such a change would meaningfully improve robustness.
Since we cannot add an external crystal, we would also like to know if there are alternative methods to mitigate RC-oscillator drift-related problems.
5. Clock Calibration on STM32H743
Our second CAN node uses an STM32H743. We were planning to evaluate FDCAN Clock Calibration with 125 kbps (as described in ST’s application note “Introduction to FDCAN peripherals for STM32 MCUs”).
Introduction to FDCAN peripherals for STM32 MCUs - Application note
Would using clock calibration on only one node help in any practical way?
Configuration of the STM32G0B1 node:
Baudrate: 125 kbps
Bus length: ~1 meter
Termination: 120 Ω at both ends
FDCAN Core Clock: 50 MHz
Clock Divider: 1
BRS: Disabled
Auto-Retransmission: Disabled
Transmit Pause: Disabled
Protocol Exception: Disabled
Nominal Bit Timing:
Prescaler = 10
SyncJumpWidth = 8
TimeSeg1 = 31
TimeSeg2 = 8
Data Bit Timing:
(Same as nominal since BRS is disabled)
Filters:
Standard Filters: 1
Extended Filters: 0
Configuration of the STM32H743 node:
Baudrate: 125 kbps
Bus length: ~1 meter
Termination: 120 Ω at both ends
FDCAN Core Clock: 100 MHz
Clock Divider: 1
BRS: Disabled
Auto-Retransmission: Disabled
Transmit Pause: Disabled
Protocol Exception: Disabled
Nominal Bit Timing:
Prescaler = 20
SyncJumpWidth = 8
TimeSeg1 = 31
TimeSeg2 = 8
Data Bit Timing:
(Same as nominal)
Filters:
Standard Filters: 1
Extended Filters: 0
Any insights would be extremely helpful. Further suggestions for stabilizing communication without a crystal oscillator would also be appreciated.