[etherlab-users] Intermittent Large number of datagrams UNMATCHED

Mon Jul 4 07:10:55 CEST 2016

On Monday, 4 July 2016 15:20, quoth Graeme Foot:
> You could track down if it's a problem with a link between two particular
> slaves by checking each slave's Lost Link Counter and CRC Bad Counter
> values.
> - Lost Link Counter Register (0x0310:0x0313)
> - RX Error Counter Register (0x0300:0x0307)
[...]
> Haven't actually done it yet myself, so would be interested to see if it
> helps you.

FWIW, I monitor these (and others) periodically in the background (for
logging purposes) and they do indeed help to identify faulty cables or EMI
locations.  (This is one of the reasons I added register read+write support,
since it lets you fetch-and-clear the error counters atomically.)
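As a rough illustration of what to do with those registers once you have
the raw bytes, here is a minimal sketch in C.  It assumes the usual Beckhoff
ESC layout (two bytes per port at 0x0300:0x0307, invalid-frame count first
and RX-error count second, and one lost-link byte per port at
0x0310:0x0313); check your slave controller's datasheet before relying on
it.  The struct and function names are made up for the example and are not
part of the Etherlab API.

```c
#include <stdint.h>

#define ESC_PORTS 4

/* Per-port view of the error counters (names are hypothetical). */
struct port_errors {
    uint8_t invalid_frame; /* assumed at 0x0300 + 2 * port */
    uint8_t rx_error;      /* assumed at 0x0301 + 2 * port */
    uint8_t lost_link;     /* assumed at 0x0310 + port */
};

/* Split the 8 bytes read from 0x0300 and the 4 bytes read from 0x0310
 * into one record per port. */
void decode_error_counters(const uint8_t rx[8], const uint8_t lost[4],
                           struct port_errors out[ESC_PORTS])
{
    for (int port = 0; port < ESC_PORTS; port++) {
        out[port].invalid_frame = rx[2 * port];
        out[port].rx_error = rx[2 * port + 1];
        out[port].lost_link = lost[port];
    }
}

/* Return the lowest-numbered port with any nonzero counter, or -1 if all
 * counters are zero. */
int first_faulty_port(const struct port_errors errs[ESC_PORTS])
{
    for (int port = 0; port < ESC_PORTS; port++) {
        if (errs[port].invalid_frame || errs[port].rx_error ||
            errs[port].lost_link)
            return port;
    }
    return -1;
}
```

Feed it whatever bytes your register read returns (from your own
application code or from the `ethercat` command-line tool), and compare the
faulty port on one slave with the facing port on its neighbour to narrow
down which cable segment is suspect.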

UNMATCHED datagrams occur when a frame is received but it appears to contain
datagrams that the master didn't think it sent.  Besides a network-side
fault, this can also happen if your application violates the Etherlab
master's locking assumptions (e.g. if you're calling into the master from
multiple threads -- this isn't impossible, but care is required, and the
type of care needed varies between kernel/user/RTDM and with the version
you're using).

Also note that (at least as far as I can tell) Etherlab does not verify the
CRC/FCS of incoming frames on its own (although it does do some sanity
checks) -- it relies on the network hardware or driver to drop invalid
frames before they get to it.  If the cause is external interference then
you should probably be seeing TIMED OUT datagrams, not UNMATCHED ones,
unless your hardware/driver is passing these frames on instead of dropping
them.  Check whether your card has an "ignore checksums" and/or "checksum
offload" setting and whether it is enabled.  It might also be interesting
to monitor the EtherCAT network with Wireshark, either on the master PC
itself via the debug interface (if enabled) or via another PC spying on the
link between the master and first slave, and see what changes in the
network traffic around the time of the event.
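If you do manage to capture raw frames with the FCS still attached (many
capture setups strip it, so this is an assumption about your setup), you
can check for yourself whether corrupted frames are getting through.  The
following is a minimal sketch in C of the standard Ethernet CRC-32 check;
it is not code from the Etherlab master, and the function names are made up
for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320, initial value
 * 0xFFFFFFFF and final inversion, as used for the Ethernet FCS. */
uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, else zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}

/* Return 1 if the last four bytes of the frame (the FCS, stored least
 * significant byte first) match the CRC of the bytes before them,
 * otherwise 0.  Assumes the buffer starts at the destination MAC and
 * still includes the FCS. */
int frame_fcs_ok(const uint8_t *frame, size_t len)
{
    if (len < 5)
        return 0;

    size_t body = len - 4;
    uint32_t expected = crc32_ieee(frame, body);
    uint32_t received = (uint32_t)frame[body]
                      | (uint32_t)frame[body + 1] << 8
                      | (uint32_t)frame[body + 2] << 16
                      | (uint32_t)frame[body + 3] << 24;

    return expected == received;
}
```

A frame that fails this check but still reaches the master would suggest
that the card or driver is passing bad frames up instead of dropping them.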

There are some robustness improvements on the default branch, plus some
additional fixes in separate patches on the dev list.  While it's not too
likely to help in this case, you might want to consider giving some or all
of them a try to see whether they help with your issues -- though as Graeme
suggested, check your network first.