[etherlab-users] Intermittent Large number of datagrams UNMATCHED

Thu Jul 14 22:41:53 CEST 2016

I have found the likely cause in my case.  It appears to be an erratum
in the Intel 82579 Ethernet Controller.  Ralf, if you have that
controller, I suggest you change it.

http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/6-and-c200-chipset-specification-update.pdf

Erratum 17 from the above document:

17. Intel® 82579 Gigabit Ethernet Controller Transmission Issue

Problem: Intel® 82579 Gigabit Ethernet Controller with the Intel 6
Series Chipset and Intel C200 Series Chipset and Intel ME Firmware 7.x
5 MB may stop transmitting during a data transfer.

Implication: Intel 82579 Gigabit Ethernet Controller may stop
transmitting packets, the link LED will blink, and a power cycle may be
required to resume transmission activity.

Note: This issue has only been observed in a focused test environment
where data is constantly transferred over an extended period of time
(more than approximately 3 hours).

Workaround: A combination of Intel ME Firmware code change and Intel
82579 Gigabit Ethernet Controller LAN Driver update has been identified
and may be implemented as a workaround for this erratum.

Status: No Plan to Fix.

On Mon, 2016-07-11 at 10:32 -0700, Henry Bausley wrote:
> FYI,
> 
>    Just changing the host Ethernet port seems to have alleviated our
> issues with UNMATCHED datagrams.  We saw something virtually identical
> to what Ralf reported.
> 
> [451886.660655] EtherCAT 0: Domain 0: Working counter changed to 0/13.
> [451886.660663] EtherCAT 0: Domain 1: Working counter changed to 0/14.
> [451887.168147] EtherCAT WARNING: Datagram cea4900c (domain0-0-main) was SKIPPED 44 times.
> [451887.168154] EtherCAT WARNING: Datagram cea49c0c (domain1-332-main) was SKIPPED 44 times.
> [451887.492141] EtherCAT WARNING 0: 1 datagram TIMED OUT!
> [451887.492148] EtherCAT WARNING 0: 731 datagrams UNMATCHED!
> [451887.661361] EtherCAT 0: Domain 0: Working counter changed to 13/13.
> [451887.661369] EtherCAT 0: Domain 1: Working counter changed to 14/14.
> 
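A minimal sketch of how log excerpts like the one above could be scanned automatically. The regular expressions are modelled only on the lines quoted in this thread, and the function name `parse_ethercat_log` and the event dictionary layout are illustrative assumptions, not part of the EtherCAT master.

```python
import re

# Patterns modelled on the kernel log lines quoted in this thread.
# Other master versions may format these messages differently.
WARNING_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+EtherCAT WARNING (?P<master>\d+): "
    r"(?P<count>\d+) datagrams? (?P<kind>TIMED OUT|UNMATCHED)!"
)
WC_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+EtherCAT (?P<master>\d+): "
    r"Domain (?P<domain>\d+): Working counter changed to "
    r"(?P<actual>\d+)/(?P<expected>\d+)\."
)


def parse_ethercat_log(text):
    """Return a list of event dicts for the lines that match a pattern.

    Lines that match neither pattern (including the SKIPPED warnings)
    are ignored.
    """
    events = []
    for line in text.splitlines():
        m = WARNING_RE.search(line)
        if m:
            events.append({
                "time": float(m["ts"]),
                "master": int(m["master"]),
                "kind": m["kind"],
                "count": int(m["count"]),
            })
            continue
        m = WC_RE.search(line)
        if m:
            events.append({
                "time": float(m["ts"]),
                "master": int(m["master"]),
                "kind": "WORKING_COUNTER",
                "domain": int(m["domain"]),
                "actual": int(m["actual"]),
                "expected": int(m["expected"]),
            })
    return events


def total_unmatched(events):
    """Sum the datagram counts of all UNMATCHED warnings."""
    return sum(e["count"] for e in events if e["kind"] == "UNMATCHED")
```

Feeding it the text of a saved kernel log gives a list that can be filtered for working-counter drops (`actual < expected`) or summed with `total_unmatched`.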
> 
>   In our case the Advantech UNO industrial PC has 4 Ethernet ports built
> into it.  Only the 1st Ethernet port, built into the motherboard,
> exhibits the issue.  It shows up as "Ethernet controller: Intel
> Corporation 82579LM Gigabit Network Connection".  It appears that the
> 82579LM is just a PHY, so I assume the MAC is in the Intel Corporation
> 6 Series/C200 Series Chipset.
> 
>   The other 3 ports are actually PCI Express MAC/PHYs.  They show up as
> Intel Corporation 82574L Gigabit Network.  Those 3 ports do not exhibit
> the UNMATCHED datagram issue.
> 
>   When using ethtool -k, the only difference I see between the 82579LM
> and the three 82574L ports is rx-vlan-filter: off for the 82579LM.
> rx/tx-checksumming is on for all adapters.
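The comparison of `ethtool -k` output described above could be automated roughly as follows. This is a sketch that assumes the usual `name: on|off [fixed]` line layout of `ethtool -k`; the function names are hypothetical, and the captured text has to be supplied by the caller.

```python
def parse_features(output):
    """Map feature name -> 'on'/'off' from captured `ethtool -k` text."""
    features = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if not sep or not value:
            # Skips blank lines and the "Features for <iface>:" header,
            # which has nothing after the colon.
            continue
        # Drop trailing annotations such as "[fixed]".
        features[name.strip()] = value.split()[0]
    return features


def diff_features(a, b):
    """Return {feature: (value_in_a, value_in_b)} for differing features.

    A feature missing from one side is reported with None on that side.
    """
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Calling `diff_features(parse_features(text_a), parse_features(text_b))` on the output captured for two ports lists only the offload settings that differ between them.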
> 
>   FYI,
>     The registers 0x300 and 0x310 remained 0 after the UNMATCHED
> datagram error occurred. 
> 
>   I suggest you look into changing the NIC, Ralf.
> 
> On Mon, 2016-07-04 at 08:29 +0200, Ralf Roesch wrote:
> > We also are fighting with this type of problem on a customer laser
> > cutting machine.
> > Occasionally we see errors like this:
> > [122501.934306] EtherCAT 0: Domain 0: Working counter changed to 0/9.
> > [122501.934346] EtherCAT 0: Domain 1: Working counter changed to 0/9.
> > [122502.320449] EtherCAT WARNING 0: 5 datagrams TIMED OUT!
> > [122502.935224] EtherCAT 0: Domain 0: Working counter changed to 9/9.
> > [122502.935265] EtherCAT 0: Domain 1: Working counter changed to 9/9.
> > 
> > This was the reason I modified the ethercat command line tool for
> > extended diagnostics regarding several ESC error registers.
> > 
> > Attached you will find a patch which might help you.
> > After applying and building the ethercat command line tool it will
> > provide a new command "diag".
> >       * Shortly after your ethercat master has been started
> >         successfully, call:
> >         ethercat diag -r
> >         This will reset all slaves' ESC error registers, including the
> >         Lost Link Counter Register and the RX Error Counter Register.
> >       * If you detect an UNMATCHED or TIMED OUT error (sometimes
> >         after hours or days), call:
> >         ethercat diag
> >         If you are lucky, you will find one or more ESC errors
> >         displayed on your console.
> >         To better understand the displayed errors, refer to the
> >         picture
> >         http://www.automation.com/images/article/ethercat/Figure14.jpg
> >         (part of
> >         http://www.automation.com/automation-news/article/diagnostics-with-ethercat-part-4).
> > 
> > Would be happy about any kind of feedback.
> > 
> > 
> > @Henry: which type of drives do you use?
> > 
> > 
> > Regards,
> > Ralf
> > 
> > 
> > 
> > On Mon Jul 04 2016 05:19:58 GMT+0200 (CEST), Graeme Foot
> > <Graeme.Foot at touchcut.com> wrote:
> > 
> > > The only time we've had issues like that has been due to either a dodgy network cable or an RJ45 plug getting a bit grubby.  The first thing I usually do is unplug/replug all the plugs a few times to clean up the connections.  If it persists, then I start looking for bad cables.
> > > 
> > > Another option is that there is an occasional noisy process causing noise on one of the links.
> > > 
> > > Once or twice (only on non-ethercat machines so far) we've had cables that were in drag chains wearing out, where it showed a problem when at a specific position of the drag chain.
> > > 
> > > You could track down whether the problem is with a link between two particular slaves by checking each slave's Lost Link Counter and CRC Bad Counter values.
> > > - Lost Link Counter Register (0x0310:0x0313)
> > > - RX Error Counter Register (0x0300:0x0307)
> > > 
> > > This link describes some of the diagnostics:
> > > http://www.automation.com/automation-news/article/diagnostics-with-ethercat-part-4
> > > 
> > > I think you can set the above registers to zero after the fieldbus is up and running, then you can check them if a problem occurs.
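As a rough illustration of how raw dumps of the two register ranges above could be interpreted: the sketch below assumes the per-port layout given in Beckhoff's ESC register description (for port p, invalid frame counter at 0x0300+2p, RX error counter at 0x0301+2p, lost link counter at 0x0310+p). The function names are hypothetical, and obtaining the bytes (for example with the master's `ethercat reg_read` command) is left to the caller.

```python
def decode_error_counters(rx_error, lost_link):
    """Decode ESC error counters from raw register bytes.

    rx_error:  8 bytes read from 0x0300:0x0307
    lost_link: 4 bytes read from 0x0310:0x0313
    Assumed layout: two bytes per port in rx_error (invalid frames, then
    RX errors) and one byte per port in lost_link.
    """
    if len(rx_error) != 8:
        raise ValueError("expected 8 bytes for 0x0300:0x0307")
    if len(lost_link) != 4:
        raise ValueError("expected 4 bytes for 0x0310:0x0313")
    ports = []
    for port in range(4):
        ports.append({
            "port": port,
            "invalid_frames": rx_error[2 * port],
            "rx_errors": rx_error[2 * port + 1],
            "lost_links": lost_link[port],
        })
    return ports


def suspicious_ports(ports):
    """Return the port numbers on which any counter is non-zero."""
    return [
        p["port"]
        for p in ports
        if p["invalid_frames"] or p["rx_errors"] or p["lost_links"]
    ]
```

Running this per slave after a fault, with the counters zeroed at startup as suggested above, would point at the slave port (and hence the cable segment) where errors accumulated. If every counter on every slave is still zero, the physical links between slaves are less likely to be the cause.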
> > > 
> > > 
> > > Haven't actually done it yet myself, so would be interested to see if it helps you.
> > > 
> > > 
> > > Regards,
> > > Graeme.
> > > 
> > > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: etherlab-users [mailto:etherlab-users-bounces at etherlab.org] On Behalf Of Henry Bausley
> > > Sent: Saturday, 2 July 2016 5:56 a.m.
> > > To: etherlab-users at etherlab.org
> > > Subject: [etherlab-users] Intermittent Large number of datagrams UNMATCHED
> > > 
> > > 
> > > 
> > > We have an EtherLab 1.5.2 kernel mode application running in xenomai
> > > 2.4.6 on Ubuntu 14.04.1 Desktop that will, on rare occasions, get a large number of datagrams UNMATCHED.  It occurs at random times and relatively rarely, but when it occurs it can result in disaster, as we are running a large number of servos in torque mode.
> > > 
> > > For example, we can run continuously, 24 hours a day, for 5 days and then get a message like the one below.
> > > 
> > > [591785.735172] EtherCAT WARNING 0: 616 datagrams UNMATCHED!
> > > 
> > > I am struggling as to where to look.  Is this something in our app or a known bug in the stack?
> > > 
> > > 
> > > _______________________________________________
> > > etherlab-users mailing list
> > > etherlab-users at etherlab.org
> > > http://lists.etherlab.org/mailman/listinfo/etherlab-users