[etherlab-dev] CCAT startup failures

Patrick Brünn P.Bruenn at beckhoff.com
Tue Oct 27 07:57:52 CET 2015


Hi Scott,
When I investigated this issue I saw only bad addresses for the tx memory, which was a perfect fit for the startup stall.
But of course you are right, before the patch we played roulette and the same could happen to the rx memory, which would turn our game into Russian roulette...

Thanks for reporting and discussing this serious bug!

Best regards,
Patrick

From: Tillman, Scott [mailto:Scott.Tillman at bhemail.com]
Sent: Montag, 26. Oktober 2015 15:37
To: Patrick Brünn <P.Bruenn at beckhoff.com>; etherlab-dev at etherlab.org
Subject: RE: CCAT startup failures

Hi Patrick,

Initial results are very good.  I've been through about 20 reboot cycles so far without the stall occurring.  Before applying this patch the stalls occurred on about 2/3 of my reboots.  I'll keep looking but that seems to have resolved it.

The hard lock I reported only occurred when trying to unload the ec_ccat driver after a stalled startup, and even then, only on occasion.  Beyond that, no reliable reproduction was found.  Given that this patch appears to fix a bug in the DMA mapping passed to the ccat hardware I would guess that was the root cause of the occasional hangs?  Wouldn't the prior behavior cause the ccat hardware to write data into the wrong physical pages of memory?  If you're lucky that's an "unimportant" data page, if not it's some block of code that someone eventually calls.  Very ugly.

Thanks again for the patch and the quick response.

-Scott

From: Patrick Brünn [mailto:P.Bruenn at beckhoff.com]
Sent: Monday, October 26, 2015 4:22 AM
To: Tillman, Scott <Scott.Tillman at bhemail.com<mailto:Scott.Tillman at bhemail.com>>; etherlab-dev at etherlab.org<mailto:etherlab-dev at etherlab.org>
Subject: RE: CCAT startup failures

Hi Scott,
To fix the startup stall, please try the appended patch.
I tested it over the weekend and found no more "BUG! Tx Ring full" in the kernel log this morning.
However, I cannot reproduce the hard lockups. Do you have any hint on how to provoke them?

My setup:
Beckhoff CX5130
Ubuntu 14.04
Kernel 4.1.7-rt8
Stable 1.5.2 Etherlab with recent CCAT v0.13 bundle

My test script:
while :
do
        /etc/init.d/ethercat restart
        sleep 5
done

The patch:

# HG changeset patch
# User Patrick Bruenn <p.bruenn at beckhoff.com<mailto:p.bruenn at beckhoff.com>>
# Date 1445844490 -3600
#      Mon Oct 26 08:28:10 2015 +0100
# Branch stable-1.5
# Node ID 1d9be5716deec3c7e6a21dae85a8ad2ef8825b92
# Parent  0613017547feee7a2189fe71724ca5bcd1fcb61f
fix ccat startup stall. In case the dma memory was allocated above 4 GB limit, CCAT is unable to access it.
    -> limit the dma_mask to 32-bit to omit this situation

diff -r 0613017547fe -r 1d9be5716dee devices/ccat/module.c
--- a/devices/ccat/module.c     Tue Oct 13 12:24:16 2015 +0200
+++ b/devices/ccat/module.c     Mon Oct 26 08:28:10 2015 +0100
@@ -288,9 +288,8 @@
                goto cleanup_pci_device;
        }

-       if (!dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64))) {
-               pr_debug("64 bit DMA supported, pci rev: %u\n", revision);
-       } else if (!dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32))) {
+       /* CCAT is unable to access memory above 4 GB */
+       if (!dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32))) {
                pr_debug("32 bit DMA supported, pci rev: %u\n", revision);
        } else {
                pr_warn("No suitable DMA available, pci rev: %u\n", revision);


Regards,
Patrick


From: etherlab-dev [mailto:etherlab-dev-bounces at etherlab.org] On Behalf Of Tillman, Scott
Sent: Freitag, 23. Oktober 2015 05:52
To: etherlab-dev at etherlab.org<mailto:etherlab-dev at etherlab.org>
Subject: [etherlab-dev] CCAT startup failures

I am having problems with the startup of the CCAT driver (both v12 to the new v13).  When I boot up the driver is loading (ccat shows via lsmod).  When I start the master it swaps over to ec_ccat, but sometimes the master never begins communicating on the bus.  In these situations I can often remove and reinsert the ec_ccat module and the master will start up.  Occasionally I get a hard lockup of the system (Alt-SysRq-B won't reboot) when trying to unload or reload the module, but that's possibly unrelated.  I'm looking for solutions to both problems (startup stall and hard lockups).

System details:
Beckhoff CX5130
                Ubuntu 14.04 (Server) with ubuntu-desktop  installed
                Kernel 3.16.7 with RTAI 4.1 ipipe patch
                Stable 1.5.2 Etherlab with recent CCAT v0.13 bundle
Some changes that are included in the bundle I've already posted

I am accessing the master via the user space library and using the LXRT library for accessing RTAI.

At this point I'm looking for suggestions on how to try to establish the root cause of these problems.  I don't have much experience debugging kernel crashes, and what experience I do have is 15 years out of date.  I have no prior exposure to the linux network subsystem.

Any help would be appreciated.  Below are snippets of the syslog around various interesting conditions.  Let me know if I can provide any other possibly useful data.


-Scott Tillman

Advisory Software Engineer

Bell + Howell

919-767-6745


What I know so far:

I occasionally see this constantly repeating in my syslog during the stalled state:

Oct 21 10:58:03 ubuntu kernel: [   17.184386] (unregistered net_device): BUG! Tx Ring full when queue awake!
Oct 21 10:58:03 ubuntu kernel: [   17.237043] (unregistered net_device): BUG! Tx Ring full when queue awake!
Oct 21 10:58:03 ubuntu kernel: [   17.289702] (unregistered net_device): BUG! Tx Ring full when queue awake!
Oct 21 10:58:03 ubuntu kernel: [   17.342599] (unregistered net_device): BUG! Tx Ring full when queue awake!

etc.  This continues until I rmmod ec_ccat.

Relevant logs on initial startup try in stalled state:

Oct 22 23:25:58 ubuntu dhclient: DHCPDISCOVER on CX5130_x001 to 255.255.255.255 port 67 interval 3 (xid=0x730db6c3)
Oct 22 23:25:58 ubuntu kernel: [   12.476197] EtherCAT: Master driver 1.5.2 71593adce088+
Oct 22 23:25:58 ubuntu kernel: [   12.481558] EtherCAT 0: Registering RTDM device EtherCAT.
Oct 22 23:25:58 ubuntu kernel: [   12.481586] EtherCAT: 1 master waiting for devices.
Oct 22 23:25:58 ubuntu avahi-daemon[623]: Interface CX5130_ebus.IPv6 no longer relevant for mDNS.
Oct 22 23:25:58 ubuntu avahi-daemon[623]: Leaving mDNS multicast group on interface CX5130_ebus.IPv6 with address fe80::201:5ff:fe1a:d56d.
Oct 22 23:25:58 ubuntu avahi-daemon[623]: Withdrawing address record for fe80::201:5ff:fe1a:d56d on CX5130_ebus.
Oct 22 23:25:58 ubuntu avahi-daemon[623]: Withdrawing workstation service for CX5130_ebus.
Oct 22 23:25:58 ubuntu NetworkManager[964]:    SCPlugin-Ifupdown: devices removed (path: /sys/devices/virtual/net/CX5130_ebus, iface: CX5130_ebus)
Oct 22 23:25:58 ubuntu kernel: [   12.603567] ec_ccat: registered ec_ccat as gpio chip with #12 GPIOs.
Oct 22 23:25:58 ubuntu kernel: [   12.603770] EtherCAT: Accepting 00:01:05:1A:D5:6D as main device for master 0.
Oct 22 23:25:58 ubuntu kernel: [   12.603794] EtherCAT 0: Starting EtherCAT-IDLE thread.
Oct 22 23:25:58 ubuntu kernel: [   12.603846] (unregistered net_device): NIC Link is Up
Oct 22 23:25:58 ubuntu kernel: [   12.603870] EtherCAT 0: Link state of ecm0 changed to UP.
Oct 22 23:25:58 ubuntu kernel: [   12.608008] ec_ccat: registered ccat_update0.
Oct 22 23:25:58 ubuntu kernel: [   12.608031] ec_ccat: ccat_sram_probe: 0x0016 rev: 0x0000
Oct 22 23:25:58 ubuntu kernel: [   12.655994] EtherCAT WARNING 0: 1 datagram TIMED OUT!
Oct 22 23:25:59 ubuntu NetworkManager[964]: <warn> error monitoring device for netlink events: error processing netlink message: Object busy
Oct 22 23:25:59 ubuntu avahi-daemon[623]: Joining mDNS multicast group on interface CX5130_x001.IPv6 with address fe80::201:5ff:fe1d:7dac.
Oct 22 23:25:59 ubuntu avahi-daemon[623]: New relevant interface CX5130_x001.IPv6 for mDNS.
Oct 22 23:25:59 ubuntu avahi-daemon[623]: Registering new address record for fe80::201:5ff:fe1d:7dac on CX5130_x001.*.
Oct 22 23:25:59 ubuntu kernel: [   13.706520] EtherCAT WARNING 0: 20 datagrams TIMED OUT!
Oct 22 23:26:00 ubuntu kernel: [   14.748575] EtherCAT WARNING 0: 20 datagrams TIMED OUT!
Oct 22 23:26:01 ubuntu dbus[443]: [system] Activating service name='org.freedesktop.UPower' (using servicehelper)
Oct 22 23:26:01 ubuntu dbus[443]: [system] Successfully activated service 'org.freedesktop.UPower'
Oct 22 23:26:01 ubuntu dhclient: DHCPDISCOVER on CX5130_x001 to 255.255.255.255 port 67 interval 5 (xid=0x730db6c3)
Oct 22 23:26:02 ubuntu kernel: [   15.790895] EtherCAT WARNING 0: 20 datagrams TIMED OUT!
Oct 22 23:26:02 ubuntu kernel: [   15.893849] (unregistered net_device): BUG! Tx Ring full when queue awake!
Oct 22 23:26:02 ubuntu kernel: [   15.946642] (unregistered net_device): BUG! Tx Ring full when queue awake!
Oct 22 23:26:02 ubuntu kernel: [   16.000730] (unregistered net_device): BUG! Tx Ring full when queue awake!

At this time the 'ethercat master' command returns something like this:

Master0
  Phase: Idle
  Active: no
  Slaves: 0
  Ethernet devices:
    Main: 00:01:05:1a:d5:6d (attached)
      Link: UP
      Tx frames:   63
      Tx bytes:    3780
      Rx frames:   0
      Rx bytes:    0
      Tx errors:   1294
      Tx frame rate [1/s]:      0      0      0
      Tx rate [KByte/s]:      0.0    0.0    0.1
      Rx frame rate [1/s]:      0      0      0
      Rx rate [KByte/s]:      0.0    0.0    0.0
    Common:
      Tx frames:   63
      Tx bytes:    3780
      Rx frames:   0
      Rx bytes:    0
      Lost frames: 63
      Tx frame rate [1/s]:      0      0      0
      Tx rate [KByte/s]:      0.0    0.0    0.1
      Rx frame rate [1/s]:      0      4   2405
      Rx rate [KByte/s]:      0.0    4.7 1455.5
      Loss rate [1/s]:          0      0      0
      Frame loss [%]:         0.0  100.0  100.0
  Distributed clocks:
    Reference clock: None
    Application time: 0
                      2000-01-01 00:00:00.000000000

On an unsuccessful unload/reload of ec_ccat:

Oct 22 23:30:33 ubuntu kernel: [  287.196371] (unregistered net_device): BUG! Tx Ring full when queue awake!
Oct 22 23:30:33 ubuntu kernel: [  287.249216] (unregistered net_device): BUG! Tx Ring full when queue awake!
Oct 22 23:30:33 ubuntu kernel: [  287.301650] EtherCAT 0: Master thread exited.
Oct 22 23:30:33 ubuntu kernel: [  287.301675] EtherCAT 0: Releasing main device 00:01:05:1A:D5:6D.
Oct 22 23:30:40 ubuntu kernel: [  293.382320] ec_ccat: registered ec_ccat as gpio chip with #12 GPIOs.
Oct 22 23:30:40 ubuntu kernel: [  293.382606] EtherCAT: Accepting 00:01:05:1A:D5:6D as main device for master 0.
Oct 22 23:30:40 ubuntu kernel: [  293.382624] EtherCAT 0: Starting EtherCAT-IDLE thread.
Oct 22 23:30:40 ubuntu kernel: [  293.382691] (unregistered net_device): NIC Link is Up
Oct 22 23:30:40 ubuntu kernel: [  293.382715] EtherCAT 0: Link state of ecm0 changed to UP.
Oct 22 23:30:40 ubuntu kernel: [  293.384380] EtherCAT WARNING 0: 8 datagrams TIMED OUT!
Oct 22 23:30:40 ubuntu kernel: [  293.386028] ec_ccat: registered ccat_update0.
Oct 22 23:30:40 ubuntu kernel: [  293.386073] ec_ccat: ccat_sram_probe: 0x0016 rev: 0x0000
Oct 22 23:30:41 ubuntu kernel: [  294.431521] EtherCAT WARNING 0: 20 datagrams TIMED OUT!
Oct 22 23:30:42 ubuntu kernel: [  295.472201] EtherCAT WARNING 0: 20 datagrams TIMED OUT!
Oct 22 23:30:43 ubuntu kernel: [  296.511475] EtherCAT WARNING 0: 20 datagrams TIMED OUT!
Oct 22 23:30:43 ubuntu kernel: [  296.669281] (unregistered net_device): BUG! Tx Ring full when queue awake!
Oct 22 23:30:43 ubuntu kernel: [  296.721893] (unregistered net_device): BUG! Tx Ring full when queue awake!
Oct 22 23:30:43 ubuntu kernel: [  296.774468] (unregistered net_device): BUG! Tx Ring full when queue awake!

On a successful unload and reload of ec_ccat:

Oct 22 23:32:34 ubuntu kernel: [  408.038227] (unregistered net_device): BUG! Tx Ring full when queue awake!
Oct 22 23:32:34 ubuntu kernel: [  408.090745] (unregistered net_device): BUG! Tx Ring full when queue awake!
Oct 22 23:32:34 ubuntu kernel: [  408.143279] (unregistered net_device): BUG! Tx Ring full when queue awake!
Oct 22 23:32:34 ubuntu kernel: [  408.188016] EtherCAT 0: Master thread exited.
Oct 22 23:32:34 ubuntu kernel: [  408.188049] EtherCAT 0: Releasing main device 00:01:05:1A:D5:6D.
Oct 22 23:32:37 ubuntu kernel: [  411.085382] ec_ccat: registered ec_ccat as gpio chip with #12 GPIOs.
Oct 22 23:32:37 ubuntu kernel: [  411.085704] EtherCAT: Accepting 00:01:05:1A:D5:6D as main device for master 0.
Oct 22 23:32:37 ubuntu kernel: [  411.085726] EtherCAT 0: Starting EtherCAT-IDLE thread.
Oct 22 23:32:37 ubuntu kernel: [  411.085788] (unregistered net_device): NIC Link is Up
Oct 22 23:32:37 ubuntu kernel: [  411.085809] EtherCAT 0: Link state of ecm0 changed to UP.
Oct 22 23:32:37 ubuntu kernel: [  411.086907] ec_ccat: registered ccat_update0.
Oct 22 23:32:37 ubuntu kernel: [  411.086995] ec_ccat: ccat_sram_probe: 0x0016 rev: 0x0000
Oct 22 23:32:37 ubuntu kernel: [  411.090003] EtherCAT WARNING 0: 2 datagrams TIMED OUT!
Oct 22 23:32:37 ubuntu kernel: [  411.090014] EtherCAT WARNING 0: 1 datagram UNMATCHED!
Oct 22 23:32:37 ubuntu kernel: [  411.094116] EtherCAT 0: 14 slave(s) responding on main device.
Oct 22 23:32:37 ubuntu kernel: [  411.094128] EtherCAT 0: Slave states on main device: PREOP.
Oct 22 23:32:37 ubuntu kernel: [  411.094226] EtherCAT 0: Scanning bus.
Oct 22 23:32:39 ubuntu kernel: [  412.698595] EtherCAT 0: Bus scanning completed in 1604 ms.
Oct 22 23:32:39 ubuntu kernel: [  412.698606] EtherCAT 0: Using slave 0 as DC reference clock.
Oct 22 23:32:39 ubuntu kernel: [  412.700210] EtherCAT: Requesting master 0...
Oct 22 23:32:39 ubuntu kernel: [  412.700226] EtherCAT: Successfully requested master 0.
Oct 22 23:32:39 ubuntu kernel: [  412.709429] EtherCAT 0: Domain0: Logical address 0x00000000, 114 byte, expected working counter 32.
Oct 22 23:32:39 ubuntu kernel: [  412.709440] EtherCAT 0:   Datagram domain0-0-main: Logical offset 0x00000000, 114 byte, type LRW.
Oct 22 23:32:39 ubuntu kernel: [  412.709833] EtherCAT 0: Master thread exited.
Oct 22 23:32:39 ubuntu kernel: [  412.709858] EtherCAT 0: Starting EtherCAT-OP thread.
Oct 22 23:32:39 ubuntu kernel: [  412.710010] EtherCAT WARNING 0: 1 datagram TIMED OUT!
Oct 22 23:32:39 ubuntu kernel: [  412.742278] EtherCAT WARNING 0: No app_time received up to now, but master already active.
Oct 22 23:32:39 ubuntu kernel: [  412.784239] EtherCAT 0: Domain 0: Working counter changed to 1/32.
Oct 22 23:32:39 ubuntu kernel: [  412.895474] EtherCAT WARNING 0-2: Slave does not support changing the PDO mapping!
Oct 22 23:32:39 ubuntu kernel: [  412.895484] EtherCAT WARNING 0-2: Currently mapped PDO entries: 0x7010:01/1 0x7010:02/1 0x7010:03/1 0x7010:04/1 0x0000:00/4 0x0000:00/8 0x7010:11/32. Entries to map: 0x7010:01/1 0x7010:02/1 0x7010:03/1 0x7010:04/1 0x0000:00/12 0x7010:11/32
Oct 22 23:32:39 ubuntu kernel: [  413.102650] EtherCAT WARNING 0-3: Slave does not support changing the PDO mapping!
Oct 22 23:32:39 ubuntu kernel: [  413.102660] EtherCAT WARNING 0-3: Currently mapped PDO entries: 0x6000:01/1 0x6000:02/1 0x6000:03/2 0x6000:05/2 0x6000:07/1 0x0000:00/1 0x0000:00/6 0x1800:07/1 0x1800:09/1 0x6000:11/16. Entries to map: 0x6000:01/1 0x6000:02/1 0x6000:03/2 0x6000:05/2 0x6000:07/1 0x0000:01/1 0x0000:02/6 0x1800:07/1 0x1800:09/1 0x6000:11/16
Oct 22 23:32:40 ubuntu kernel: [  413.714330] EtherCAT WARNING 0: 6 datagrams UNMATCHED!
Oct 22 23:32:40 ubuntu kernel: [  413.715135] EtherCAT WARNING: Datagram ffff880096230d98 (domain0-0-main) was SKIPPED 6 times.
Oct 22 23:32:40 ubuntu kernel: [  413.787102] EtherCAT 0: Domain 0: 7 working counter changes - now 13/32.
Oct 22 23:32:40 ubuntu kernel: [  414.508889] EtherCAT 0: Slave states on main device: OP.
Oct 22 23:32:41 ubuntu kernel: [  414.792978] EtherCAT 0: Domain 0: 5 working counter changes - now 21/32.

I am aware of the PDO mapping warnings displayed (this system happened to have a new type of slice), but the stalls were occurring well before I added that device's support, plus the stalls are before any comms with the slaves anyway.

Beckhoff Automation GmbH & Co. KG | Managing Director: Dipl. Phys. Hans Beckhoff
Registered office: Verl, Germany | Register court: Guetersloh HRA 7075

Beckhoff Automation GmbH & Co. KG | Managing Director: Dipl. Phys. Hans Beckhoff
Registered office: Verl, Germany | Register court: Guetersloh HRA 7075

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20151027/5616f34c/attachment-0003.htm>


More information about the Etherlab-dev mailing list