[etherlab-users] Redundancy support
Gavin Lambert
gavinl at compacsort.com
Wed Mar 4 08:29:02 CET 2015
On 27 February 2015 22:06, quoth Richard Hacker:
> > I have a question regarding support for cable redundancy in the
> > stable-1.5 branch.
> >
> > I know that it has options for enabling a "backup" network port on the
> > PC and connecting the end of a single chain to this port. Presumably
> > this is mostly transparent to the application code (although it can
> > query for status)?
> >
> > Does it also support redundant tree links similarly?
>
> In principle it should work, although I have not tested it. The trick with
> redundancy is, that the number of visible slaves and the order of packet
> traversal must not change when a single link is destroyed.
>
> You are quite correct in the assumption that redundancy is transparent to
> the application. The status is only required to report a redundant state
> or not, otherwise redundancy would be useless to the user. The state is
> not required by the application to select another source/destination of
> data.

Yes, that's all I had in mind: displaying some sort of warning to the user
that their network might have issues.
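
A minimal sketch of such a warning, reporting only when the redundancy state changes so that the message is not repeated every cycle. The `redundancy_monitor_*` names are hypothetical and not part of the master; in a real application the `active` value would presumably come from the `redundancy_active` field that `ecrt_domain_state()` fills in (treat that as an assumption to check against your version of ecrt.h).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: remembers the last redundancy flag it has seen. */
typedef struct {
    int last_active;   /* 0 or 1 once sampled, -1 before the first sample */
} redundancy_monitor_t;

static void redundancy_monitor_init(redundancy_monitor_t *mon)
{
    assert(mon != NULL);
    mon->last_active = -1;
}

/*
 * Feed one sample per cycle. `active` is assumed to be the application's
 * copy of the domain's "redundant link in use" flag.
 * Returns 1 and writes a message into buf when there is something new to
 * show the user, otherwise returns 0 and leaves buf untouched.
 */
static int redundancy_monitor_update(redundancy_monitor_t *mon, int active,
                                     char *buf, size_t len)
{
    int first;

    assert(mon != NULL && buf != NULL && len > 0);
    active = active ? 1 : 0;
    if (active == mon->last_active)
        return 0;                      /* no change since last sample */

    first = (mon->last_active == -1);
    mon->last_active = active;
    if (first && !active)
        return 0;                      /* healthy at startup: stay quiet */

    snprintf(buf, len, "%s",
             active ? "WARNING: redundant link in use, check cabling"
                    : "INFO: network back to normal");
    return 1;
}
```

Calling `redundancy_monitor_update()` once per cycle and showing `buf` whenever it returns 1 yields one message per transition rather than one per cycle.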

On a related note, I've recently been testing basic redundancy (a single
loop without internal subloops) and I've noticed some things that seem odd
to me:

1. On a two-slave network with the break between the two slaves (so one
slave on each master link), the log messages identify both slaves as "0-0",
making it hard to see what's going on. I've already written a patch to
improve this, which I'll include in the patch bundle that I've been
threatening to send to the dev list for a few months now. ;)

2. A few things seem to work only on the main link, not the backup link
(unless I'm missing something). Register requests (maybe only some types?)
seem to be one of them, and I'm dubious about the DC sync behaviour as
well -- I don't think the RMW broadcast sync to the refclock is really going
to work on a link that doesn't contain the refclock. The transmission delay
measurements seem incorrect too.

3. When the etherlab master service is started with the network initially in
a "good" state, the first break that activates redundancy takes about 2
seconds to resolve (which looks like a standard network link-up delay). If
the break is then fixed, later breaks in the same spot resolve almost
instantly. (I haven't yet tested with a large enough network to check breaks
in different places.)
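
To illustrate the ambiguity in point 1, here is a small sketch of a label that includes the link name, so that the first slave on each link is no longer shown as the same "0-0". This is not the patch mentioned above; the `link_id_t` type, the function names, and the "master-link-position" format are all hypothetical choices made for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical link identifiers; the names are illustrative only. */
typedef enum { LINK_MAIN = 0, LINK_BACKUP = 1 } link_id_t;

static const char *link_name(link_id_t link)
{
    return link == LINK_BACKUP ? "backup" : "main";
}

/*
 * Format "<master>-<link>-<position>", e.g. "0-main-0" or "0-backup-0",
 * so two slaves at position 0 on different links get different labels.
 * Returns what snprintf returns: the length the full label needs.
 */
static int format_slave_label(char *buf, size_t len, unsigned master_index,
                              link_id_t link, unsigned position)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%u-%s-%u",
                    master_index, link_name(link), position);
}
```

With this format, the two slaves in the broken two-slave network would be labelled "0-main-0" and "0-backup-0" respectively.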

Below is an example of the syslog output when the slave0 -> slave1 link is
broken and the slave1 <- backup link needs to pick up the slack:

[ 1368.157824] e1000e: ecb0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
[ 1368.157829] ec_e1000e 0000:01:00.1: (unregistered net_device): 10/100 speed: disabling TSO
[ 1368.157831] EtherCAT 0: Link state of ecb0 changed to UP.
[ 1368.157960] EtherCAT WARNING 0: Domain 0: Redundant link in use!

On both master and slave the LINK/ACT lights are lit on the redundant ports
both before and after this event (it's a two-port adapter, in case that
makes a difference), so I'm not sure why the driver is announcing a link-up
at this time instead of earlier. In case it helps, this is the initial
output when the master is loaded:

[ 3620.561200] EtherCAT: 1 master waiting for devices.
[ 3635.431476] ec_e1000e: EtherCAT-capable Intel(R) PRO/1000 Network Driver - 1.5.1-k-EtherCAT
[ 3635.431479] ec_e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
[ 3635.431501] ec_e1000e 0000:01:00.0: Disabling ASPM L1
[ 3635.431520] ec_e1000e 0000:01:00.0: setting latency timer to 64
[ 3635.431606] ec_e1000e 0000:01:00.0: irq 41 for MSI/MSI-X
[ 3635.604415] EtherCAT: Accepting 68:05:CA:0A:99:18 as main device for master 0.
[ 3635.748669] ec_e1000e 0000:01:00.0: irq 41 for MSI/MSI-X
[ 3635.804370] ec_e1000e 0000:01:00.0: (unregistered net_device): MSI interrupt test failed, using legacy interrupt.
[ 3635.804398] ec_e1000e 0000:01:00.0: (unregistered net_device): (PCI Express:2.5GT/s:Width x4) 68:05:ca:0a:99:18
[ 3635.804401] ec_e1000e 0000:01:00.0: (unregistered net_device): Intel(R) PRO/1000 Network Connection
[ 3635.804476] ec_e1000e 0000:01:00.0: (unregistered net_device): MAC: 0, PHY: 4, PBA No: D50868-008
[ 3635.804487] ec_e1000e 0000:01:00.1: Disabling ASPM L1
[ 3635.804500] ec_e1000e 0000:01:00.1: setting latency timer to 64
[ 3635.804581] ec_e1000e 0000:01:00.1: irq 41 for MSI/MSI-X
[ 3635.980331] EtherCAT: Accepting 68:05:CA:0A:99:19 as backup device for master 0.
[ 3636.124622] ec_e1000e 0000:01:00.1: irq 41 for MSI/MSI-X
[ 3636.180287] ec_e1000e 0000:01:00.1: (unregistered net_device): MSI interrupt test failed, using legacy interrupt.
[ 3636.180315] EtherCAT DEBUG 0: ORPHANED -> IDLE.
[ 3636.180316] EtherCAT 0: Starting EtherCAT-IDLE thread.
[ 3636.180363] ec_e1000e 0000:01:00.1: (unregistered net_device): (PCI Express:2.5GT/s:Width x4) 68:05:ca:0a:99:19
[ 3636.180366] EtherCAT DEBUG 0: Idle thread running with send interval = 4000 us, max data size=45000
[ 3636.180369] ec_e1000e 0000:01:00.1: (unregistered net_device): Intel(R) PRO/1000 Network Connection
[ 3636.180446] ec_e1000e 0000:01:00.1: (unregistered net_device): MAC: 0, PHY: 4, PBA No: D50868-008
[ 3637.806692] e1000e: ecm0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
[ 3637.806696] ec_e1000e 0000:01:00.0: (unregistered net_device): 10/100 speed: disabling TSO
[ 3637.806699] EtherCAT 0: Link state of ecm0 changed to UP.
[ 3637.814759] EtherCAT 0: 2 slave(s) responding on main device.
[ 3637.814762] EtherCAT 0: Slave states on main device: INIT, SAFEOP + ERROR.
[ 3637.818835] EtherCAT DEBUG 0: Sending broadcast-write to measure transmission delays on main link.
[ 3637.818887] EtherCAT DEBUG 0: 2 slaves responded to delay measuring on main link.
[ 3637.818888] EtherCAT 0: Scanning bus.
[ 3637.818890] EtherCAT DEBUG 0: Scanning slave 0 on main link.

I would expect it to report that the ecb0 link is also up at this point,
even though it doesn't yet need to talk to any slaves via that link because
the main link is sufficient. Instead the link-up isn't reported until a
network break actually occurs, which is too late if I want a smooth
transition. ("ethercat slaves -v" reports that the last slave thinks the
backup link is up as well.)

Also possibly of interest is that if I disconnect/reconnect the backup link
while the main link is still working normally (even after the first fault),
then the link LEDs change as you'd expect but there is no syslog output in
either case.
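
One way to cross-check this from the application side is to watch each device's link flag independently and print every transition, then compare the output against syslog. The sketch below is only the bookkeeping part, and the `link_watch_*` names are hypothetical. It assumes the per-device flags are obtained elsewhere, for example from the `link_up` field filled in by `ecrt_master_link_state()` (an assumption to verify against your ecrt.h). If the driver never notices the change, such a query would presumably return the stale value too, which would at least help confirm where the problem lies.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_DEVICES 2   /* assumed ordering: index 0 = main, 1 = backup */

/* Hypothetical helper: last link state seen for each master device. */
typedef struct {
    int link_up[NUM_DEVICES];   /* 0 or 1 once sampled, -1 = unknown */
} link_watch_t;

static void link_watch_init(link_watch_t *w)
{
    assert(w != NULL);
    for (int i = 0; i < NUM_DEVICES; i++)
        w->link_up[i] = -1;
}

/*
 * Compare the current per-device flags with the previous sample and print
 * one line per device that changed (the very first sample counts as a
 * change, so the initial state is printed once).
 * Returns a bitmask with bit i set if device i changed.
 */
static unsigned link_watch_update(link_watch_t *w,
                                  const int now_up[NUM_DEVICES])
{
    unsigned changed = 0;

    assert(w != NULL && now_up != NULL);
    for (int i = 0; i < NUM_DEVICES; i++) {
        int up = now_up[i] ? 1 : 0;
        if (w->link_up[i] != up) {
            printf("device %d (%s): link %s\n", i,
                   i == 0 ? "main" : "backup", up ? "UP" : "DOWN");
            changed |= 1u << i;
            w->link_up[i] = up;
        }
    }
    return changed;
}
```

Because each device is tracked separately, a backup link going down while the main link is healthy shows up immediately instead of only after the main link breaks.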
Any hints where in the code I should be looking to resolve this? I've had a
look around but can't see anything obvious -- it looks like it should be
checking the link whenever e1000_watchdog_task is called, which should be
whenever ec_poll is called, which should be whenever ecrt_master_receive is
called, which should be all the time. Unless there's some quirk about it
being a dual-port board? It does work once the main link breaks somewhere
though.