[etherlab-dev] [PATCH] A whole lotta patchin' goin' on

Tue Mar 17 08:31:29 CET 2015

On 13 March 2015 16:12, quoth I:
> gavinl-1011-e1000e_watchdog:
>   This resolves an issue with the e1000e driver that I mentioned earlier --
> when wired for cable redundancy, the second port didn't establish link until
> the network broke, causing an unacceptable delay in failover to the
> redundant connection.  Turns out the problem was that the port watchdog has
> the job of detecting link up/down and the watchdog was not run if the port
> was receiving packets, even if it didn't think it had a link.  (With
> redundant wiring, it would transmit on the main link and receive back on the
> backup link, resetting the backup link's watchdog each time so that it never
> ran.)  This patch removes the reset of watchdog on receive, so that the
> watchdog runs every 2 seconds regardless.
>   I haven't checked the other network drivers to see if they're similarly
> afflicted.

After a bit more testing, I need to revise this patch.  It causes ~450us of
extra delay inside ecrt_master_receive whenever the 2 second timer hits,
which I think we can all agree is a bad thing.

On looking closer at the older kernel versions, I noticed that in 2.6.35
and earlier the watchdog task was scheduled to a kernel worker thread,
while in 2.6.37 and later it was changed to run directly on the master
application thread.  Does anyone recall what the reason for this change
was, or whether it was accidental?  It seems to have happened in
commit c350fc89afd7ac6bb64b706bbc333df5e53e3d2f.

(Note that prior to this patch, on all versions, the watchdog task would
simply never execute as long as the port was receiving packets, meaning that
the stats calculations and other housekeeping tasks that seem to be part of
it didn't get performed; I'm not familiar enough with the driver/hardware
internals to know whether this is a good thing or not.  Given the cyclic
nature of EtherCAT, there is rarely a time that ports stop receiving
packets.)
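
As a rough illustration of why the watchdog never ran, here is a small
user-space model of a 2 second watchdog whose deadline is pushed out on
every received frame.  It is only a sketch: count_watchdog_runs() and its
parameters are hypothetical names invented for this example and are not
taken from the e1000e driver or from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model, not driver code.  Time is in ms. */
#define WATCHDOG_PERIOD_MS 2000

/*
 * Count how often the watchdog fires during duration_ms when a frame is
 * received every rx_interval_ms.  If reset_on_rx is true, each received
 * frame pushes the watchdog deadline out by a full period (the pre-patch
 * behaviour described above); otherwise receive leaves the deadline alone.
 */
int count_watchdog_runs(int duration_ms, int rx_interval_ms, bool reset_on_rx)
{
    assert(duration_ms >= 0 && rx_interval_ms > 0);

    int runs = 0;
    int deadline = WATCHDOG_PERIOD_MS;

    for (int now = 1; now <= duration_ms; now++) {
        if (now >= deadline) {
            runs++;  /* the link state would be checked here */
            deadline = now + WATCHDOG_PERIOD_MS;
        }
        if (reset_on_rx && now % rx_interval_ms == 0)
            deadline = now + WATCHDOG_PERIOD_MS;  /* receive re-arms timer */
    }
    return runs;
}
```

With a frame arriving every millisecond over a 10 second window,
count_watchdog_runs(10000, 1, true) returns 0, while
count_watchdog_runs(10000, 1, false) returns 5.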

In the revised patch (attached), I've chosen to continue running the
watchdog every 2 seconds even if RX happens (which fixes redundancy) but
I've moved the watchdog work back to the worker thread (on 2.6.37+) to avoid
holding up ecrt_master_receive.  There is a slight race with the timer reset
as a result (it doesn't take the time required to run the watchdog task into
account) but as this is 2 seconds vs. ~500us that seems reasonably safe --
and it's what happened in the older kernel versions as well.
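
The trade-off described above can be sketched with another user-space
model.  Everything here is an assumption made for illustration: struct
port_model, poll_port() and worker_run() are hypothetical names, the
"worker" is just a function called later rather than a real kernel work
item, and the 450 us cost is simply the figure quoted earlier in this
message.  The attached patch is the authoritative version.

```c
#include <assert.h>
#include <stdbool.h>

#define WATCHDOG_PERIOD_US 2000000L  /* 2 s period, from the text above */
#define WATCHDOG_COST_US       450L  /* approximate cost quoted above */

/* Hypothetical model of one port's watchdog state. */
struct port_model {
    long next_due_us;   /* when the watchdog should next be triggered */
    bool work_pending;  /* stands in for a queued work item */
    int  work_runs;     /* how many times the watchdog work has run */
};

void port_init(struct port_model *p)
{
    assert(p);
    p->next_due_us = WATCHDOG_PERIOD_US;
    p->work_pending = false;
    p->work_runs = 0;
}

/*
 * Called from the cyclic (realtime) path.  Returns the microseconds spent
 * on watchdog work in the caller's context.  The timer is re-armed when
 * the watchdog is triggered, not when the work finishes, which is the
 * slight race mentioned above.
 */
long poll_port(struct port_model *p, long now_us, bool defer)
{
    assert(p);
    if (now_us < p->next_due_us)
        return 0;                   /* not due yet */
    p->next_due_us = now_us + WATCHDOG_PERIOD_US;
    if (defer) {
        p->work_pending = true;     /* hand off to the worker */
        return 0;
    }
    p->work_runs++;                 /* run inline, stalling the caller */
    return WATCHDOG_COST_US;
}

/* Called from the (modelled) worker context at some later point. */
void worker_run(struct port_model *p)
{
    assert(p);
    if (p->work_pending) {
        p->work_pending = false;
        p->work_runs++;
    }
}
```

In this model the deferred variant never charges the cyclic caller for the
watchdog work; the cost is paid whenever worker_run() is next called.  The
inline variant charges the full WATCHDOG_COST_US to the caller once per
period.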

I did consider an alternate patch which still avoids calling the watchdog if
the port is receiving data, but I'm not convinced there's value in avoiding
the "link_up" work in the watchdog task, especially when it's being done on
a worker thread.  Perhaps someone more familiar with this could enlighten
me?

-------------- next part --------------
A non-text attachment was scrubbed...
Name: gavinl-1011-e1000e_watchdog.patch
Type: application/octet-stream
Size: 15123 bytes
Desc: not available
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20150317/3affbc47/attachment.obj>