[etherlab-dev] ethercat-1.5: Various issues

Fri Jun 13 10:28:31 CEST 2014

Quoth Frank Heckenbach:
> [2] We were basically told, e1000 is broken, don't use it, use
>     r8169. Well, my experience was different, we tried r8169, but
>     also had some problems with it (I didn't debug them further),
>     whereas e1000, after fixing this bug (patch #02) worked and
>     still works very reliably for us.

I'm using a mix of r8169 and e1000e.  There were some bugs in e1000e when I
started, but they've been resolved now (that's one of the patches that did
make it into mainline).

I haven't looked at your e1000 patch yet, but maybe it made it in too?  The
1.5.x history does show some fixes to those drivers.

> The former shouldn't stop compilation (except for some warnings), the
> latter (used before introduction) would be a mistake on my side.
> If they cause problems, let me know which ones, and I'll try to
> rearrange my patches.

It's not a big deal.  The patches in question are related anyway so it's
unlikely they'd be applied piecemeal; I just wanted to have each patch in a
separate commit for tracking purposes.

> > I'm not really sure what's going on with the default branch, but as
> > best I can tell it's outdated and should be ignored.  All the new
> > changes are on the stable-1.5 branch.  (There hasn't been anything
> > committed to "default" since 2011.)
> 
> Uhm, did I get that right? The code obtained by the command described on
> the website as:
> 
> : The following command can be used to clone the repository in order
> : to get the latest revisions:
> : hg clone http://hg.code.sf.net/p/etherlabmaster/code ethercat-hg
> 
> That's not the "latest revisions", in fact it's even older code than the
> last few releases?

Correct.  It's vitally important to also do the next command mentioned,
which switches to the stable-1.5 branch.  (It also says the default branch
contains the "development version", which one might assume is more recent,
but it's not.  Click on the online repository browsing link below and look
at the dates on each branch.)

> There was a single reply
> (http://lists.etherlab.org/pipermail/etherlab-users/2011/001276.html)
> in which Florian basically told me that he's very busy, and that he'll
> "check and include" my patches. (3 years later I think it's fair to say
> that only one of those statements turned out true.) My other bug reports
> for some of the problems addressed in my later patches
> (http://lists.etherlab.org/pipermail/etherlab-users/2011/001272.html)
> went completely unanswered.

Yeah, I'm not sure what's up with that either.  I've also sent in several
patches which appear to have been largely ignored (though one or two did
make it in), and I know of others in the same boat.  Maybe they're really
really busy.

> However, seeing the current direction the code is going (which I now
> know is 1.5.2, not the hg version), it doesn't seem very interesting to
> me, so I guess my platform will remain 1.5.0 plus my patches.[3]

Well, 1.5.2 is also in hg, it's just in a different branch.  And there are
definitely many improvements in 1.5.2 over 1.5.0 (though possibly in areas
that don't affect you).  And there is still the occasional commit going into
it (a couple last month), so IgH *are* updating it, just possibly not with
the patches we might want as quickly as we want.

> As I mentioned, I will probably do some work on our EtherCAT application
> this year, but this will be (probably) comparably easy stuff with
> (hopefully) no new bugs found and (quite certainly) not requiring a new
> EtherCAT version. So you might not hear from me about that at all on the
> lists, and in fact, when that's done, I'll probably unsubscribe from the
> lists.

While you're free to do so, of course, I hope you don't.  Part of the beauty
of open source is that even if the original developers get distracted for a
while or even abandon something entirely (which I must point out again is
not the case here), the users of it can still share patches and keep it
updated.

I'm planning to set up a forked repository on SF consisting of the current
1.5.2 plus several of the patches I've submitted in the past, in the hopes
that maybe it'll be easier for IgH to do an hg pull rather than applying a
patch from a mailing list -- or failing that anyone who wants that version
could just use it as an alternate repository.

> >  - there's a very large number of "overriding mailbox check = 0"
> > "buffering mailbox response" "overriding mailbox check = 1" "fetching
> > mailbox response" sequences.  Does this just happen for every mailbox 
> > exchange or is it significant (eg. showing an out-of-order response)?
> 
> Yes, that's normal. It shows that my patch is working.

It's probably a little too spammy to stay like that long-term.  It should
log only when something unusual happens (or require level 2, maybe).  It
also seems kinda annoying to go through the trouble of buffering the
response in the happy-day case, when there's nothing pending and it's
already for the correct protocol.  If I'm understanding it correctly, that
costs at least two extra memcpys and one extra state machine cycle.  But I
suppose that's a side effect of trying to shoehorn it into the low level
without altering the higher-level state machines.

I changed several of these to level 2 in my local copy. :)
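
As a rough illustration of that kind of level gating (a sketch only, not the
actual master code; `debug_level` and `dbg_log` are made-up names for this
example), the chatty messages would be tagged with level 2 and dropped
unless that level is requested:

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for whatever debug level setting is in effect. */
unsigned int debug_level = 1;

/* Print the message only if the configured level is at least `level`.
 * Returns 1 if the message was printed, 0 if it was suppressed. */
int dbg_log(unsigned int level, const char *fmt, ...)
{
    va_list args;

    assert(fmt != NULL);
    if (debug_level < level)
        return 0;               /* too detailed for the current setting */

    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    return 1;
}
```

With `debug_level` at 1, a call such as
`dbg_log(2, "buffering mailbox response\n")` stays quiet, while level 1
messages are still printed.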

> >  - I'm seeing a higher number of errors logged while fetching the SDO
> > dictionary than I recall happening beforehand ("invalid entry
> > description response" mostly).  Although if I run "ethercat sdos" then
> > I do see the correct information (presumably it's retrying?).
> 
> I don't think this should happen. Can you check what the incorrect data
> are (printed in the debug output, refer to the condition before that
> error message to see what's wrong)?

It looks like this was just a side effect of running the initial test on my
build VM (which uses the generic driver).  That setup was behaving a bit
flakier than I remember from the past, but I don't think that's related to
your patches.  Running it on a real "EtherCAT ready" machine resolves these.

> >  - there's some very suspicious timeout warnings "timed out datagram
> > xxx, index 00 waited 790269982 us."  (the time does seem to be the
> > same most of the time)
> 
> Looks like an unset datagram got here. This would explain index 00, and
> (if it had cycles_send or jiffies_sent == 0) might explain the strange
> timing.
> 
> Either an unset datagram is queued in ec_master_queue_datagram, or a
> datagram already queued is overwritten somewhere (which might be a more
> serious problem). If you want to debug it, you can start there.

Ah.  I found a spot in ec_master_queue_datagram where I had incorrectly
applied patch 11 (and jiffies_sent would have been 0).  I've been
sidetracked a little and haven't had a chance to re-test this, but I expect
it will solve the issue; thanks for the hint!
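
To show why a never-stamped datagram would produce such an odd wait time,
here is a minimal sketch.  Only the name `jiffies_sent` comes from the
discussion above; the struct layout, the function and the `hz` parameter are
assumptions made for the example, and the real master may compute this
differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down datagram holding only the fields needed here. */
struct sketch_datagram {
    uint8_t index;
    unsigned long jiffies_sent; /* stays 0 if the send path never set it */
};

/* Microseconds waited, given the current tick count and ticks per second. */
uint64_t sketch_waited_us(const struct sketch_datagram *d,
                          unsigned long now_jiffies, unsigned int hz)
{
    unsigned long delta;

    assert(d != NULL);
    assert(hz != 0);

    /* Unsigned subtraction stays correct if the tick counter wraps. */
    delta = now_jiffies - d->jiffies_sent;
    return (uint64_t)delta * 1000000u / hz;
}
```

If `jiffies_sent` was never set, the result is simply the time since the
tick counter started rather than the time since the datagram was sent, so
the reported wait is enormous.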

(Part of the side-track suggested that patch 26 might not be sufficient to
solve that problem, but I haven't confirmed that yet, and it'll probably be
a few days before I get a chance to check it again.  And of course it's
possible that this was just another error on my part, or affected by the
above goof.)

> > > I admit I'm not very proficient with hg, so I probably mixed up the
> > > commits. I'd have to read up on it and dig deeper, maybe you're
> > > faster at it. In any case, the cloned code (see above) does contain
> > > lock_cb and unlock_cb in place of send_cb and receive_cb as in 1.5.0
> > > and 1.5.2. So I figure, Florian made this change, but hasn't pushed
> > > it into a release yet.
> >
> > I haven't traced the history, but I suspect that (as this was on the
> > default branch), this was the code before send_cb and receive_cb were
> > introduced in the first place.  So changing it back would presumably
> > be a regression.
> 
> Well, from my point of view, of course, the 1.5.0/1.5.2 (and as it now
> seems, current) code is a regression from 1.4.0 for two reasons:
[...]
> I still don't know the motivation for this change, perhaps it's to let
> the callbacks skip execution if they want (i.e., if no lock can be
> obtained). But that's still no reason for this unscalable approach.

So, I looked into the history now and I'm even more confused.

The callbacks were "improved" (that's pretty much all that the log message
says, but these were the send/receive callbacks) on Jul 13 2009 in changeset
1500:ed1a733efbc5 on the default branch.  In changeset 2024:96e2ae6cce95 (on
Dec 16 2010) there's a change from send/receive callbacks back to
lock/unlock callbacks (which is probably the one you saw).  Changeset
2063:06f3292f5b71 (on May 12 2011) created the stable-1.5 branch but it
appears to have bypassed that particular commit (and quite a few others, if
I'm reading this right).  I'm not sure why this was.  Another odd thing is
that for quite a while similar but separate commits appear to have been made
to both default and stable-1.5 instead of merging (which might make merging
tricky later).  That's going by the log messages; I haven't checked the
diffs.  Commits to default then just stopped with 2415:af21f0bdc7c9 in Sep
2012.

So in a way we're both right -- the callbacks did get changed back to
lock/unlock but for some reason that didn't end up in 1.5.

I don't know why the default branch has seemingly been abandoned/shelved;
that's something only IgH can answer.