[etherlab-dev] ethercat-1.5: Various issues

Frank Heckenbach f.heckenbach at fh-soft.de
Thu Jun 12 19:45:27 CEST 2014


Gavin Lambert wrote:

> (Also
> there's a few cases where variables introduced in one patch aren't used
> until a later patch, and the reverse, so some of the intermediate states
> won't compile.)

The former shouldn't stop compilation (except for causing some
warnings); the latter (used before introduction) would be a mistake
on my part. If they cause problems, let me know which ones, and I'll
try to rearrange my patches.

> > Indeed, it doesn't, but the development code as obtained by the "hg
> > clone" command as listed on http://etherlab.org/de/ethercat/index.php
> > does. Though it uses
> > master->io_mutex instead of master->io_sem now. For what I can tell
> > (seeing that 1.5.0 doesn't have io_mutex at all), an intermediate
> > version introduced the code with io_sem (which I then took) and was
> > later changed to io_mutex along with other changes. (This all might not
> > be very relevant except to explain where I got the code from; otherwise
> > you can treat it like it was one of my patches, I guess.)
> 
> I'm not really sure what's going on with the default branch, but as best I
> can tell it's outdated and should be ignored.  All the new changes are on
> the stable-1.5 branch.  (There hasn't been anything committed to "default"
> since 2011.)

Uhm, did I get that right? The code obtained by the command
described on the website as:

: The following command can be used to clone the repository in order
: to get the latest revisions:
: hg clone http://hg.code.sf.net/p/etherlabmaster/code ethercat-hg

is not the "latest revisions" at all, but in fact even older code
than the last few releases?

Well, if that's so, that's quite the bummer!

(And it won't help with integration of my patches, since I've
sometimes looked at this code for the general direction to take, in
the two cases discussed and a few others. Now if that turns out to
be exactly the wrong direction, well ...)

Alright, I've been wondering for a long time whether I should post
the following, but I think you (and other interested parties) ought
to know about it; not as a kind of dirty laundry, just so you know
what you might be in for when you deal with my patches:

As I mentioned previously, I did report the first few of the
problems and sent patches back in 2011
(http://lists.etherlab.org/pipermail/etherlab-users/2011/001271.html).
Unfortunately, all of my patches, even back then, were completely
ignored[1], including one for a serious bug in the e1000 driver
which rendered it basically unusable[2].

There was a single reply
(http://lists.etherlab.org/pipermail/etherlab-users/2011/001276.html)
in which Florian basically told me that he was very busy and that he
would "check and include" my patches. (Three years later I think
it's fair to say that only one of those statements turned out to be
true.) My other bug reports for some of the problems addressed in my
later patches
(http://lists.etherlab.org/pipermail/etherlab-users/2011/001272.html)
went completely unanswered.

We even tried to contact IgH by phone, but were basically told that
their developers were too busy with other things and not very
interested in our application anyway, so this went nowhere either.

Meanwhile we had a project to complete, so without any support (and
with no interest from others (like you) on the lists back then), I
just did what I needed to do on my own to finish the project (late
but not too late ;) and then left it at that.

I think several of the issues I fixed are serious bugs in the code,
and when I occasionally peeked at the lists and saw people like Jun
Yuan who apparently had similar problems, I decided to send my
patches now, hopefully for your benefit.

However, seeing the direction the code is currently going in (which
I now know is 1.5.2, not the hg version), it doesn't seem very
interesting to me, so I guess my platform will remain 1.5.0 plus my
patches.[3]

As I mentioned, I will probably do some work on our EtherCAT
application this year, but this will (probably) be comparably easy
stuff with (hopefully) no new bugs found and (quite certainly) no
need for a new EtherCAT version. So you might not hear from me about
that on the lists at all, and in fact, when that's done, I'll
probably unsubscribe from the lists.

So this leaves my patches somewhere in limbo. I won't maintain or
port them to new versions[4], and if IgH still doesn't care about
them now, you might be stuck with them as an unofficial fork too
(like I am).

I just think you should know this so you can decide what to do.

[1] except for a trivial change relating to newer kernels
    (ethercat-1.5-header.patch) which they may have made without my
    patch because it was obvious

[2] We were basically told: e1000 is broken, don't use it, use
    r8169. Well, my experience was different: we tried r8169, but
    also had some problems with it (which I didn't debug further),
    whereas e1000, after fixing this bug (patch #02), worked and
    still works very reliably for us.

[3] I might just need drivers for newer kernels, but it now seems
    easier to "backport", or perhaps just copy them from 1.5.2 etc.
    into my 1.5.0 tree.

[4] More precisely, my current client will not pay me to do so (for
    good reason: there would be no direct benefit to him, and
    actually some risk; you know, never change a running system,
    etc.)

>  - there's a very large number of "overriding mailbox check = 0" "buffering
> mailbox response" "overriding mailbox check = 1" "fetching mailbox response"
> sequences.  Does this just happen for every mailbox exchange or is it
> significant (eg. showing an out-of-order response)?

Yes, that's normal. It shows that my patch is working.

>  - I'm seeing a higher number of errors logged while fetching the SDO
> dictionary than I recall happening beforehand ("invalid entry description
> response" mostly).  Although if I run "ethercat sdos" then I do see the
> correct information (presumably it's retrying?).

I don't think this should happen. Can you check what the incorrect
data are? They are printed in the debug output; refer to the
condition before that error message to see what exactly is wrong.
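
If it helps, a hex dump right next to that error message would show
the raw data. This is only a debugging sketch (not from my patches);
'data' and 'rec_size' stand for whatever the surrounding code in
ec_fsm_coe.c calls the received mailbox buffer and its length:

  /* Debugging sketch: dump the mailbox data that failed the check
   * right before the "invalid entry description response" message.
   * Plain kernel API, so it should work in any master version. */
  #include <linux/printk.h>
  #include <linux/types.h>

  static void coe_dump_response(const u8 *data, size_t rec_size)
  {
      print_hex_dump(KERN_DEBUG, "coe rx: ", DUMP_PREFIX_OFFSET,
                     16, 1, data, rec_size, true);
  }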

The retrying from my patches should happen at a lower level, before
it gets there. However, there is a high-level retry here too (which
is probably why you get correct information in the end). I'm not
sure why it does that; maybe there are other reasons for invalid
responses here which were handled before and are independent of my
changes, though I don't know what those reasons would be, and I
don't see why my changes should make them more likely.

>  - there's some very suspicious timeout warnings "timed out datagram xxx,
> index 00 waited 790269982 us."  (the time does seem to be the same most of
> the time)

Looks like an unset datagram got here. This would explain index 00,
and (if it had cycles_sent or jiffies_sent == 0) might explain the
strange timing.

Either an unset datagram is queued in ec_master_queue_datagram, or a
datagram already queued is overwritten somewhere (which might be a
more serious problem). If you want to debug it, you can start there.
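
If you want a concrete starting point, a check like the following at
the top of ec_master_queue_datagram() (in master/master.c) should at
least catch the "overwritten while queued" case. This is only a
sketch; the EC_DATAGRAM_* states exist, but adjust the warning
printout and field names to whatever your tree uses:

  /* Debugging sketch: warn (with a backtrace) if a datagram that is
   * still queued or on the wire is queued again, i.e. about to be
   * overwritten. Goes at the top of ec_master_queue_datagram(). */
  void ec_master_queue_datagram(ec_master_t *master, ec_datagram_t *datagram)
  {
      if (datagram->state == EC_DATAGRAM_QUEUED
              || datagram->state == EC_DATAGRAM_SENT) {
          printk(KERN_WARNING "EtherCAT: re-queuing datagram %p"
                 " (state %u, index 0x%02x)\n", datagram,
                 (unsigned int) datagram->state, datagram->index);
          dump_stack(); /* shows which FSM/caller did it */
      }
      /* ... existing queuing code continues unchanged ... */
  }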

> > I admit I'm not very proficient with hg, so I probably mixed up the
> > commits. I'd have to read up on it and dig deeper, maybe you're faster
> > at it. In any case, the cloned code (see above) does contain lock_cb and
> > unlock_cb in place of send_cb and receive_cb as in 1.5.0 and 1.5.2. So I
> > figure, Florian made this change, but hasn't pushed it into a release
> > yet.
> 
> I haven't traced the history, but I suspect that (as this was on the default
> branch), this was the code before send_cb and receive_cb were introduced in
> the first place.  So changing it back would presumably be a regression.

Well, from my point of view, of course, the 1.5.0/1.5.2 (and, as it
now seems, current) code is a regression from 1.4.0 for two reasons:

- Code structure: The "configurable" thing about the callbacks is
  not the send/receive part (that part is exactly prescribed, see
  the comment in ecrt.h), but the "rest", i.e. locking or similar.
  So it "feels strange" that calls which are logically integral to
  the caller are delegated to callbacks (as those comments
  prescribe), instead of letting the callbacks deal with what they
  should really be concerned about.

- Scalability: The version in 1.4.0 (and my patches) needs two
  callbacks, lock and unlock. The 1.5.x version needs N callbacks,
  one for each operation. The assumption was apparently that N == 2
  (send and receive), so it wouldn't matter, but as I explained in
  my initial mail, many more operations need to be locked. I think
  it's obvious that having to provide a callback for each of them is
  not viable. (See the sketch after this list.)

  Of course, a completely different locking regime might be
  conceivable which doesn't require that, but I haven't seen one
  yet. What I've seen in 1.5.x is simply broken (especially with
  RTAI, but not only there, as I explained), so it's not an
  alternative for me. (As usual, locking bugs manifest themselves
  rarely and unreproducibly, which is probably why the problem
  wasn't noticed more often; but if we take that as a reason not to
  do locking properly, we might as well remove all locks completely,
  right?)
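
To make the scalability point concrete, this is roughly what the two
interfaces look like (from memory; check ecrt.h of the respective
version for the exact prototypes, this is only meant to show the
shape):

  /* 1.4.0 style (and my patches): the application provides locking
   * only, so any number of master operations can be protected by
   * the same pair of callbacks. */
  static int  app_request_lock(void *cb_data) { /* take the lock */ return 0; }
  static void app_release_lock(void *cb_data) { /* release it again */ }

  /* 1.5.x style: one callback per operation that has to be
   * serialized; every further operation that needs locking would
   * need yet another callback of its own. */
  static void app_send_cb(void *cb_data)    { /* lock; ecrt_master_send_ext(); unlock */ }
  static void app_receive_cb(void *cb_data) { /* lock; ecrt_master_receive(); unlock */ }

  /* registration, respectively (signatures from memory):
   *   ecrt_master_callbacks(master, app_request_lock, app_release_lock, cb_data);
   *   ecrt_master_callbacks(master, app_send_cb, app_receive_cb, cb_data);
   */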

I still don't know the motivation for this change; perhaps it's to
let the callbacks skip execution if they want to (i.e., if no lock
can be obtained). But that's still no reason for this unscalable
approach. An alternative would be a kind of try-lock callback in the
style of pthread_mutex_trylock (refer to its manpage for the
semantics in case of doubt). And I'm not even sure of the usefulness
of that -- in my experience, from low-level to high-level code, I
almost never use try-locking, because if I did, the upper level
would usually just have to wait and retry. So it's easier to combine
this with the locking, i.e. pthread_mutex_lock, which waits
automatically, or, as in request_lock_callback in my patch #22, to
do the waiting close to the locking. (In ec_master, maybe the FSMs
could keep running when they can't send/receive datagrams, but what
would they do in that time? Just cycle around waiting for the
datagrams to be sent and received. So it seems easier to wait
directly.)
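
To illustrate that last point in userspace terms (this is just the
pthread analogy, not the actual master API): with a try-lock the
caller ends up waiting and retrying anyway, so the waiting may as
well live right next to the locking, as in request_lock_callback of
my patch #22:

  #include <pthread.h>
  #include <unistd.h>

  static pthread_mutex_t master_lock = PTHREAD_MUTEX_INITIALIZER;

  /* try-lock style: the caller has to poll and retry itself */
  static void caller_with_trylock(void)
  {
      while (pthread_mutex_trylock(&master_lock) != 0)
          usleep(100);                  /* i.e. it waits anyway */
      /* ... access the master ... */
      pthread_mutex_unlock(&master_lock);
  }

  /* blocking style: the waiting happens right at the locking */
  static void caller_with_lock(void)
  {
      pthread_mutex_lock(&master_lock); /* waits until available */
      /* ... access the master ... */
      pthread_mutex_unlock(&master_lock);
  }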

That said, I'm not opposed to a try-lock approach, if this is
favoured for some reason. But I am opposed to a buggy approach. ;)

> > I don't know about your application, but in my case, I only want to run
> > with the correct (configured) number of slaves anyway.
> > Therefore, in my cyclic code I always call ecrt_master_state and check
> > that in the result link_up is set and slaves_responding is equal to the
> > number I expect; otherwise I abort immediately because something's
> > fishy.
> 
> Mine is a bit more flexible; I'm trying to get it to support hotplugging as
> cleanly as possible if I can, along with a bit of autodetection.

OK, to me the world of realtime industrial communication is about
the polar opposite of hotplugging systems. ;) But I guess it depends
on the application. Anyway, communication will not survive a rescan:
pending SDO transfers will be aborted, PDOs might be remapped, and
with EoE not only the connections but the virtual EoE devices
themselves will be reset (and possibly come back with a different
name), etc. That's the case with and without my patches.
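
(For reference, the check I described in the quoted text above boils
down to something like this in the cyclic code; EXPECTED_SLAVES is
just a placeholder for whatever number the application is configured
for:)

  #include "ecrt.h"

  #define EXPECTED_SLAVES 12  /* example value, application-specific */

  /* returns 0 if the bus looks as configured, -1 otherwise */
  static int check_bus(ec_master_t *master)
  {
      ec_master_state_t ms;

      ecrt_master_state(master, &ms);
      if (!ms.link_up || ms.slaves_responding != EXPECTED_SLAVES)
          return -1;  /* something's fishy, abort immediately */
      return 0;
  }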

> > ec_fsm_coe_exec could return 0 in 1.5.0 already. The return value is
> > only used by those callers that go on to do other things after calling
> > it (to return early if 0); those that call it at their end don't (need
> > to) check the result, and I think that's ok.
> > 
> > One of my changes was that ec_fsm_master_exec doesn't always return
> > 1 after executing a state (it still does in 1.5.2). It's a consequence
> > of my previous changes: If CoE is reserved by the slave FSM, the master
> > FSM must wait. In order to do that I set
> > datagram->state to EC_DATAGRAM_INVALID, and prevent it from being
> > queued. Returning 0 from ec_fsm_master_exec is an easy way to achieve
> > this (for both callers) since the 0 return (the first branch in
> > ec_fsm_master_exec) was already there in case the FSM is still waiting
> > for a reply and doesn't execute a state at all (and therefore obviously
> > doesn't send a new datagram either).
> 
> But in 1.5.2 ec_fsm_coe_exec has "datagram_used" and could return 0 when the
> datagram was in INIT, QUEUED, or SENT, which I assumed would occur in some
> of the middle states while it was waiting for something to happen, which
> didn't sound that different from what you were trying to do.

I don't know. The point of my patch #27 was to sequentialize CoE
access by the "master" and "slave" FSMs. In 1.5.0 this wasn't done
at all, so datagrams wouldn't be in one of those states for that
reason; they'd just be sent out and get conflicting answers. Maybe
this problem is fixed in 1.5.2 and the change in ec_fsm_coe_exec is
part of it (it's certainly not the main thing; there must be other
changes if so).
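
For context, the mechanism quoted above (from my patch #27) boils
down to roughly this (simplified, not verbatim from the patch):

  /* Simplified sketch of ec_fsm_master_exec() with my change: the
   * CoE states set datagram->state = EC_DATAGRAM_INVALID when the
   * mailbox is reserved by the slave FSM, and returning 0 tells
   * both callers not to queue anything this time. */
  int ec_fsm_master_exec(ec_fsm_master_t *fsm)
  {
      if (fsm->datagram->state == EC_DATAGRAM_SENT
              || fsm->datagram->state == EC_DATAGRAM_QUEUED)
          return 0;  /* still waiting for a reply; nothing new to queue */

      fsm->state(fsm);  /* may mark the datagram EC_DATAGRAM_INVALID */

      return fsm->datagram->state != EC_DATAGRAM_INVALID;
  }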

Regards,
Frank

-- 
Dipl.-Math. Frank Heckenbach <f.heckenbach at fh-soft.de>
Stubenlohstr. 6, 91052 Erlangen, Germany, +49-9131-21359
Systems Programming, Software Development, IT Consulting

