[etherlab-dev] ethercat-1.5: Various issues

Mon Jun 30 02:55:33 CEST 2014

On 27 June 2014, quoth Frank Heckenbach:
> The issues regarding io_sem (#18, #19, #20) apply to all code, RTAI or
> not.

True, but like I said the current 1.5.2 locking code appears quite different
and I wasn't sure how to integrate it properly, so it seemed safest to leave
it alone for now and let Florian examine that part.

> (I also didn't take most of your reformatting changes. In my patches, I
> tried to stick to the apparent coding style in 1.5.0; maybe it has
> changed in 1.5.2, but since my patches are, and possibly might remain,
> for 1.5.0, I'll stick to that style. I only took a few changes where my
> version indeed didn't match that style.)

That's funny, that's exactly the reason why I did the reformatting in the
first place. :)

> Sorry, I meant the reverse check (but again, only as a last resort), so 
> datagrams that are just a little "too new" aren't timed out:
>
>  jiffies_sent - jiffies_poll > timeout_jiffies

See discussion below, but I don't think this is appropriate.

> > 11-mailbox-buffer:
> >   Omits your change to time out QUEUED datagrams as well as SENT ones.
> > [Remainder of fix for invalid timeout values.]
> 
> I'm not sure that's the fix, or why this would produce such values (did
> you try to debug it as I described in my last mail?), so for now I'll
> stick to my code which works for me.

The previous code starts tracking timeouts only after the packets are SENT;
as such it assumes that {jiffies,cycles}_sent is always <=
{jiffies,cycles}_poll (in modulo space, ie. allowing for wraparound).  The
cases where I was getting odd durations were when that assumption was
violated, specifically when (a) _sent/_received were set to a value higher
than _poll at the time when a datagram was fetched from a buffer instead of
from the network (because "now" was higher than _poll), and (b) when
QUEUED-but-not-SENT datagrams were evaluated for timeout, and thus the _sent
value is not yet set (and could be undefined).

Case (a) was fixed by using _poll instead of the current time when fetching
datagrams from the buffer.  Case (b) was fixed by not trying to time out
QUEUED datagrams.

I removed that change because I don't think even in light of patch #16 it is
useful to test QUEUED datagrams for timeout, as by existing definition they
cannot start to time out until they are SENT.  (Yes, patch #16 may extend
the time they're queued for another cycle, but I don't think that is
important in practice unless they can somehow get "lost" and never sent,
which did not appear to be the case.)

In any case, if you want to be able to time out QUEUED datagrams as well,
you'll have to set the _sent time to the time the datagram is queued, not
the time that it is actually sent.  (Or possibly both, depending on how you
want the existing timeout values to be affected.)

> > 28-dictionary-fetched:
> >   Also added to ec_slave_info_t (for ecrt_master_get_slave).
> >   Bumped EC_IOCTL_VERSION_MAGIC to allow userspace apps to detect
> > incompatible versions.
> 
> OK, but I think you should also add the line to ecrt_master_get_slave in
> master/master.c (not only to ecrt_master_get_slave in lib/master.c).

Fair point.  I will do that.

(Incidentally, while testing another issue that I'll make a post about
shortly, I found it useful to be able to detect when SDO reading of a
particular slave was in progress in addition to when it was completed, so I
extended this a little.  I'm not currently planning to keep those extra
changes though.)

> > 30-debug-level:
> >   New patch; alters the debug level checks from patch 11 to level 2
> > from level 1 for the messages printed in the common case, to reduce
> > log flooding at level 1.  (I made this a separate patch because the
> > original behaviour can be useful when testing.)
> 
> In my version, I made this change in #11 directly. (When testing, you
> can increase the debug level; having more different versions of the
> patches makes them even more unmaintainable than they apparently already
> are.)

I meant that given how much more verbose debug level 2 is than 1, it can be
useful when testing the behaviour of patch #11 itself to have it print the
information at debug level 1.  But then once you're happy that it's working
as expected, the "normal case" messages should be moved to level 2 to make
level 1 more useful.  That's why I put it into separate patches, so that the
intermediate state can be tested.  (You may note that I left some of the
messages at level 1 still, since they indicate that something unusual
happened and this is useful to know in operation.)  Maybe we just need some
additional debug levels in the end. :)

I already have the issue at debug level 1 that the kernel log's ringbuffer
sometimes fills up and messages get dropped.  Trying to make sense of things
at debug level 2 where that would happen even more frequently would quickly
be too painful to be viable.