[etherlab-dev] ethercat-1.5: Various issues

Thu Jun 26 04:14:00 CEST 2014

On 26 June 2014, quoth Frank Heckenbach:
> OK. (I suppose you mean ec_slave_datagram_from_buffer instead of
> ec_slave_buffer_to_datagram.)

Yes, sorry.

> - Switching from wrap-around to real arithmetic makes things clearer
>   (calculating with wrap-around is very confusing; writing them out
>   allows for normal mathematical reasoning):

It is indeed very confusing, but your explanation makes sense.  Thank you.

> - In your previous mail you said the timeout was 790269982us most
>   of the time?! You must have done something different there,
>   because with the above assumptions (using jiffies, HZ=250, 32 bit)
>   the maximum possible value of that calculation is
>   (2^32 - 1) / 250 = 17179869.

I did see both values; I don't recall at the moment exactly which
circumstances triggered each one.

> But I'm not sure how this can happen: Either due to a race condition
> e.g. with ecrt_master_send called by ec_cdev_ioctl_send or
> ec_master_send_ext etc. (which may have been the case when I made this
> particular change; I don't remember if I fixed the locking bugs before
> or after it; but you have applied my locking patches and done proper
> locking in your application, haven't you?) or indirectly through device-
> >poll() (that's some complicated code paths, can't see it right now). To
> check the latter, you could debug-print jiffies again after device-
> >poll() in ec_device_poll.

I have not applied your locking patches as yet, partly because the 1.5.2
code seems quite different and I'm not sure how to correctly apply them, and
because most of them seem to be focused on locking for RTAI applications,
and (at least at the moment) I'm using a userspace app, which has a somewhat
different locking model.

I've also left out most of your EoE-focused patches, because I don't have
any EoE devices at the moment so I couldn't test them anyway.

> If it's something else (which I don't see ATM), you could make it wrap-
> around-safe by changing
> 
>   jiffies_poll > jiffies_sent
> 
> to
> 
>   jiffies_poll - jiffies_sent > timeout_jiffies
> 
> and likewise with cycles. (But I wouldn't recommend this without finding
> the root cause of the problem.)

That's exactly what the existing (pre-patch) code already does though, which
is why I said that your added check was redundant with the wraparound-safe
version.  (Removing it did, however, cause those weird timeout warnings,
until I made those other changes I've already mentioned, and detailed
below.)

> One reason would be patch #16 which means datagrams can be queued and
> not sent for a while (instead of corrputing other frame data).
> If so, the change would be in the wrong patch, of course. I'm not 100%
> sure at the moment if something in #11 doesn't also require it, but I
> think #16 is reason enough, so I don't have to check #11 in detail, do
> I?

It didn't seem to me that #16 needed it, but maybe I missed something.

I've attached the complete set of your patches that I currently have
applied, including my modifications both for bringing them up to 1.5.2 and
to try to resolve some of the issues that I encountered, along with a few
cosmetic changes (mostly spacing).  It's possible that one of the patches I
didn't apply contained the "missing link", but they did seem reasonably
consistent on a read-through at least.

The differences from your patches:

04-string-download:
  Just cosmetic.

07-sdo-up-download:
  Also made similar changes to ecrt_master_sdo_download_complete.

08-mrproper:
  No changes.

09-mailbox-tag:
  Temporarily added a definition of EC_MBOX_NO_PROTOCOL to allow it to
compile without future patches.

10-mailbox-allocate-buffer:
  Removed the temporary definition since this patch has the "real" one.

11-mailbox-buffer:
  Changed ec_slave_datagram_to_buffer and ec_slave_datagram_from_buffer to
use uint8_t protocol.
  When fetching a mailbox response from the buffer, sets the sent/received
timestamps to {cycles,jiffies}_poll instead of the current time. [Partially
fixes an issue with invalid values printed during timeout checks.]
  Omits your change to time out QUEUED datagrams as well as SENT ones.
[Remainder of fix for invalid timeout values.]
  Omits {cycles,jiffies}_poll > {cycles,jiffies}_sent comparison when
checking timeouts.  [Not wraparound-safe.]
  Removed declaration of last_index variable, as not actually used up to
this patch.

12-fetch-check:
  Just moved due to 1.5.0/1.5.2 differences.

13-send-retry:
  Just cosmetic.

14-index-reuse:
  Added back declaration of last_index variable, as it's needed now.
  Otherwise just cosmetic/version differences.

16-frame-corruption:
  Just cosmetic/version differences.

25-output-stats:
  Just version differences.

26-clear-mailbox:
  Just cosmetic/version differences.

27-coe-lock:
  Just version differences.

28-dictionary-fetched:
  Also added to ec_slave_info_t (for ecrt_master_get_slave).
  Bumped EC_IOCTL_VERSION_MAGIC to allow userspace apps to detect
incompatible versions.

29-init-restart:
  No changes.

30-debug-level:
  New patch; alters the debug level checks from patch 11 to level 2 from
level 1 for the messages printed in the common case, to reduce log flooding
at level 1.  (I made this a separate patch because the original behaviour
can be useful when testing.)

With these patches applied against the current stable-1.5 head, the master
behaves more reliably in testing.  I have not noticed any further problems
thus far.  (Note that my testing is all from a userspace app with no EoE
slaves, which is why I didn't need some of your other patches to get things
stable.)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fheckenbach-gavinl.tar.bz2
Type: application/octet-stream
Size: 14359 bytes
Desc: not available
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20140626/8f20151d/attachment-0001.obj>