[etherlab-dev] ethercat-1.5: Various issues

Wed Jun 25 19:57:40 CEST 2014

> I did make some other intentional changes from the patch:
>   - I changed the "protocol" parameter of ec_slave_datagram_to_buffer and
> ec_slave_buffer_to_datagram to uint8_t from uint16_t (since that seems more
> consistent).

OK. (I suppose you mean ec_slave_datagram_from_buffer instead of
ec_slave_buffer_to_datagram.)

>   - I left out the cycles_poll > cycles_sent and jiffies_poll > jiffies_sent
> checks in the timeout checking, since as I noted before these would not be
> safe against wraparound.
>
> If I put the jiffies_poll > jiffies_sent check back in, I do not get these
> 17171869us timeouts; but I'm unconvinced it's safe to leave this check in.
>
> Incidentally, reversing the calculation:
>   time_us = (unsigned int)((jiffies_poll - jiffies_sent) * 1000000 / HZ);
> where time_us = 17171869 and HZ = 250, suggests that (poll-sent) is -2
> jiffies (17171869 * 250 / 1000000 ~= -2), but then plugging that forwards
> through the formula I'm not sure why time_us != 4294959296 (aka -8000)
> instead.  (I did notice that (uint32_t)-2 * (uint64_t)(1000000 / 250) =
> 17179869176000, which at least has the right digits but they're at the wrong
> magnitude.)  I suspect I'm missing something fundamental (and obvious).

- Actually it's not the same digits, there's a slight (but
  important) difference.

- Switching from wrap-around to real arithmetic makes things clearer
  (calculating with wrap-around is very confusing; writing them out
  allows for normal mathematical reasoning):

  Your first observation says:

  (17171869 * 250 - 2^32) / 1000000 ~= -2
  i.e.
  (2^32 - 2000000) / 250 ~= 17171869

  Then you say:

  (2^32 - 2) * (1000000 / 250) = 17179869176000
  i.e.
  (2^32 * 1000000 - 2000000) / 250 = 17179869176000

  So you see where the factor 1000000 comes from (2^32, i.e. the
  wrap-around, being the dominant term here). The different digit is
  actually the result of the factor not being applied to the second
  summand.

  OTOH,
  (2^32 - 2 * 1000000) / 250 ~= 17171869

  i.e. in C:
  (uint32_t)-2 * (uint32_t)1000000 / (uint32_t)250 == 17171869

- In your previous mail you said the timeout was 790269982us most
  of the time?! You must have done something different there,
  because with the above assumptions (using jiffies, HZ=250, 32 bit)
  the maximum possible value of that calculation is
  (2^32 - 1) / 250 = 17179869.

- Side note: The (original) code might be improved by casting to
  uint64_t before the multiplication by 1000000. Without it, with 32
  bit long, the range of poll-sent is limited to ~4000 jiffies
  (which is still quite a bit, but if exceeded, makes things
  confusing as we see). With the cast, the range is at least 4000
  seconds (though it still doesn't help much in case of overflow).

- Instead of second-guessing I think it would be much easier to just
  debug-print the interesting variables to see what values they
  actually have, including get_cycles()/jiffies.

ec_device_poll is called shortly before the timeout message which
sets jiffies_poll, so I wonder how jiffies_sent can be newer
(larger), so the difference would be negative. So I suppose
jiffies_poll < jiffies_sent <= jiffies which means jiffies_sent was
set between the call to ec_device_poll and the timeout message.

But I'm not sure how this can happen: Either due to a race condition
e.g. with ecrt_master_send called by ec_cdev_ioctl_send or
ec_master_send_ext etc. (which may have been the case when I made
this particular change; I don't remember if I fixed the locking bugs
before or after it; but you have applied my locking patches and done
proper locking in your application, haven't you?) or indirectly
through device->poll() (that's some complicated code paths, can't
see it right now). To check the latter, you could debug-print
jiffies again after device->poll() in ec_device_poll.

I vaguely remember it might be something like this, so I was
probably wrong when I said the other day the additional checks were
about wrap-around. But indeed, as you pointed out, they're not
wrap-around-safe.

To fix this: If it's actually device->poll(), it might suffice to
update jiffies_poll/cycles_poll after, rather than before, it. If
it's a race condition, this needs to be fixed obviously.

If it's something else (which I don't see ATM), you could make it
wrap-around-safe by changing

  jiffies_poll > jiffies_sent

to

  jiffies_poll - jiffies_sent > timeout_jiffies

and likewise with cycles. (But I wouldn't recommend this without
finding the root cause of the problem.)

> Spoken slightly too soon.  Looks like the strange-timeout datagrams still
> occur, but now only if "deferring concurrent CoE access" occurred just
> prior.
> 
> Which is odd, as those datagrams are marked as EC_DATAGRAM_INVALID, and
> those don't get queued, and even if one somehow got onto the queue it should
> be ignored by the timeout loop as that explicitly checks the state.

Again, it would help if you just debug-print the state to check if
that's actually what happens.

> On a hunch, I noticed that one of the other changes in patch 11 was to make
> the timeout loop consider QUEUED datagrams in addition to SENT ones; I
> removed that change and it appeared to stop the spurious timeouts.  What
> prompted this change, and is it actually needed?

One reason would be patch #16 which means datagrams can be queued
and not sent for a while (instead of corrputing other frame data).
If so, the change would be in the wrong patch, of course. I'm not
100% sure at the moment if something in #11 doesn't also require it,
but I think #16 is reason enough, so I don't have to check #11 in
detail, do I?

(Needless to say I didn't develop the patches precisely in order;
I tried to separate them when I had a working version, might have
mis-assigned a few things; like with the problem of variables used
before being introduced as you reported.)

Regards,
Frank

-- 
Dipl.-Math. Frank Heckenbach <f.heckenbach at fh-soft.de>
Stubenlohstr. 6, 91052 Erlangen, Germany, +49-9131-21359
Systems Programming, Software Development, IT Consulting