[etherlab-dev] ethercat-1.5: Various issues
Frank Heckenbach
f.heckenbach at fh-soft.de
Wed Jun 25 19:57:40 CEST 2014
> I did make some other intentional changes from the patch:
> - I changed the "protocol" parameter of ec_slave_datagram_to_buffer and
> ec_slave_buffer_to_datagram to uint8_t from uint16_t (since that seems more
> consistent).
OK. (I suppose you mean ec_slave_datagram_from_buffer instead of
ec_slave_buffer_to_datagram.)
> - I left out the cycles_poll > cycles_sent and jiffies_poll > jiffies_sent
> checks in the timeout checking, since as I noted before these would not be
> safe against wraparound.
>
> If I put the jiffies_poll > jiffies_sent check back in, I do not get these
> 17171869us timeouts; but I'm unconvinced it's safe to leave this check in.
>
> Incidentally, reversing the calculation:
> time_us = (unsigned int)((jiffies_poll - jiffies_sent) * 1000000 / HZ);
> where time_us = 17171869 and HZ = 250, suggests that (poll-sent) is -2
> jiffies (17171869 * 250 / 1000000 ~= -2), but then plugging that forwards
> through the formula I'm not sure why time_us != 4294959296 (aka -8000)
> instead. (I did notice that (uint32_t)-2 * (uint64_t)(1000000 / 250) =
> 17179869176000, which at least has the right digits but they're at the wrong
> magnitude.) I suspect I'm missing something fundamental (and obvious).
- Actually it's not the same digits, there's a slight (but
important) difference.
- Switching from wrap-around to real arithmetic makes things clearer
(calculating with wrap-around is very confusing; writing them out
allows for normal mathematical reasoning):
Your first observation says:
(17171869 * 250 - 2^32) / 1000000 ~= -2
i.e.
(2^32 - 2000000) / 250 ~= 17171869
Then you say:
(2^32 - 2) * (1000000 / 250) = 17179869176000
i.e.
(2^32 * 1000000 - 2000000) / 250 = 17179869176000
So you see where the factor 1000000 comes from (2^32, i.e. the
wrap-around, being the dominant term here). The different digit is
actually the result of the factor not being applied to the second
summand.
OTOH,
(2^32 - 2 * 1000000) / 250 ~= 17171869
i.e. in C:
(uint32_t)-2 * (uint32_t)1000000 / (uint32_t)250 == 17171869
- In your previous mail you said the timeout was 790269982us most
of the time?! You must have done something different there,
because with the above assumptions (using jiffies, HZ=250, 32 bit)
the maximum possible value of that calculation is
(2^32 - 1) / 250 = 17179869.
- Side note: The (original) code might be improved by casting to
uint64_t before the multiplication by 1000000. Without it, with 32
bit long, the range of poll-sent is limited to ~4000 jiffies
(which is still quite a bit, but if exceeded, makes things
confusing as we see). With the cast, the range is at least 4000
seconds (though it still doesn't help much in case of overflow).
- Instead of second-guessing I think it would be much easier to just
debug-print the interesting variables to see what values they
actually have, including get_cycles()/jiffies.
ec_device_poll is called shortly before the timeout message which
sets jiffies_poll, so I wonder how jiffies_sent can be newer
(larger), so the difference would be negative. So I suppose
jiffies_poll < jiffies_sent <= jiffies which means jiffies_sent was
set between the call to ec_device_poll and the timeout message.
But I'm not sure how this can happen: Either due to a race condition
e.g. with ecrt_master_send called by ec_cdev_ioctl_send or
ec_master_send_ext etc. (which may have been the case when I made
this particular change; I don't remember if I fixed the locking bugs
before or after it; but you have applied my locking patches and done
proper locking in your application, haven't you?) or indirectly
through device->poll() (that's some complicated code paths, can't
see it right now). To check the latter, you could debug-print
jiffies again after device->poll() in ec_device_poll.
I vaguely remember it might be something like this, so I was
probably wrong when I said the other day the additional checks were
about wrap-around. But indeed, as you pointed out, they're not
wrap-around-safe.
To fix this: If it's actually device->poll(), it might suffice to
update jiffies_poll/cycles_poll after, rather than before, it. If
it's a race condition, this needs to be fixed obviously.
If it's something else (which I don't see ATM), you could make it
wrap-around-safe by changing
jiffies_poll > jiffies_sent
to
jiffies_poll - jiffies_sent > timeout_jiffies
and likewise with cycles. (But I wouldn't recommend this without
finding the root cause of the problem.)
> Spoken slightly too soon. Looks like the strange-timeout datagrams still
> occur, but now only if "deferring concurrent CoE access" occurred just
> prior.
>
> Which is odd, as those datagrams are marked as EC_DATAGRAM_INVALID, and
> those don't get queued, and even if one somehow got onto the queue it should
> be ignored by the timeout loop as that explicitly checks the state.
Again, it would help if you just debug-print the state to check if
that's actually what happens.
> On a hunch, I noticed that one of the other changes in patch 11 was to make
> the timeout loop consider QUEUED datagrams in addition to SENT ones; I
> removed that change and it appeared to stop the spurious timeouts. What
> prompted this change, and is it actually needed?
One reason would be patch #16 which means datagrams can be queued
and not sent for a while (instead of corrputing other frame data).
If so, the change would be in the wrong patch, of course. I'm not
100% sure at the moment if something in #11 doesn't also require it,
but I think #16 is reason enough, so I don't have to check #11 in
detail, do I?
(Needless to say I didn't develop the patches precisely in order;
I tried to separate them when I had a working version, might have
mis-assigned a few things; like with the problem of variables used
before being introduced as you reported.)
Regards,
Frank
--
Dipl.-Math. Frank Heckenbach <f.heckenbach at fh-soft.de>
Stubenlohstr. 6, 91052 Erlangen, Germany, +49-9131-21359
Systems Programming, Software Development, IT Consulting
More information about the Etherlab-dev
mailing list