[etherlab-dev] Problems with Xenomai

Thu Sep 29 01:21:22 CEST 2016

On 29 September 2016 03:07 quoth Christoph Schröder,
> #1.)
> Starting with the tarball release 1.5.2 and encountered a problem with
> ecrt_master_reference_clock_time which led to a segmentation fault. My
> DC config here is basically the same as in the rtai_rtdm_dc example with
> minor fixes since I am not using RTAI. The rest is based on the xenomai
> example. The problem seems to be fixed in the mercurial repo (tested
> 5a70ffc4644b for later tests of the patch queue) and I would like to know
> which commit fixed this issue. Unfortunately I can't find the point where
> the release 1.5.2 was taken from since the changelog messages do not
> correspond to the commit messages and there is no Label for the release.
> 
> This is my debugging output:
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff7fd8700 (LWP 4389)]
> 0x00007ffff68d53ca in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
> (gdb) backtrace
> #0  0x00007ffff68d53ca in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007ffff68daa00 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x00007ffff68d553e in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x00007ffff68e0188 in fprintf () from /lib/x86_64-linux-gnu/libc.so.6
> #4  0x00007ffff7bd8944 in ecrt_master_reference_clock_time (
>      master=<optimized out>, time=<optimized out>) at master.c:717

Given that stack trace, and that it works on default but not on 1.5.2, the
commit that most likely worked around the issue for you is this one:
https://sourceforge.net/p/etherlabmaster/code/ci/3affe9cd0b66fe55ef8e8060778ef9461a8204a0

Having said that, the only reason I can think of for a segfault at that
point is strerror() returning NULL or an invalid pointer, which suggests
that you might have a broken or badly configured libc.  If you're building
the libc yourself, make sure that you're using an up-to-date version and
haven't excluded the strerror text.

Another possibility: if you were concurrently calling strerror() on another
thread (and your libc doesn't implement strerror in a thread-local manner),
the shared buffer could have been corrupted.  Most likely another patch
would be required to resolve this "properly", although one workaround is to
avoid calling ecrt_* APIs from more than one thread.
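
As a rough illustration of the thread-safety point (this is not the fix in
the repository), the error text can be formatted into a caller-owned buffer
with strerror_r() instead of relying on strerror()'s possibly shared buffer.
The helper names describe_errno() and report_failure() are made up for this
sketch, and a POSIX libc is assumed:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* ask for strerror_r() on strict compilers */
#endif
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the EtherCAT library): writes the text
 * for errnum into the caller's buffer and returns buf, so that no state is
 * shared between threads. */
static const char *describe_errno(int errnum, char *buf, size_t len)
{
    if (buf == NULL || len == 0)
        return "(no buffer)";
    buf[0] = '\0';
#if defined(__GLIBC__) && defined(_GNU_SOURCE)
    /* GNU variant: returns a pointer that may or may not be buf. */
    const char *msg = strerror_r(errnum, buf, len);
    if (msg == NULL)
        snprintf(buf, len, "unknown error %d", errnum);
    else if (msg != buf)
        snprintf(buf, len, "%s", msg);
#else
    /* XSI variant: returns 0 on success, an error indication otherwise. */
    if (strerror_r(errnum, buf, len) != 0)
        snprintf(buf, len, "unknown error %d", errnum);
#endif
    return buf;
}

/* Hypothetical reporting wrapper: never hands a NULL string to fprintf(). */
static int report_failure(FILE *out, const char *what, int errnum)
{
    char text[128];
    return fprintf(out, "%s: %s\n", what,
                   describe_errno(errnum, text, sizeof text));
}
```

A call such as report_failure(stderr, "Failed to get reference clock time",
errno) would then touch only the calling thread's stack buffer.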

Although, since you're linking to RTDM, I suppose it's possible that
strerror() is coming from there rather than from the libc; I'm not exactly
sure how RTAI/Xenomai work.  It could also be that in that context the
fprintf(stderr) itself is failing, but this isn't new code, so I would have
expected the problem to have come up earlier if that were the case.

I'm not sure exactly which commit 1.5.2 is based on, but it will be one of
the ones in the "stable-1.5" branch.  Everything on "default" is newer than
that.

> #2.)
> I did some minor tests with the patch queue and got some bad system
> freezes with the xenomai example. I could locate the patch that seems to
> cause the system freezes:
> 0011-Master-locks-to-avoid-corrupted-datagram-queue.patch
> The only notable thing I could see in the kernel log is that the slaves
> went back to PREOP. The Xenomai task was still running and hanging at some
> point of the cycle (I placed an rt_printf in the cycle which should have
> printed the cycle_counter value every other second).
> The patch series seems to work if I apply the patches up to
> 0010-Sdo-directory-now-only-fetched-on-request.patch. Is this reproducible
> for you?

I'm not sure about this as I don't use Xenomai myself.  That particular
patch was authored by Knud Baastrup, so I've added him to the email chain
directly just in case.  If I recall correctly I think he, like myself, was
using PREEMPT_RT so it's possible that this has not been tested with
Xenomai.

Do you have locking on the Xenomai side as well?  Do you call ecrt APIs from
multiple Xenomai tasks?  I believe the patch assumes that there is no
external locking between tasks, so you might be running into deadlocks
depending on the order in which things happen.

Using Linux locks between Xenomai tasks is probably not ideal, but I would
have expected that it ought to work as this occurs in other places as well.
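
To make the deadlock concern concrete: if one task holds an application
lock while calling into code that takes the master's internal lock, and
another path takes the two in the opposite order, each side can end up
waiting for the other.  A common defence is a fixed lock hierarchy.  The
sketch below is a generic, single-threaded rank checker with made-up names
(RANK_APP, RANK_MASTER, rank_acquire); it does not reflect how the patch or
Xenomai actually implement locking.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ranks: locks must always be taken in ascending order. */
enum { RANK_APP = 1, RANK_MASTER = 2 };

#define MAX_HELD 8

/* Per-task record of the lock ranks currently held (used as a stack). */
typedef struct {
    int held[MAX_HELD];
    size_t depth;
} lock_tracker_t;

/* Returns true and records the rank if taking it respects the hierarchy;
 * returns false (recording nothing) on an order violation or overflow. */
static bool rank_acquire(lock_tracker_t *t, int rank)
{
    if (t->depth == MAX_HELD)
        return false;
    if (t->depth > 0 && t->held[t->depth - 1] >= rank)
        return false;   /* would invert the order: potential deadlock */
    t->held[t->depth++] = rank;
    return true;
}

/* Releases the most recently acquired rank; false if nothing is held. */
static bool rank_release(lock_tracker_t *t)
{
    if (t->depth == 0)
        return false;
    t->depth--;
    return true;
}
```

A task that takes RANK_APP and then RANK_MASTER passes the check, while one
that already holds RANK_MASTER and then asks for RANK_APP is flagged.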

> #3.)
> In both versions (1.5.2 and repository 5a70ffc4644b) I get a lost frame at
> startup. Is this anything to worry about?
> [Wed Sep 28 15:24:51 2016] EtherCAT 0: Master thread exited.
> [Wed Sep 28 15:24:51 2016] EtherCAT 0: Starting EtherCAT-OP thread.
> [Wed Sep 28 15:24:51 2016] EtherCAT WARNING 0: 1 datagram UNMATCHED!
> [Wed Sep 28 15:24:52 2016] EtherCAT 0: Domain 0: Working counter changed to 2/3.
> [Wed Sep 28 15:24:52 2016] EtherCAT 0: Slave states on main device: OP.

I don't think this is anything to worry about.  It's probably just that the
idle thread sent a request and then exited before the reply came back.  The
reply then sat in the buffers until the OP thread started, but by then the
request had either timed out or the state machines had been reset, so the
reply was no longer expected.
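
A toy model of why such a late reply ends up counted as unmatched: the
receiver looks up each incoming datagram index among the requests it still
considers pending, and anything that is no longer pending is merely counted.
The types and function names here are invented for illustration and are not
the master's actual data structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 4

typedef struct {
    uint8_t index[MAX_PENDING];   /* datagram indices awaiting a reply */
    bool in_use[MAX_PENDING];     /* which slots hold a live request */
    unsigned unmatched;           /* replies that nobody was waiting for */
} pending_table_t;

/* Remember that a request with this index was sent; false if table full. */
static bool pending_add(pending_table_t *t, uint8_t index)
{
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (!t->in_use[i]) {
            t->index[i] = index;
            t->in_use[i] = true;
            return true;
        }
    }
    return false;
}

/* Forget all outstanding requests, e.g. after a timeout or a restart. */
static void pending_clear(pending_table_t *t)
{
    for (size_t i = 0; i < MAX_PENDING; i++)
        t->in_use[i] = false;
}

/* Handle a reply: true if it matched a pending request, otherwise the
 * unmatched counter is incremented and false is returned. */
static bool pending_receive(pending_table_t *t, uint8_t index)
{
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (t->in_use[i] && t->index[i] == index) {
            t->in_use[i] = false;
            return true;
        }
    }
    t->unmatched++;
    return false;
}
```

In this model, sending index 8, clearing the table, and only then receiving
the reply for 8 bumps the unmatched counter once, which mirrors the single
warning in the log above.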

> #4.)
> Will there be a new release aka a new version of the EtherCAT master in
> the near future based on the patches?

I'm hoping so, but it's not up to me. :)  More feedback and sorting out
things like these Xenomai issues you've encountered may help to move towards
that though.