[etherlab-dev] Testing 1.5 branch: several issues
Frank Heckenbach
f.heckenbach at fh-soft.de
Sat Jun 18 06:13:26 CEST 2011
Hi,
(I'm writing to both lists, users and dev, because I'm not sure
which one I should use. The description for dev says: "This list is
used for communication between EtherLab developers." which I'm not,
but my previous bug reports in the users list were ignored.)
I noticed there is a stable-1.5 branch now, so I decided to try it.
I found several problems, some of which I could fix, some not.
For reference, I'm using a 2.6.24-16-rtai kernel and an e1000
network interface.
- Including semaphore.h from master/ethernet.h needs a kernel
version check, as is done in several other files
(ethercat-1.5-header.patch).
- The e1000 driver has the same problem I reported for 1.4.0
(http://lists.etherlab.org/pipermail/etherlab-users/2011/001190.html),
and the same patch fixes it (ethercat-1.5-e1000.patch).
A similar patch should probably also be applied to the other
kernel versions of the e1000 driver.
- Also, the problem with the debug interface during RTAI PDO
transfer still exists
(http://lists.etherlab.org/pipermail/etherlab-users/2011/001205.html),
although the behaviour is a little different: It doesn't give a
"Kernel BUG" anymore, but "Default Trap Handler: vector 6: Suspend
RT task f8840880" (and the cyclic task indeed gets suspended). But
the same patch as before fixes it
(ethercat-1.5-debug-disable.patch).
- "ethercat download" with a string type cuts the input string at
the first space (but the size is given correctly, so for the rest
of the string garbage is sent). This is due to the behaviour of
">>" and easily fixed using "read"
(ethercat-1.5-string-download.patch).
- As soon as I try to use EoE I get an error in syslog (usually
already when I just start the EoE devices the error appears every
few seconds, but certainly when I do anything with EoE such as
just "ping"):
BUG: scheduling while atomic: swapper/0/0x10000100
Pid: 0, comm: swapper Tainted: GF (2.6.24-16-rtai #1)
EFLAGS: 00000246 CPU: 0
EIP is at default_idle+0x27/0x39
EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
ESI: 00000000 EDI: 00000000 EBP: 007c4007 ESP: c033c140
DS: 0000 ES: 0000 FS: 0000 GS: 0000 SS: 0068
CR0: 8005003b CR2: 080f6008 CR3: 1f82a000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
i8042_panic_blink+0x0/0x129
When I comment out (just for debugging) the part from "down" to
"up" inclusively in ec_eoedev_tx(), the error goes away (but, of
course, EoE is non-functional). When I just add the "down" and
"up" statements back in (nothing in between), the error reappears.
So I suppose that the schedule() that might be called by down()
actually causes the problem because apparently ec_eoedev_tx() is
already called from atomic context (via hard_start_xmit), which
seems to be the case according to a quick Google search. When I
compared it with 1.4.0 which didn't have this problem, I noticed
that 1.4.0 used a spinlock rather than a semaphore here. So I
reverted the code back to a spinlock, and the problem went away
and EoE became usable. I don't know what was the motivation for
replacing the spinlock with a semaphore here, but at least in this
case, it seems to be wrong (ethercat-1.5-eoe.patch).
- examples/mini/ gives the same syslog error ("scheduling while
atomic") immediately upon "insmod". The error persists if I remove
everything from cyclic_task() except for a single down() and up()
call (but disappears if I remove these statements too).
I see that the cyclic task is registered with add_timer(). Indeed
according to http://www.makelinux.net/ldd3/chp-7-sect-4, such
timer callbacks run in atomic context (interrupt context in fact)
and therefore "Semaphores also must not be used since they can
sleep."
A solution might be to also revert the semaphore back to a
spinlock as it was in 1.4.0. However, I'm not sure if any of the
other functions called by cyclic_task() can sleep (or do anything
else that's forbidden in interrupt context). If so, it might be
easier to avoid the timer callback altogether and convert it to a
kernel thread with sleeps or so.
The same probably applies to examples/tty/ which I didn't test and
possibly to tty/module.c.
- I wonder whether the use of RTAI semaphores in the master
callbacks in the examples (e.g. examples/rtai/) is safe. (Since my
own application also uses RTAI, in the same style as those
examples, this is an important question for me, not just
hypothetical.) I found this message:
http://blog.gmane.org/gmane.linux.real-time.rtai/month=20020801
"Any nonblocking function can be used from Linux, typically
rt_sem_signal and rt_task_resume". The message is a bit dated, and
he doesn't strictly say that blocking functions cannot be used
from Linux (i.e., non-RTAI kernel modules), but it might be
implied. Therefore the use of rt_sem_wait() in the callbacks might
be problematic.
I admit I don't fully understand the differences between normal
and RTAI semaphores. I gather their function is identical, they
just interact with different schedulers (Linux vs. RTAI) when
necessary, i.e. when there is contention. Since the callbacks are
called from non-RTAI code (namely the EoE kernel task), this would
lead to problems in this case.
I thought that a solution would be to use the non-blocking
rt_sem_wait_if(), but I see that it also accesses RT_CURRENT, so
it might also be problematic. But I really have problems
understanding the RTAI documentation (it seems to be written in a
confusing way to me and often in bad English), so I'm not
completely sure what's allowed.
I also thought of a spinlock, but AFAICS it wouldn't work on a
single-core machine or whenever the cyclic task is scheduled on
the same CPU as the EoE task.
A safe alternative then might be to use a simple atomic counter to
implement our own "semaphore" which never blocks (i.e., only
provides a "try-lock" method, so the "unlock" method never needs
to wake up any waiting tasks, so it would be completely
independent of any scheduler). Since the callbacks are allowed to
do nothing if inconvenient, this seems to be valid for them. And
for the cyclic task, well, if the master is locked when the cycle
should run, we have lost anyway (the t_critical check should
prevent this from ever happening), so the best we can do then is
probably to log the problem and skip to the next cycle (or wait a
fraction of the cycle time and try again). I can try to implement
this, but first I'd like to know if this is actually needed, or if
there's a better solution, or if RTAI semaphores are actually safe
to use this way (if so, I'd appreciate a reference that confirms
this because so far I don't have this impression).
- Related to this: I see that the example code acquires and releases
the semaphore several times during one cycle (at least twice, more
often if the optional checks are done). I'm not sure that's a good
idea. Even if it doesn't need the master for a moment, I don't
think we want to allow another task to grab it until the cycle is
finished. (In general, fine-grained locking is a good idea, of
course, but here I think the hard-realtime constraints are more
important.)
Note that this is not prevented by the t_critical
check, since t_last_cycle is updated at the beginning of run(). If
it was updated at the end, which I'd also suggest, it should not
be possible for another task (via the callbacks) to get
master_sem, but then it's no problem for the cyclic task to hold
it for the whole cycle anyway.
The patch (ethercat-1.5-rtai-lock.patch, to be applied after
ethercat-1.5-debug-disable.patch) changes these two things,
without changing the RTAI semaphores yet.
This and the previous point probably also apply to
examples/dc_rtai/ which I didn't test.
- In my application I need the ability to read/write arbitrary SDOs
(from the non-realtime, but kernel-module part of the code, so
going through the cdev would be awkward; however at a time when
the master is already activated and running).
I thought I could just do an
ecrt_slave_config_create_sdo_request() when needed. I'm not sure
if this function is supposed to be used while the master is
running (I couldn't find a statement that forbids it anyway).
However, there is no corresponding "delete" function, so used-up
SDO requests would accumulate and leak.
I see three possible solutions:
- Implement ecrt_slave_config_delete_sdo_request() or such. Is
there more to it than basically doing
"list_del(&req->list, &sc->sdo_requests);" after appropriate
checks that the request is not busy etc.? And if done, would it
be possible/reasonable to use
ecrt_slave_config_create_sdo_request() while the master is
running?
- Allow changing the index and subindex of an existing request (so
I could create some requests on startup and reuse them for
arbitrary SDOs -- I only need a fixed number of them
simultaneously). This seems to match the TODO list item: "Change
SDO index at runtime for SDO request." Is there more to it than
calling ec_sdo_request_address() (again, after appropriate
checks)?
- Implement ecrt_master_sdo_{down,up}load() also for (non-RT)
kernel access. Of course, this could be implemented simply on
top of ecrt_slave_config_create_sdo_request() etc. if either of
the previous two solutions were implemented, but then the master
has to call ecrt_sdo_request_read() etc. (as in read_sdo() in
examples/mini/mini.c), so it would have to know about the SDO
requests which might be a problem in the general case (in my own
application probably not -- I know which SDO requests I have and
can let my master know about them).
However, I see that this is apparently not required for the way
the cdev does SDO transfers, so I tried to adjust this code for
in-kernel use and implemented ecrt_master_sdo_{down,up}load() in
slave_config.c. I'm not sure if it's intended to be used this
way, and I wonder especially about the usage of
ec_master_find_slave() (apparently the user-space code uses
different ways to identify a slave -- although the tool accepts
alias and position as command-line arguments, it only passes the
position to the ioctl; however, in the ec_slave_config_t struct
used in the kernel we have both alias and position; but since
ec_master_find_slave() accepts alias and position, this might be
alright). The code is not commented etc., since I'm not yet sure
if that's the right way to go
(ethercat-1.5-sdo-up-download.patch).
When testing it I found that it works most of the time, but
sometimes I get errors like the following:
EtherCAT ERROR 0-0: Received upload response for wrong SDO (...)
EtherCAT ERROR 0-0: Reception of CoE upload response failed: No response.
EtherCAT ERROR 0-0: Reception of CoE download response failed: No response.
This might be another case of mailbox contention (I'll talk
about this in another mail, since it seems to be a major topic
of its own), this time between the cyclic task's SDO access and
the new functions called from a normal kernel task (and
therefore would probably also occur between the former and the
cdev, since it uses the same mechanism).
PS: Of course, all my patches are released under the GPL, version 2
or any later version, and I hope they will be integrated in future
releases.
Regards,
Frank
--
Dipl.-Math. Frank Heckenbach <f.heckenbach at fh-soft.de>
Systemprogrammierung, EDV-Beratung
Stubenlohstr. 6, 91052 Erlangen, Deutschland
Tel.: +49-9131-21359
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ethercat-1.5-header.patch
Type: text/x-diff
Size: 422 bytes
Desc: not available
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20110618/d52047b4/attachment-0021.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ethercat-1.5-e1000.patch
Type: text/x-diff
Size: 874 bytes
Desc: not available
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20110618/d52047b4/attachment-0022.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ethercat-1.5-debug-disable.patch
Type: text/x-diff
Size: 3201 bytes
Desc: not available
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20110618/d52047b4/attachment-0023.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ethercat-1.5-string-download.patch
Type: text/x-diff
Size: 473 bytes
Desc: not available
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20110618/d52047b4/attachment-0024.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ethercat-1.5-eoe.patch
Type: text/x-diff
Size: 2660 bytes
Desc: not available
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20110618/d52047b4/attachment-0025.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ethercat-1.5-rtai-lock.patch
Type: text/x-diff
Size: 2569 bytes
Desc: not available
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20110618/d52047b4/attachment-0026.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ethercat-1.5-sdo-up-download.patch
Type: text/x-diff
Size: 5853 bytes
Desc: not available
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20110618/d52047b4/attachment-0027.patch>
More information about the Etherlab-dev
mailing list