[etherlab-dev] Testing 1.5 branch: several issues

Sat Jun 18 06:13:26 CEST 2011

Hi,

(I'm writing to both lists, users and dev, because I'm not sure
which one I should use. The description for dev says: "This list is
used for communication between EtherLab developers." which I'm not,
but my previous bug reports in the users list were ignored.)

I noticed there is a stable-1.5 branch now, so I decided to try it.
I found several problems, some of which I could fix, some not.

For reference, I'm using a 2.6.24-16-rtai kernel and an e1000
network interface.

- Including semaphore.h from master/ethernet.h needs a kernel
  version check, as is done in several other files
  (ethercat-1.5-header.patch).
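
A check of this kind typically looks like the sketch below. The
2.6.27 threshold is an assumption, based on when asm/semaphore.h
disappeared from mainline; the attached patch may differ in detail.

```c
#include <linux/version.h>

/* Pick the semaphore header that exists for the kernel being built. */
#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 27)
#include <asm/semaphore.h>   /* older kernels such as 2.6.24 */
#else
#include <linux/semaphore.h>
#endif
```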

- The e1000 driver has the same problem I reported for 1.4.0
  (http://lists.etherlab.org/pipermail/etherlab-users/2011/001190.html),
  and the same patch fixes it (ethercat-1.5-e1000.patch).

  A similar patch should probably also be applied to the other
  kernel versions of the e1000 driver.

- Also, the problem with the debug interface during RTAI PDO
  transfer still exists
  (http://lists.etherlab.org/pipermail/etherlab-users/2011/001205.html),
  although the behaviour is a little different: It doesn't give a
  "Kernel BUG" anymore, but "Default Trap Handler: vector 6: Suspend
  RT task f8840880" (and the cyclic task indeed gets suspended). But
  the same patch as before fixes it
  (ethercat-1.5-debug-disable.patch).

- "ethercat download" with a string type cuts the input string at
  the first space (but the size is given correctly, so for the rest
  of the string garbage is sent). This is due to the behaviour of
  ">>" and easily fixed using "read"
  (ethercat-1.5-string-download.patch).
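
The pitfall is not specific to C++ streams. As a rough C analogue
(not the tool's actual code, which uses ">>"), the "%s" conversion of
sscanf() also stops at the first whitespace, while a length-based
copy keeps the whole argument. The names read_as_token, read_as_raw
and BUF_SIZE are made up for this illustration.

```c
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 256

/* Token-style extraction: "%s" skips leading whitespace and stops at the
 * next whitespace, much like ">>" on a C++ stream. */
static size_t read_as_token(const char *input, char out[BUF_SIZE])
{
    out[0] = '\0';
    if (sscanf(input, "%255s", out) != 1)
        return 0;
    return strlen(out);
}

/* Raw copy: takes every byte of the argument, spaces included, similar in
 * spirit to reading with an explicit length. Truncates to the buffer. */
static size_t read_as_raw(const char *input, char out[BUF_SIZE])
{
    size_t len = strlen(input);

    if (len >= BUF_SIZE)
        len = BUF_SIZE - 1;
    memcpy(out, input, len);
    out[len] = '\0';
    return len;
}
```

For the input "hello world", read_as_token() yields "hello" (5 bytes),
whereas read_as_raw() yields the full 11 bytes.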

- As soon as I try to use EoE I get an error in syslog. Usually it
  already appears every few seconds once I just start the EoE
  devices, and it certainly appears when I do anything with EoE,
  even just a "ping":

    BUG: scheduling while atomic: swapper/0/0x10000100

    Pid: 0, comm: swapper Tainted: GF       (2.6.24-16-rtai #1)
    EFLAGS: 00000246 CPU: 0
    EIP is at default_idle+0x27/0x39
    EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
    ESI: 00000000 EDI: 00000000 EBP: 007c4007 ESP: c033c140
     DS: 0000 ES: 0000 FS: 0000 GS: 0000 SS: 0068
    CR0: 8005003b CR2: 080f6008 CR3: 1f82a000 CR4: 000006d0
    DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    DR6: ffff0ff0 DR7: 00000400
    i8042_panic_blink+0x0/0x129

  When I comment out (just for debugging) the part from "down" to
  "up" inclusive in ec_eoedev_tx(), the error goes away (but, of
  course, EoE is non-functional). When I just add the "down" and
  "up" statements back in (nothing in between), the error reappears.

  So I suppose the problem is the schedule() that down() may call:
  apparently ec_eoedev_tx() is called from atomic context (via
  hard_start_xmit), which a quick Google search seems to confirm.
  When I compared it with 1.4.0, which didn't have this problem, I
  noticed that 1.4.0 used a spinlock rather than a semaphore here.
  So I reverted the code back to a spinlock, and the problem went
  away and EoE became usable. I don't know what the motivation was
  for replacing the spinlock with a semaphore here, but at least in
  this case it seems to be wrong (ethercat-1.5-eoe.patch).

- examples/mini/ gives the same syslog error ("scheduling while
  atomic") immediately upon "insmod". The error persists if I remove
  everything from cyclic_task() except for a single down() and up()
  call (but disappears if I remove these statements too).

  I see that the cyclic task is registered with add_timer(). Indeed
  according to http://www.makelinux.net/ldd3/chp-7-sect-4, such
  timer callbacks run in atomic context (interrupt context in fact)
  and therefore "Semaphores also must not be used since they can
  sleep."

  A solution might be to also revert the semaphore back to a
  spinlock as it was in 1.4.0. However, I'm not sure if any of the
  other functions called by cyclic_task() can sleep (or do anything
  else that's forbidden in interrupt context). If so, it might be
  easier to avoid the timer callback altogether and convert it to a
  kernel thread with sleeps or so.

  The same probably applies to examples/tty/ which I didn't test and
  possibly to tty/module.c.

- I wonder whether the use of RTAI semaphores in the master
  callbacks in the examples (e.g. examples/rtai/) is safe. (Since my
  own application also uses RTAI, in the same style as those
  examples, this is an important question for me, not just
  hypothetical.) I found this message:
  http://blog.gmane.org/gmane.linux.real-time.rtai/month=20020801
  "Any nonblocking function can be used from Linux, typically
  rt_sem_signal and rt_task_resume". The message is a bit dated, and
  he doesn't strictly say that blocking functions cannot be used
  from Linux (i.e., non-RTAI kernel modules), but it might be
  implied. Therefore the use of rt_sem_wait() in the callbacks might
  be problematic.

  I admit I don't fully understand the differences between normal
  and RTAI semaphores. I gather their function is identical, they
  just interact with different schedulers (Linux vs. RTAI) when
  necessary, i.e. when there is contention. Since the callbacks are
  called from non-RTAI code (namely the EoE kernel task), this would
  lead to problems in this case.

  I thought that a solution would be to use the non-blocking
  rt_sem_wait_if(), but I see that it also accesses RT_CURRENT, so
  it might also be problematic. But I really have problems
  understanding the RTAI documentation (it seems to be written in a
  confusing way to me and often in bad English), so I'm not
  completely sure what's allowed.

  I also thought of a spinlock, but AFAICS it wouldn't work on a
  single-core machine or whenever the cyclic task is scheduled on
  the same CPU as the EoE task.

  A safe alternative then might be to use a simple atomic counter to
  implement our own "semaphore" which never blocks (i.e., only
  provides a "try-lock" method, so the "unlock" method never needs
  to wake up any waiting tasks, so it would be completely
  independent of any scheduler). Since the callbacks are allowed to
  do nothing if inconvenient, this seems to be valid for them. And
  for the cyclic task, well, if the master is locked when the cycle
  should run, we have lost anyway (the t_critical check should
  prevent this from ever happening), so the best we can do then is
  probably to log the problem and skip to the next cycle (or wait a
  fraction of the cycle time and try again). I can try to implement
  this, but first I'd like to know if this is actually needed, or if
  there's a better solution, or if RTAI semaphores are actually safe
  to use this way (if so, I'd appreciate a reference that confirms
  this because so far I don't have this impression).
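
To make the idea concrete, here is a user-space sketch of such a
try-lock using C11 atomics. In the kernel one would presumably use
the atomic_t helpers instead. All names (master_trylock_t etc.) are
hypothetical and not part of the EtherCAT master or RTAI.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical non-blocking lock: 0 = free, 1 = held. */
typedef struct {
    atomic_int held;
} master_trylock_t;

static void master_trylock_init(master_trylock_t *lock)
{
    atomic_init(&lock->held, 0);
}

/* Returns true if the lock was taken. Never waits, so no scheduler
 * (Linux or RTAI) is ever involved. */
static bool master_trylock_acquire(master_trylock_t *lock)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(
        &lock->held, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Nobody can be sleeping on the lock, so releasing is a plain store
 * and there is nothing to wake up. */
static void master_trylock_release(master_trylock_t *lock)
{
    atomic_store_explicit(&lock->held, 0, memory_order_release);
}
```

A callback would simply return when master_trylock_acquire() fails;
the cyclic task would log the failure and skip to the next cycle.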

- Related to this: I see that the example code acquires and releases
  the semaphore several times during one cycle (at least twice, more
  often if the optional checks are done). I'm not sure that's a good
  idea. Even if it doesn't need the master for a moment, I don't
  think we want to allow another task to grab it until the cycle is
  finished. (In general, fine-grained locking is a good idea, of
  course, but here I think the hard-realtime constraints are more
  important.)

  Note that the t_critical check does not prevent this, since
  t_last_cycle is updated at the beginning of run(), so callbacks
  that arrive while the cycle is still running pass the check. If it
  were updated at the end, which I'd also suggest, it should not be
  possible for another task (via the callbacks) to get master_sem
  during the cycle. But then it's also no problem for the cyclic
  task to hold it for the whole cycle anyway.

  The patch (ethercat-1.5-rtai-lock.patch, to be applied after
  ethercat-1.5-debug-disable.patch) changes these two things,
  without changing the RTAI semaphores yet.

  This and the previous point probably also apply to
  examples/dc_rtai/ which I didn't test.
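
The effect of where t_last_cycle is updated can be illustrated with
a small user-space model. It assumes that the check in the callbacks
has the form "now - t_last_cycle <= t_critical", that t_critical is
shorter than the period, and that one cycle's work takes less time
than the remainder (period minus t_critical). The function name and
all numbers are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the check: a callback may touch the master only while at most
 * t_critical ticks have passed since the recorded time stamp. */
static bool callback_allowed(uint64_t now, uint64_t t_last_cycle,
                             uint64_t t_critical)
{
    return now - t_last_cycle <= t_critical;
}

/* Made-up timeline in ticks: the period is 1000, the previous cycle ran
 * from 0 to 100 and the current one started at 1000. */
enum {
    T_CRITICAL = 800,
    PREV_END   = 100,   /* stamp if updated at the end of the cycle */
    CUR_START  = 1000   /* stamp if updated at the start of the cycle */
};
```

With the stamp taken at CUR_START, a callback at tick 1050 is allowed
(50 <= 800) although the cycle is still running. With the stamp taken
at PREV_END it is refused (950 > 800), while a callback at tick 500,
between two cycles, is still allowed (400 <= 800).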

- In my application I need the ability to read/write arbitrary SDOs
  (from the non-realtime, but kernel-module part of the code, so
  going through the cdev would be awkward; however at a time when
  the master is already activated and running).

  I thought I could just do an
  ecrt_slave_config_create_sdo_request() when needed. I'm not sure
  if this function is supposed to be used while the master is
  running (I couldn't find a statement that forbids it anyway).
  However, there is no corresponding "delete" function, so used-up
  SDO requests would accumulate and leak.

  I see three possible solutions:

  - Implement ecrt_slave_config_delete_sdo_request() or such. Is
    there more to it than basically doing "list_del(&req->list);"
    after appropriate checks that the request is not busy etc.? And
    if done, would it be possible/reasonable to use
    ecrt_slave_config_create_sdo_request() while the master is
    running?

  - Allow changing the index and subindex of an existing request (so
    I could create some requests on startup and reuse them for
    arbitrary SDOs -- I only need a fixed number of them
    simultaneously). This seems to match the TODO list item: "Change
    SDO index at runtime for SDO request." Is there more to it than
    calling ec_sdo_request_address() (again, after appropriate
    checks)?

  - Implement ecrt_master_sdo_{down,up}load() also for (non-RT)
    kernel access. Of course, this could be implemented simply on
    top of ecrt_slave_config_create_sdo_request() etc. if either of
    the previous two solutions were implemented, but then the master
    has to call ecrt_sdo_request_read() etc. (as in read_sdo() in
    examples/mini/mini.c), so it would have to know about the SDO
    requests which might be a problem in the general case (in my own
    application probably not -- I know which SDO requests I have and
    can let my master know about them).

    However, I see that this is apparently not required for the way
    the cdev does SDO transfers, so I tried to adjust this code for
    in-kernel use and implemented ecrt_master_sdo_{down,up}load() in
    slave_config.c. I'm not sure if it's intended to be used this
    way, and I wonder especially about the usage of
    ec_master_find_slave(). Apparently the user-space code
    identifies a slave differently: although the tool accepts alias
    and position as command-line arguments, it only passes the
    position to the ioctl. The ec_slave_config_t struct used in the
    kernel has both alias and position, but since
    ec_master_find_slave() accepts alias and position, this might be
    alright. The code is not commented etc., since I'm not yet sure
    if that's the right way to go
    (ethercat-1.5-sdo-up-download.patch).

    When testing it I found that it works most of the time, but
    sometimes I get errors like the following:

    EtherCAT ERROR 0-0: Received upload response for wrong SDO (...)

    EtherCAT ERROR 0-0: Reception of CoE upload response failed: No response.

    EtherCAT ERROR 0-0: Reception of CoE download response failed: No response.

    This might be another case of mailbox contention (I'll talk
    about this in another mail, since it seems to be a major topic
    of its own), this time between the cyclic task's SDO access and
    the new functions called from a normal kernel task. If so, it
    would probably also occur between the former and the cdev, since
    the cdev uses the same mechanism.
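
For the second option above (retargeting an existing request), the
"appropriate checks" could be as small as the following user-space
mock. It is only a sketch: the types and the function
mock_sdo_request_retarget() are hypothetical stand-ins and not the
master's real ec_sdo_request_t or its API.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the states a request can be in. */
typedef enum { REQ_UNUSED, REQ_BUSY, REQ_SUCCESS, REQ_ERROR } req_state_t;

/* Hypothetical, heavily simplified stand-in for an SDO request. */
typedef struct {
    uint16_t index;
    uint8_t subindex;
    req_state_t state;
} mock_sdo_request_t;

/* Point an existing request at another SDO. Refuses while a transfer is
 * in flight, so the address never changes under a running transfer. */
static int mock_sdo_request_retarget(mock_sdo_request_t *req,
                                     uint16_t index, uint8_t subindex)
{
    if (req->state == REQ_BUSY)
        return -EBUSY;

    req->index = index;
    req->subindex = subindex;
    req->state = REQ_UNUSED;   /* old result no longer applies */
    return 0;
}
```

With a fixed pool of such requests created at startup, nothing would
have to be allocated or freed while the master is running.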

PS: Of course, all my patches are released under the GPL, version 2
or any later version, and I hope they will be integrated in future
releases.

Regards,
Frank

-- 
Dipl.-Math. Frank Heckenbach <f.heckenbach at fh-soft.de>
Systemprogrammierung, EDV-Beratung
Stubenlohstr. 6, 91052 Erlangen, Deutschland
Tel.: +49-9131-21359
Attachments (all text/x-diff):

- ethercat-1.5-header.patch (422 bytes)
  <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20110618/d52047b4/attachment-0021.patch>
- ethercat-1.5-e1000.patch (874 bytes)
  <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20110618/d52047b4/attachment-0022.patch>
- ethercat-1.5-debug-disable.patch (3201 bytes)
  <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20110618/d52047b4/attachment-0023.patch>
- ethercat-1.5-string-download.patch (473 bytes)
  <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20110618/d52047b4/attachment-0024.patch>
- ethercat-1.5-eoe.patch (2660 bytes)
  <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20110618/d52047b4/attachment-0025.patch>
- ethercat-1.5-rtai-lock.patch (2569 bytes)
  <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20110618/d52047b4/attachment-0026.patch>
- ethercat-1.5-sdo-up-download.patch (5853 bytes)
  <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20110618/d52047b4/attachment-0027.patch>