[etherlab-users] Testing 1.5 branch: several issues

Florian Pose fp at igh-essen.com
Mon Jun 27 15:29:45 CEST 2011


Hi Frank,

On Sat, Jun 18, 2011 at 06:13:26AM +0200, Frank Heckenbach wrote:
> (I'm writing to both lists, users and dev, because I'm not sure
> which one I should use. The description for dev says: "This list is
> used for communication between EtherLab developers." which I'm not,
> but my previous bug reports in the users list were ignored.)
> 
> I noticed there is a stable-1.5 branch now, so I decided to try it.
> I found several problems, some of which I could fix, some not.

First of all, thank you for your work. Sorry, that your mail was not
responded earlier, but we at IgH are quite busy at the moment.

> For reference, I'm using a 2.6.24-16-rtai kernel and an e1000
> network interface.
> 
> - Including semaphore.h from master/ethernet.h needs a kernel
>   version check, as is done in several other files
>   (ethercat-1.5-header.patch).

Thanks, I'll have a look at it and include it.

> - The e1000 driver has the same problem I reported for 1.4.0
>   (http://lists.etherlab.org/pipermail/etherlab-users/2011/001190.html),
>   and the same patch fixes it (ethercat-1.5-e1000.patch).
> 
>   A similar patch should probably also be applied to the other
>   kernel versions of the e1000 driver.
> 
> - Also, the problem with the debug interface during RTAI PDO
>   transfer still exists
>   (http://lists.etherlab.org/pipermail/etherlab-users/2011/001205.html),
>   although the behaviour is a little different: It doesn't give a
>   "Kernel BUG" anymore, but "Default Trap Handler: vector 6: Suspend
>   RT task f8840880" (and the cyclic task indeed gets suspended). But
>   the same patch as before fixes it
>   (ethercat-1.5-debug-disable.patch).

Im not sure if it makes sense to enable the debug interfaces, if not all
data are forwarded to it. Ome might get a wrong opinion about a system
when there are the most frequent frames missing.

> - "ethercat download" with a string type cuts the input string at
>   the first space (but the size is given correctly, so for the rest
>   of the string garbage is sent). This is due to the behaviour of
>   ">>" and easily fixed using "read"
>   (ethercat-1.5-string-download.patch).

Ok, you're right. Thanks.
 
> - As soon as I try to use EoE I get an error in syslog (usually
>   already when I just start the EoE devices the error appears every
>   few seconds, but certainly when I do anything with EoE such as
>   just "ping"):
> 
>     BUG: scheduling while atomic: swapper/0/0x10000100
> 
>     Pid: 0, comm: swapper Tainted: GF       (2.6.24-16-rtai #1)
>     EFLAGS: 00000246 CPU: 0
>     EIP is at default_idle+0x27/0x39
>     EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
>     ESI: 00000000 EDI: 00000000 EBP: 007c4007 ESP: c033c140
>      DS: 0000 ES: 0000 FS: 0000 GS: 0000 SS: 0068
>     CR0: 8005003b CR2: 080f6008 CR3: 1f82a000 CR4: 000006d0
>     DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
>     DR6: ffff0ff0 DR7: 00000400
>     i8042_panic_blink+0x0/0x129
> 
>   When I comment out (just for debugging) the part from "down" to
>   "up" inclusively in ec_eoedev_tx(), the error goes away (but, of
>   course, EoE is non-functional). When I just add the "down" and
>   "up" statements back in (nothing in between), the error reappears.
> 
>   So I suppose that the schedule() that might be called by down()
>   actually causes the problem because apparently ec_eoedev_tx() is
>   already called from atomic context (via hard_start_xmit), which
>   seems to be the case according to a quick Google search. When I
>   compared it with 1.4.0 which didn't have this problem, I noticed
>   that 1.4.0 used a spinlock rather than a semaphore here. So I
>   reverted the code back to a spinlock, and the problem went away
>   and EoE became usable. I don't know what was the motivation for
>   replacing the spinlock with a semaphore here, but at least in this
>   case, it seems to be wrong (ethercat-1.5-eoe.patch).

Ok, I'll check this and include your patch.

> - examples/mini/ gives the same syslog error ("scheduling while
>   atomic") immediately upon "insmod". The error persists if I remove
>   everything from cyclic_task() except for a single down() and up()
>   call (but disappears if I remove these statements too).
> 
>   I see that the cyclic task is registered with add_timer(). Indeed
>   according to http://www.makelinux.net/ldd3/chp-7-sect-4, such
>   timer callbacks run in atomic context (interrupt context in fact)
>   and therefore "Semaphores also must not be used since they can
>   sleep."
> 
>   A solution might be to also revert the semaphore back to a
>   spinlock as it was in 1.4.0. However, I'm not sure if any of the
>   other functions called by cyclic_task() can sleep (or do anything
>   else that's forbidden in interrupt context). If so, it might be
>   easier to avoid the timer callback altogether and convert it to a
>   kernel thread with sleeps or so.
> 
>   The same probably applies to examples/tty/ which I didn't test and
>   possibly to tty/module.c.

If the timer callback runs in interrupt context, it will be probably the
best solution tu use a spinlock.

> - I wonder whether the use of RTAI semaphores in the master
>   callbacks in the examples (e.g. examples/rtai/) is safe. (Since my
>   own application also uses RTAI, in the same style as those
>   examples, this is an important question for me, not just
>   hypothetical.) I found this message:
>   http://blog.gmane.org/gmane.linux.real-time.rtai/month=20020801
>   "Any nonblocking function can be used from Linux, typically
>   rt_sem_signal and rt_task_resume". The message is a bit dated, and
>   he doesn't strictly say that blocking functions cannot be used
>   from Linux (i.e., non-RTAI kernel modules), but it might be
>   implied. Therefore the use of rt_sem_wait() in the callbacks might
>   be problematic.
> 
>   I admit I don't fully understand the differences between normal
>   and RTAI semaphores. I gather their function is identical, they
>   just interact with different schedulers (Linux vs. RTAI) when
>   necessary, i.e. when there is contention. Since the callbacks are
>   called from non-RTAI code (namely the EoE kernel task), this would
>   lead to problems in this case.
> 
>   I thought that a solution would be to use the non-blocking
>   rt_sem_wait_if(), but I see that it also accesses RT_CURRENT, so
>   it might also be problematic. But I really have problems
>   understanding the RTAI documentation (it seems to be written in a
>   confusing way to me and often in bad English), so I'm not
>   completely sure what's allowed.
> 
>   I also thought of a spinlock, but AFAICS it wouldn't work on a
>   single-core machine or whenever the cyclic task is scheduled on
>   the same CPU as the EoE task.
> 
>   A safe alternative then might be to use a simple atomic counter to
>   implement our own "semaphore" which never blocks (i.e., only
>   provides a "try-lock" method, so the "unlock" method never needs
>   to wake up any waiting tasks, so it would be completely
>   independent of any scheduler). Since the callbacks are allowed to
>   do nothing if inconvenient, this seems to be valid for them. And
>   for the cyclic task, well, if the master is locked when the cycle
>   should run, we have lost anyway (the t_critical check should
>   prevent this from ever happening), so the best we can do then is
>   probably to log the problem and skip to the next cycle (or wait a
>   fraction of the cycle time and try again). I can try to implement
>   this, but first I'd like to know if this is actually needed, or if
>   there's a better solution, or if RTAI semaphores are actually safe
>   to use this way (if so, I'd appreciate a reference that confirms
>   this because so far I don't have this impression).

Did you encounter any problems with the RTAI example? It is rarely
modified and uses this mechanism quite for a while. We never saw aqny
problems concerning the locking mechanism...

> - Related to this: I see that the example code acquires and releases
>   the semaphore several times during one cycle (at least twice, more
>   often if the optional checks are done). I'm not sure that's a good
>   idea. Even if it doesn't need the master for a moment, I don't
>   think we want to allow another task to grab it until the cycle is
>   finished. (In general, fine-grained locking is a good idea, of
>   course, but here I think the hard-realtime constraints are more
>   important.)
> 
>   Note that this is not prevented by the t_critical
>   check, since t_last_cycle is updated at the beginning of run(). If
>   it was updated at the end, which I'd also suggest, it should not
>   be possible for another task (via the callbacks) to get
>   master_sem, but then it's no problem for the cyclic task to hold
>   it for the whole cycle anyway.
> 
>   The patch (ethercat-1.5-rtai-lock.patch, to be applied after
>   ethercat-1.5-debug-disable.patch) changes these two things,
>   without changing the RTAI semaphores yet.
> 
>   This and the previous point probably also apply to
>   examples/dc_rtai/ which I didn't test.

I think you're right. I'll check and include that one.

> - In my application I need the ability to read/write arbitrary SDOs
>   (from the non-realtime, but kernel-module part of the code, so
>   going through the cdev would be awkward; however at a time when
>   the master is already activated and running).
> 
>   I thought I could just do an
>   ecrt_slave_config_create_sdo_request() when needed. I'm not sure
>   if this function is supposed to be used while the master is
>   running (I couldn't find a statement that forbids it anyway).
>   However, there is no corresponding "delete" function, so used-up
>   SDO requests would accumulate and leak.
> 
>   I see three possible solutions:
> 
>   - Implement ecrt_slave_config_delete_sdo_request() or such. Is
>     there more to it than basically doing
>     "list_del(&req->list, &sc->sdo_requests);" after appropriate
>     checks that the request is not busy etc.? And if done, would it
>     be possible/reasonable to use
>     ecrt_slave_config_create_sdo_request() while the master is
>     running?
> 
>   - Allow changing the index and subindex of an existing request (so
>     I could create some requests on startup and reuse them for
>     arbitrary SDOs -- I only need a fixed number of them
>     simultaneously). This seems to match the TODO list item: "Change
>     SDO index at runtime for SDO request." Is there more to it than
>     calling ec_sdo_request_address() (again, after appropriate
>     checks)?

No, it's as easy as that. The reason for holding that back was just to
avoid another interface change in 1.5. Give it a try.

>   - Implement ecrt_master_sdo_{down,up}load() also for (non-RT)
>     kernel access. Of course, this could be implemented simply on
>     top of ecrt_slave_config_create_sdo_request() etc. if either of
>     the previous two solutions were implemented, but then the master
>     has to call ecrt_sdo_request_read() etc. (as in read_sdo() in
>     examples/mini/mini.c), so it would have to know about the SDO
>     requests which might be a problem in the general case (in my own
>     application probably not -- I know which SDO requests I have and
>     can let my master know about them).

I personally dislike the SDO handler interface. My idea is to drop it
completely and just use the (blocking) ad-hoc SDO transfer functions.
SDO transfers are asynchronous anyway, so perhaps its the best way to
let the user decide , how to cope with the blocking (perhaps spawn a
special asynchronous thread for this). For this, also a
slave-config-based transfer functions had to be implemented.

>     However, I see that this is apparently not required for the way
>     the cdev does SDO transfers, so I tried to adjust this code for
>     in-kernel use and implemented ecrt_master_sdo_{down,up}load() in
>     slave_config.c. I'm not sure if it's intended to be used this
>     way, and I wonder especially about the usage of
>     ec_master_find_slave() (apparently the user-space code uses
>     different ways to identify a slave -- although the tool accepts
>     alias and position as command-line arguments, it only passes the
>     position to the ioctl; however, in the ec_slave_config_t struct
>     used in the kernel we have both alias and position; but since
>     ec_master_find_slave() accepts alias and position, this might be
>     alright). The code is not commented etc., since I'm not yet sure
>     if that's the right way to go
>     (ethercat-1.5-sdo-up-download.patch).
> 
>     When testing it I found that it works most of the time, but
>     sometimes I get errors like the following:
> 
>     EtherCAT ERROR 0-0: Received upload response for wrong SDO (...)
> 
>     EtherCAT ERROR 0-0: Reception of CoE upload response failed: No response.
> 
>     EtherCAT ERROR 0-0: Reception of CoE download response failed: No response.
> 
>     This might be another case of mailbox contention (I'll talk
>     about this in another mail, since it seems to be a major topic
>     of its own), this time between the cyclic task's SDO access and
>     the new functions called from a normal kernel task (and
>     therefore would probably also occur between the former and the
>     cdev, since it uses the same mechanism).
> 
> PS: Of course, all my patches are released under the GPL, version 2
> or any later version, and I hope they will be integrated in future
> releases.

Thanks. As I told you, I am busy in a quite large project at the moment
and will unfortunately not have the time to work on this in the next
weeks. But I'll try to do it along with the other work. ;-)

-- 
Best regards,
Florian Pose

http://etherlab.org



More information about the Etherlab-users mailing list