[etherlab-dev] Mailbox contention

Frank Heckenbach f.heckenbach at fh-soft.de
Sat Jun 18 06:24:57 CEST 2011


- The following problem already occurred with 1.4.0: When I run e.g.
  "ethercat upload" while EoE is active, I sometimes get the error:

    Failed to upload SDO: Input/output error

  Sometimes also EoE gets disturbed:

    May 31 18:41:01 (none) kernel: [ 9864.655979] EtherCAT WARNING 0-0: Received mailbox protocol 0x02 as response.
    May 31 18:41:01 (none) kernel: [ 9864.655998] EtherCAT ERROR 0-0: Failed to process SDO request.
    May 31 18:41:01 (none) kernel: [ 9864.663872] EtherCAT WARNING 0-0: Other mailbox protocol response for eoe0s0.

  I noticed the "FIXME mailbox handler necessary" in
  ec_eoe_state_rx_fetch() where this last message originates from.

- When I access EoE while the master is reading the SDO dictionary,
  the reading is aborted, i.e. the dictionary is truncated at the
  current point. This might be the same issue.

    EtherCAT WARNING 0-0: Other mailbox protocol response for eoe0s0.
    EtherCAT ERROR 0-0: Reception of CoE SDO description response failed: No response.
    EtherCAT WARNING 0: 1 datagram UNMATCHED!
    EtherCAT WARNING 0-1: Failed to receive mbox check datagram for eoe0s1.
    EtherCAT ERROR 0-1: Mailbox error response received - Unknown error reply code 0x0000.
    EtherCAT WARNING 0-1: Invalid mailbox response for eoe0s1.
    EtherCAT WARNING 0-1: Other mailbox protocol response for eoe0s1.
    EtherCAT ERROR 0-1: Timeout while waiting for SDO entry 0x2009:0 description response.

- Similar errors when both the cyclic task and a normal task (kernel
  task or cdev) do SDO transfers at the same time, see my previous

- I suppose that even the master's own operations which use the
  mailbox (e.g., reading the SDO dictionary) could conflict with the
  cyclic thread if it does SDO transfers. (This would be hard to
  protect against since they may happen anytime a slave is
  connected, without the application's knowledge.)

Do I understand it correctly that the issue here is contention of
the slave's mailbox? I.e., while the master can properly multiplex
datagrams from different sources, there is only one mailbox, so if
several sources (here: the CoE FSM for SDO transfer or dictionary
reading vs. the EoE thread, or two parallel SDO transfers) access
the mailbox, the answers may get mixed up if a reply goes to the
wrong client's ec_slave_mbox_fetch() call (which happens randomly,
i.e. race condition)? (Writing to the mailbox, on the other hand,
seems unproblematic, since the slave distributes the messages
between its various functions, right?)
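
To make the mix-up concrete: a client can only tell that it got
somebody else's reply by looking at the protocol type in the mailbox
header. Here is a minimal sketch of that check. It assumes the usual
6-byte mailbox header with the protocol type in the low nibble of
byte 5 (0x02 = EoE, 0x03 = CoE, matching the "protocol 0x02" in the
log above). The function and constant names are made up for
illustration and are not the master's own.

```c
#include <stddef.h>
#include <stdint.h>

/* Mailbox protocol type codes (only the two relevant here). */
enum mbox_type {
    MBOX_TYPE_EOE = 0x02, /* Ethernet over EtherCAT */
    MBOX_TYPE_COE = 0x03  /* CANopen over EtherCAT */
};

/* Assumed size of the mailbox header in bytes. */
#define MBOX_HEADER_SIZE 6

/* Returns the protocol type of a raw mailbox message,
 * or -1 if the message is too short to contain a header. */
static int mbox_reply_type(const uint8_t *data, size_t size)
{
    if (size < MBOX_HEADER_SIZE)
        return -1;
    return data[5] & 0x0F; /* low nibble of byte 5 = type (assumption) */
}

/* Returns 1 if the reply is of the type the caller is waiting for. */
static int mbox_reply_matches(const uint8_t *data, size_t size, int expected)
{
    return mbox_reply_type(data, size) == expected;
}
```

With this, a CoE client that fetches an EoE frame sees a mismatch,
which is the point where the current code gives up and reports an
error.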

More generally, I wonder which kinds of operations can be done
concurrently on an EtherCAT master at all. AFAICS, for my purposes
there are at least six kinds of interesting operations:

- Operations done by the master itself (the bus scanning which is
  quite fast in my setup, but also reading the SDO dictionary which
  takes several seconds since it's a large list, so concurrent
  access is likely if I don't protect against it)

- Access through the cdev (e.g. "ethercat upload")

- EoE access

- SDO transfer by a non-realtime kernel module

- SDO transfer by the cyclic task

- PDO transfer in the cyclic task

Should there be any limitations on running them at the same time
(except natural restrictions, such as that EoE and PDOs cannot be
used until the slaves are configured), or should it all work, so
that any problems I see, such as the above, are really bugs? AFAICS,
the documentation doesn't mention any restrictions, so I'd assume it
should all work.

I had hoped that a quick fix would be to add some kind of mutex for
the mailbox, and I tried to implement it, but it didn't help because
apparently the mailbox reads are independent of the mailbox writes,
i.e. even if the last datagram sent to a slave's mailbox was from
the SDO thread, the next datagram read from it might well be for
EoE.

So it seems that a mailbox demultiplexer (handler, state machine,
whatever) as noted in the FIXME is really needed. On the plus side,
if it's implemented, all six things should work concurrently, right?

AFAICS, the main difficulty in implementing such a demultiplexer is
that mailbox access happens from various places in various state
machines, and each mailbox read needs several states (send check,
receive check and send fetch, receive fetch), and a datagram needs
to be dispatched after it's received from the mailbox. Moreover, the
different tasks run at different speeds (EoE usually much faster)
and a separate mailbox handler would have to run as fast as the
fastest one (i.e., EoE if active, but otherwise the master idle or
operation thread).

Unfortunately, I need my application working in a few weeks, which
might not be enough time for a proper solution, so I'm also looking
for quick&dirty changes, but even this doesn't seem so easy.

Perhaps a solution that doesn't require too large-scale code changes
would be to let each task request and receive mailbox replies as it
does now, and if it gets a wrong one, don't discard it and report an
error, but store it temporarily and retry the request. Before each
request, they would check the temporary buffer for any matching
datagram.

On the plus side, AFAIK each mailbox datagram has a fixed size and
we only need to store a fixed number (per slave) of them (I would
have said only one per protocol, but since there can be several SDO
transfers, that's not quite true, but at least the number is limited
somehow), so at least we wouldn't need dynamic memory allocation
except during initialization.
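
A rough sketch of such a fixed-size per-slave buffer, just to show
the idea. Everything here is hypothetical: the names, the slot count
and the maximum size are placeholders (the real size would come from
the slave's mailbox configuration), and a real version would need
locking because the clients run in different contexts.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STASH_SLOTS    4   /* assumed bound on pending replies per slave */
#define STASH_MAX_SIZE 256 /* assumed maximum mailbox message size */

struct stash_slot {
    int used;                     /* non-zero if the slot holds a reply */
    uint8_t type;                 /* mailbox protocol type of the reply */
    size_t size;                  /* number of valid bytes in data */
    uint8_t data[STASH_MAX_SIZE];
};

struct mbox_stash {
    struct stash_slot slots[STASH_SLOTS];
};

static void stash_init(struct mbox_stash *s)
{
    memset(s, 0, sizeof(*s));
}

/* Store a reply that was fetched by the wrong client.
 * Returns 0 on success, -1 if it is too large or no slot is free. */
static int stash_put(struct mbox_stash *s, uint8_t type,
                     const uint8_t *data, size_t size)
{
    size_t i;

    if (size > STASH_MAX_SIZE)
        return -1;
    for (i = 0; i < STASH_SLOTS; i++) {
        if (!s->slots[i].used) {
            s->slots[i].used = 1;
            s->slots[i].type = type;
            s->slots[i].size = size;
            memcpy(s->slots[i].data, data, size);
            return 0;
        }
    }
    return -1;
}

/* Take the first stored reply of the given type, if any.
 * Returns its size, or -1 if none matches or out is too small.
 * Note: order among replies of the same type is not preserved. */
static int stash_take(struct mbox_stash *s, uint8_t type,
                      uint8_t *out, size_t out_size)
{
    size_t i;

    for (i = 0; i < STASH_SLOTS; i++) {
        struct stash_slot *slot = &s->slots[i];
        if (slot->used && slot->type == type && slot->size <= out_size) {
            memcpy(out, slot->data, slot->size);
            slot->used = 0;
            return (int) slot->size;
        }
    }
    return -1;
}
```

Matching on the protocol type alone is a simplification. With
several SDO transfers in flight, the key would also have to include
something that identifies the individual transfer.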

However, when there's already a matching datagram buffered, we'd
have to skip the check (ec_slave_mbox_prepare_check()) and jump to
another state (after the "receive fetch"), and conversely when
ec_slave_mbox_fetch() returns a "wrong" datagram, we'd have to go
back to the checking state in order to retry. So it still seems to
require bigger changes in the various state machines.
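
The changed control flow could look roughly like the following toy
model. It is not based on the actual state machines: the slave's
mailbox is simulated as a queue of protocol types, the temporary
buffer is reduced to a counter per type, and all names are invented.
It only shows the two extra transitions (skip check and fetch when a
reply is already buffered, and retry after a foreign reply).

```c
#include <stddef.h>

enum rx_state { RX_START, RX_CHECK, RX_FETCH, RX_DONE };

#define NUM_TYPES 16 /* the type is assumed to be a 4-bit value */

/* Simulated slave mailbox plus per-type count of buffered replies. */
struct sim {
    const int *queue;       /* types of the replies waiting in the mailbox */
    size_t queue_len;
    size_t queue_pos;
    int stashed[NUM_TYPES]; /* replies set aside for other clients */
};

struct client {
    enum rx_state state;
    int wanted;             /* protocol type this client is waiting for */
};

/* Executes one step of a client's receive state machine. */
static void rx_step(struct client *c, struct sim *m)
{
    switch (c->state) {
    case RX_START:
        if (m->stashed[c->wanted] > 0) {
            /* Reply already buffered: skip check and fetch. */
            m->stashed[c->wanted]--;
            c->state = RX_DONE;
        } else {
            c->state = RX_CHECK;
        }
        break;
    case RX_CHECK:
        /* Stay here and poll until the mailbox has data. */
        if (m->queue_pos < m->queue_len)
            c->state = RX_FETCH;
        break;
    case RX_FETCH: {
        int type = m->queue[m->queue_pos++] & 0x0F;
        if (type == c->wanted) {
            c->state = RX_DONE;
        } else {
            /* Foreign reply: keep it for its owner and retry. */
            m->stashed[type]++;
            c->state = RX_START;
        }
        break;
    }
    case RX_DONE:
        break;
    }
}
```

On a mismatch the model goes back to RX_START rather than directly
to the check, so that a reply buffered by another client in the
meantime is found first.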

I still hope I'm overlooking a simpler solution (even if quick&dirty
or somewhat inefficient etc.) because at the moment I can hardly
call the mailbox operations working at all. Even in my simple tests
I regularly get errors, as mentioned above, so I don't even dare to
run a real application with it.


Dipl.-Math. Frank Heckenbach <f.heckenbach at fh-soft.de>
Systemprogrammierung, EDV-Beratung
Stubenlohstr. 6, 91052 Erlangen, Deutschland
Tel.: +49-9131-21359
