[etherlab-dev] ethercat-1.5: Various issues
f.heckenbach at fh-soft.de
Thu Jun 5 16:23:20 CEST 2014
A long time ago I reported some bugs
Meanwhile I've fixed all the problems I've encountered during my
Unfortunately, this mail is quite late. This is because I did the
respective work as an external developer for a company that uses the
EtherCAT master, and since the project was already running late (not
only because of the problems described here ;), I didn't have time
at the end to write it all up properly etc. Now, while preparing to
get back to this project with some updates, I can finally finish
these patches too. (On the positive side, the project has now been
running for 2-3 years without finding new EtherCAT related errors.)
I attach my complete set of patches, including the patches I've sent
in previous mails (02-* to 08-*, slightly adjusted;
01-ethercat-1.5-header.patch was applied in your code already).
My patches are against 1.5.0 (which was current at the time I did
the project), to be applied in file name order. I have not tried
1.5.1 or 1.5.2 yet. From the ChangeLog it looks like the changes
between those versions should only have small overlap with mine, but
of course, there may be some conflicts in the changes or their
context, since some of my patches are quite substantial, so applying
them to a newer version might not be trivial. If anything is
unclear, feel free to ask me.
The patches in detail (unless described already in the mails linked
- As I described in my previous mails, the main problems I had were
about several simultaneous mailbox users, e.g. using SDOs while
EoE is active.
Received mailbox datagrams are not properly dispatched to the
correct handler (CoE, EoE, ...) which inevitably leads to problems
when several of them run at the same time.
A proper dispatcher or mailbox state-machine seems quite difficult
to fit into the current code, so I solved it (or worked around it,
you may say) at the lower level. Here's what I did:
- Mailbox datagram structures are tagged with some additional
fields in ec_datagram_t, so the low-level routines, in
particular ec_master_send_datagrams() and
ec_master_receive_datagrams(), can recognize them and handle
them specially, see below. These new fields contain the expected
mailbox protocol, the kind of datagram (check, fetch or send) as
well as a pointer to the responsible slave.
- When a fetch reply is received and the actual mailbox protocol
(as read from the datagram contents) doesn't match the
expected protocol, the datagram is put into an internal buffer,
and another datagram from the buffer which matches the expected
protocol is put in its place if there is one -- otherwise an
error datagram is returned. (Even if the protocol matches, it
may need to be swapped with a buffered one, so datagrams are
returned in the correct order.)
- If a check reply is received, its answer is modified according
to whether there is something in the buffer:
If the check says "yes", but there's nothing in the buffer for
the expected protocol, the answer is changed to "no" to avoid
case "" because we cannot know at this point if the data in
the slave's mailbox are for the expected protocol. Furthermore,
a fetch datagram is sent directly to actually fetch the mailbox,
and its reply will be stored in the buffer, so if the client
asks the next time (with another check datagram), it will get
the data (if it was the correct protocol). So to the client it
just looks like the slave took a bit longer to fill its mailbox.
If the check says "no", but we have something buffered, the
answer is changed to "yes". Since the client will now send a
fetch datagram, ec_master_send_datagrams() has to catch it and
mark it as received with data from the buffer without ever
sending it out.
- The details are a little more complicated than that (e.g. it
turned out that a single buffer per protocol isn't always
enough, so I made a ring buffer of (currently) 0x10 datagrams
per protocol. Also, it is necessary to time-out queued and not
yet sent datagrams; and some book-keeping is required for the
additional data structures), but that can all be seen in the
10-ethercat-1.5-mailbox-allocate-buffer.patch contain the boring
parts (preparations, new data structures) that shouldn't change
the behaviour. The main change is in
- Another problem was, when a fetch datagram is followed by a
check datagram (from another user) in the same frame, the check
will still get a "yes" answer even though the mailbox was
emptied by the fetch. This might depend on the slave devices --
I didn't find a definitive statement about this case in the
standard, but with our devices this is the observed behaviour.
To avoid it, I now make sure that a new frame is started for a
check datagram after a fetch datagram, even if the frame size
would not require it. Except for a few more bytes on the line
due to the new frame header, this should be harmless.
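
To illustrate the buffering and the rewriting of check replies
described in the mailbox items above, here is a minimal sketch. It
is not the code from the patches: all names (mbox_ring_t,
mbox_ring_push(), adjust_check_reply(), MBOX_MAX_DATA) are
hypothetical, and only the ring size of 0x10 is taken from the text.
Time-outs and the swapping of fetch replies are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MBOX_RING_SIZE 0x10   /* ring size mentioned in the text */
#define MBOX_MAX_DATA  64     /* hypothetical payload limit for this sketch */

/* Hypothetical buffered mailbox reply. */
typedef struct {
    size_t size;
    unsigned char data[MBOX_MAX_DATA];
} mbox_entry_t;

/* Hypothetical ring buffer, one per slave and mailbox protocol. */
typedef struct {
    mbox_entry_t entries[MBOX_RING_SIZE];
    unsigned int head;   /* oldest buffered entry */
    unsigned int count;  /* number of buffered entries */
} mbox_ring_t;

/* Store a fetched reply for this protocol. Returns false if it does not fit. */
static bool mbox_ring_push(mbox_ring_t *r, const unsigned char *data, size_t size)
{
    assert(r->count <= MBOX_RING_SIZE);
    if (r->count == MBOX_RING_SIZE || size > MBOX_MAX_DATA)
        return false;
    mbox_entry_t *e = &r->entries[(r->head + r->count) % MBOX_RING_SIZE];
    memcpy(e->data, data, size);
    e->size = size;
    r->count++;
    return true;
}

/* Hand out the oldest buffered reply, so replies keep their order. */
static bool mbox_ring_pop(mbox_ring_t *r, mbox_entry_t *out)
{
    if (r->count == 0)
        return false;
    *out = r->entries[r->head];
    r->head = (r->head + 1) % MBOX_RING_SIZE;
    r->count--;
    return true;
}

/*
 * Rewrite the answer of a check datagram for one protocol:
 *  - slave says "yes", nothing buffered: report "no" and request an
 *    internal fetch whose reply will go into the buffer;
 *  - slave says "no", something buffered: report "yes";
 *  - in all cases the reported answer is "is something buffered?".
 */
static bool adjust_check_reply(const mbox_ring_t *r, bool slave_says_yes,
                               bool *need_internal_fetch)
{
    bool buffered = r->count > 0;
    *need_internal_fetch = slave_says_yes && !buffered;
    return buffered;
}
```

In this sketch the client only ever sees "yes" when a reply for its
protocol is already buffered, which matches the behaviour described
above: to the client it looks as if the slave took a bit longer.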
- Now about sending: When several sources (again e.g. SDO and EoE)
try to send to the mailbox of the same slave simultaneously (or
shortly after each other), only one of them will succeed. The
other datagrams are not processed by the slave, which can be seen
from a working_counter that is still 0.
Normally, I suppose each user should retry the sending until it
succeeds. Some users do so (e.g. EoE in ec_eoe_state_tx_sent()),
but there are many places that send to the mailbox that don't
retry, so instead of fixing them all, I again did it at the
lower level and implemented a retry centrally in
- By the time a lost datagram times out, if the interface is busy,
the 8-bit datagram index may already have wrapped around and
another datagram with the same index been sent, causing confusion
in ec_master_receive_datagrams() when the latter one is received
if their type and size happens to match (which is not uncommon).
Therefore, I added a new check to avoid reusing an index until the
datagram is received or timed out.
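
The index check can be pictured roughly as follows. This is only a
sketch under my own assumptions, not the patch itself: index_alloc_t
and its functions are hypothetical names, and the real code keeps
this information in the master's existing data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracker: one flag per possible 8-bit datagram index. */
typedef struct {
    bool in_flight[256];
    uint8_t next;   /* candidate for the next index */
} index_alloc_t;

/*
 * Pick the next index that is not in flight. Returns -1 if all 256
 * indices are in use, in which case the caller has to wait.
 */
static int index_alloc_get(index_alloc_t *a)
{
    for (int tries = 0; tries < 256; tries++) {
        uint8_t idx = a->next++;   /* wraps around after 255 */
        if (!a->in_flight[idx]) {
            a->in_flight[idx] = true;
            return idx;
        }
    }
    return -1;
}

/* Call when the datagram has been received or has timed out. */
static void index_alloc_release(index_alloc_t *a, uint8_t idx)
{
    assert(a->in_flight[idx]);
    a->in_flight[idx] = false;
}
```

After a wrap-around, an index whose datagram is still outstanding is
simply skipped, so a late reply cannot be mistaken for a newer
datagram of the same type and size.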
- ec_eoe_run() seems to assume that at this point, the EoE datagram
cannot be in state EC_DATAGRAM_QUEUED. This assumption is wrong.
Even though my changes above make it more likely to happen, it
could happen before.
In fact, it could even be in state EC_DATAGRAM_INIT, e.g. when the
master lock was denied in the send attempt. This leads to an
invalid access to datagram_queue and a crash.
But we cannot check for EC_DATAGRAM_INIT here because it is also
set at the very beginning, so EoE processing would never start if
the function just returned in this state. So I introduced a new
state EC_DATAGRAM_PREQUEUED, set it in ec_eoe_queue() and check
for it in ec_eoe_run(). The "sth_to_send" check also tests for
this state, otherwise a pending datagram whose sending was once
denied would never be sent.
- Another major problem I had was frame corruption.
As master/device.h says:
* This memory ring is used to transmit frames. It is necessary to use
* different memory regions, because otherwise the network device DMA could
* send the same data twice, if it is called twice.
Indeed, that's what I saw happening, causing various errors in any
of the EtherCAT protocols. I found several questionable
assumptions in the code:
- EC_BYTE_TRANSMISSION_TIME_NS is set to 80, which is exactly the
best case time. If anything takes a little longer than best
case, on a busy EtherCAT interface it's only a matter of time
until the buffers overrun. I've added a little reserve in
ec_master_idle_thread() (just like ec_master_set_send_interval()
- The time calculation also didn't consider inter-frame gap and
frame preambles. I added just the minimum (20 bytes).
- ec_master_idle_thread() only considered the last frame sent to
calculate the waiting time. However, ec_master_send_datagrams()
may send several frames. Therefore, I now have
ec_master_send_datagrams() compute and return the total number
of bytes sent (including gaps).
- If ec_master_send_datagrams() sends more than EC_TX_RING_SIZE
frames, all waiting time is pointless since it will overwrite
its own data before there is a chance to sleep.
BTW, for debugging this problem, I used another PC with 2 NICs
bridged (using brctl). By running Wireshark on the bridge
interface, I could see damaged packets coming from the EtherCAT
master which actually contained copies of (the initial part of)
the frame after next (easy to identify by the datagram index,
but also the rest of the data matched) which to me clearly
confirms that the buffer was overwritten when 2 more frames were
queued before this one was sent out (with EC_TX_RING_SIZE == 2).
Therefore, I limit the number of frames it will send at once to
- Even with all those changes, I still got corrupted frames
(though less often than before). So I just increased
EC_TX_RING_SIZE to 0x10. This is still heuristic, of course, but
at least I haven't seen any frame corruption since then.
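
The waiting time calculation described above can be sketched like
this. The 80 ns per byte and the 20 bytes of overhead per frame are
from the text; the 10% reserve and the names wire_bytes(),
wait_time_ns() and RESERVE_PERCENT are assumptions for illustration,
not the values or names used in the patches.

```c
#include <assert.h>
#include <stddef.h>

#define EC_BYTE_TRANSMISSION_TIME_NS 80  /* best case at 100 Mbit/s */
#define FRAME_OVERHEAD_BYTES 20          /* preamble (8) + inter-frame gap (12) */
#define RESERVE_PERCENT 10               /* hypothetical safety margin */

/* Bytes on the wire for a batch of frames, including per-frame overhead. */
static size_t wire_bytes(const size_t *frame_sizes, size_t n_frames)
{
    assert(frame_sizes != NULL || n_frames == 0);
    size_t total = 0;
    for (size_t i = 0; i < n_frames; i++)
        total += frame_sizes[i] + FRAME_OVERHEAD_BYTES;
    return total;
}

/* Time to wait until the whole batch has left the interface, plus reserve. */
static unsigned long long wait_time_ns(size_t total_wire_bytes)
{
    unsigned long long best =
        (unsigned long long) total_wire_bytes * EC_BYTE_TRANSMISSION_TIME_NS;
    return best + best * RESERVE_PERCENT / 100;
}
```

For example, two frames of 60 and 100 bytes give 200 bytes on the
wire, i.e. 16000 ns in the best case and 17600 ns with the assumed
reserve. The point is that the sum over all frames of a send call is
used, not just the size of the last frame.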
- EoE: The TX frame was not properly cleaned up when the send
datagram was not received or got no response
(working_counter != 1). This caused unregister_netdev() to hang
with the following syslog message repeating forever, and with
rtnl_mutex held, so the whole networking subsystem remained locked
when trying to unload the EC module and the system became mostly
unusable till a reboot.
unregister_netdevice: waiting for eoe0s1 to become free. Usage count = 4
- As mentioned in previous mails, some other serious problems I had
were about locking:
- I wonder what is meant to protect access to datagram_queue. The
comments are not quite clear, but according to master.h, io_sem
is the "Semaphore used in IDLE phase", and looking at the code I
figure it is meant to protect datagram_queue in idle phase,
whereas application-specific locking should do it during
However, ec_master_queue_external_datagram() uses io_sem and can
be called during operation phase, e.g.:
After I backported code from your repository to add locking in
(18-ethercat-1.5-locking-fix-backport.patch), also the following
sequence became possible:
Also, several places in cdev.c (lines 1840, 1859, 1924, 1943,
1962, 1983) use io_sem, and cdev can be used during operation
accesses datagram_queue without acquiring io_sem. It uses
master_sem instead which seems to be wrong in any case. Using
io_sem instead at least puts it on the same level as the other
cdev functions mentioned before.
I see that in newer versions (e.g. commit 53b5128e1313), you
apparently reverted the callback mechanism from send/receive
callbacks back to lock/unlock callbacks as it was in 1.4. I also
prefer the latter since they can be used more generally.
Therefore I made the respective changes in my 1.5 copy too, but
a little differently. In particular, I use the callbacks also in
the cdev routines and just anywhere io_sem was used. (io_sem is
now only used in the default callbacks themselves.)
- In examples/rtai you removed the t_critical check completely in
newer versions (the value is still computed, but never used), so
a non-RT access that happens at a bad time can now delay the
execution of the cyclic task. Is this a good idea? What I did
instead is to have the callbacks check t_critical (as before)
and sleep (schedule()) when too close. This way they will always
succeed (as in your version, no return code needed), but cannot
block the cyclic task (provided timings are computed correctly).
A fine point is that I now need a flag to tell when the cyclic
task was stopped. Otherwise, if cleanup took too long, it could
happen that e.g. stopping EoE would hang forever as it tried to
get the lock because the "critical" time was already reached and
the cyclic task would never run again and update the time.
Also, I think t_last_cycle must be volatile.
- However, I still think (as discussed previously) that using RTAI
semaphores in non-RT tasks (i.e. the callbacks) is wrong.
According to the RTAI developer, it is necessary that the
current task is "RT hardened" in order to be able to use RTAI
semaphores. But since I use the callbacks also from the cdev
functions now, any Linux process can use them and there is no
way to ensure that the caller is RT hardened. OTOH, we can't use
normal kernel semaphores in RTAI code.
So I now use an atomic flag as a hand-made non-blocking
semaphore, and each user can wait for it in its own way
(rt_sleep() for the RTAI task, schedule() in the callbacks).
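
The idea of the hand-made non-blocking semaphore, sketched in
portable C11 rather than kernel code. The names try_lock_t,
try_lock_acquire(), try_lock_release(), lock_with() and yield_noop()
are hypothetical; kernel code would use the kernel's own atomic
operations instead of <stdatomic.h>, and the yield function would be
rt_sleep() or schedule() as described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock built from a single atomic flag. */
typedef struct {
    atomic_flag busy;
} try_lock_t;

#define TRY_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Try to take the lock. Never blocks; returns true on success. */
static bool try_lock_acquire(try_lock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire);
}

/* Release the lock. */
static void try_lock_release(try_lock_t *l)
{
    atomic_flag_clear_explicit(&l->busy, memory_order_release);
}

/* Placeholder for the caller's own way of waiting. */
static void yield_noop(void)
{
}

/* Wait for the lock, yielding in whatever way suits the calling context. */
static void lock_with(try_lock_t *l, void (*yield)(void))
{
    assert(yield != NULL);
    while (!try_lock_acquire(l))
        yield();
}
```

Because the lock itself never sleeps, it does not matter whether the
caller is an RTAI task or an ordinary Linux process; only the way of
waiting differs.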
- A minor point: ext_queue_sem is meant to protect
ext_datagram_queue. But it's not used in ecrt_master_send_ext()
where ext_datagram_queue is accessed. Instead it's used around
the external send callbacks, but it misses the call from
ec_master_internal_send_cb(). In the end, it currently doesn't
matter since both ec_eoe_queue() and send_cb() are only called
from ec_master_eoe_thread() and are therefore automatically
serialized, so the semaphore is actually pointless ATM. But if
it ever gets important, it should be acquired in
- Though I don't use FoE, I happened to notice the use of a wrong
wait queue there.
- The idle thread doesn't call ec_master_output_stats() regularly,
so it's only called after a relevant problem, but only outputs
information once a second, so the remaining statistics are
- When a slave's mailbox contains some old data when the master is
restarted (this happens almost reproducibly when restarting the
master while it's reading the SDO dictionary), the first mailbox
response is misinterpreted which (in my case) typically results in
an error like this:
EtherCAT ERROR 0-0: Received unknown response while uploading SDO 0x1C12:00.
EtherCAT ERROR 0-0: Failed to read number of assigned PDOs for SM2.
EtherCAT ERROR 0-0: Received upload response for wrong SDO (0x1C12:00, requested: 0x1C13:00).
To avoid this, I fetch the mailbox once before using it for the
first time, ignoring any result, whether empty or not.
- Even after the changes above, several simultaneous CoE (in
particular SDO) requests can still get mixed up, since they have
the same protocol number, so my mailbox "dispatcher" doesn't help.
I must say that I don't really understand the separation between
master and slave state machines, both of which have their own CoE
state machines, and the corresponding separation of "internal" and
"external" datagrams. From what I can tell, both are treated
completely differently most of the way, but in the end they do
exactly the same. (Of course, what else? All actual EtherCAT
communication is between a master and a slave, there is no
distinct "master CoE" and "slave CoE" protocol.) Problems occur
when both master and slave state machine try to do CoE operations
at the same time, because the mailbox responses get mixed up.
Since I didn't want to make larger changes to the code structure
now, such as merging those state machines, I changed the code so
whichever state machine starts a CoE operation has exclusive
access to CoE for this slave until the operation is finished or
timed out. (Basically like a semaphore, except the state machines
run in the same thread, so an actual semaphore would deadlock.)
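
A rough sketch of such an exclusive claim, assuming a per-slave
record and a simple time stamp. The names coe_claim_t,
coe_try_claim() and coe_release() are hypothetical and the patches
may do this differently; counter wrap-around of the time stamp is
ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-slave record of who is currently using CoE. */
typedef struct {
    const void *owner;        /* state machine holding the claim, or NULL */
    unsigned long deadline;   /* time after which the claim has expired */
} coe_claim_t;

/*
 * Try to claim CoE for the state machine `fsm`. Never blocks: if
 * another state machine holds an unexpired claim, return false so the
 * caller can try again in a later cycle. A repeated claim by the
 * current owner succeeds and refreshes the deadline.
 */
static bool coe_try_claim(coe_claim_t *c, const void *fsm,
                          unsigned long now, unsigned long timeout)
{
    assert(fsm != NULL);
    if (c->owner != NULL && c->owner != fsm && now < c->deadline)
        return false;
    c->owner = fsm;
    c->deadline = now + timeout;
    return true;
}

/* Give up the claim when the operation is finished. */
static void coe_release(coe_claim_t *c, const void *fsm)
{
    if (c->owner == fsm)
        c->owner = NULL;
}
```

Since a failed claim just returns, both state machines can keep
running in the same thread without deadlocking each other.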
The next problem then is that some code (e.g.
ec_fsm_master_exec()) just assumes that the FSM has a datagram to
send out in every state, so it always returns 1 unless it's
waiting for a reply. With my previous change, this isn't the case
anymore, and it cannot be -- unless I'd block the FSM completely
while another CoE operation is in progress. (I thought about it,
but it might degrade performance, if e.g. a longer-running SDO
transfer in the slave FSM could block unrelated operations in the
master FSM.) For this reason I introduced a new state
EC_DATAGRAM_INVALID, set it when the FSM is blocked and make
ec_fsm_master_exec() return 0 if so (to take care of the master
FSM), and ec_master_queue_external_datagrams() ignore it in this
case (to take care of the slave FSM). (No, I don't particularly
like this solution, but I don't see another way without
- There is no way AFAICS to find out when reading the slaves' SDO
dictionaries is finished. This not only affects reliable operation
when one wants to use the dictionary, but also performance, since
reading is a CoE operation that runs in the master state machine
and so, even with my previous patch, blocks other CoE (in
particular, SDO) operations, since
ec_fsm_master_action_process_sdo() is never reached while the
dictionary is being read, so it's not reasonable to start an
application task which uses SDOs in this situation. Therefore I
added the sdo_dictionary_fetched flag to the state returned by
ecrt_slave_config_state(), and set the flag only once reading has
finished, rather than as soon as it starts, as before (for the
other purpose of this flag this makes no difference). It's still
up to the application to request this state and react to it.
I also added this flag to ec_ioctl_slave_t and let "ethercat sdo"
report if it's not set. This avoids outputting an incomplete list
of SDOs, with (usually) a bogus error message
EtherCAT: ERROR 0-0: SDO entry 0xXXXX:YY does not exist!
for the entry currently being fetched, and it provides a reliable
way for start scripts to wait until fetching is completed. (It's
better for me to do it in a start script than in the application
module since the master enters operation phase as soon as it is
requested. Therefore I wait in my script using the output of
"ethercat sdo", and use the new flag in ecrt_slave_config_state()
only to let my application module verify that fetching was
- The init script needs an "exit" in the "restart" case to avoid
hitting the error exit at the end.
- I see you mostly took my code for SDO up-/downloads I suggested
though you take a slave_position instead of an ec_slave_config_t
parameter. Apparently you did this so you could have the same
interface in the kernel and user space (though I've always
wondered why you don't use alias and position in user-space too).
This raises the question how to get the absolute position, as
Graeme Foot asked in
(and got no answer). Well, I'm now using the same work-around
Graeme described in this mail. Since I only need the
ecrt_master_get_slave() calls during initialization, the overhead
doesn't matter much to me, though I don't think it's a really nice
interface, when used together with the other kernel functions.
However, I still had to make a few changes:
- In case of EINTR and also at the end of
ecrt_master_sdo_download(), I think you forget to clear the
request (and thus free the allocated memory).
- The "data" parameter to ecrt_master_sdo_download() should be
const. (While doing related changes, I noticed a duplicate block
of code in CommandDownload.cpp; AFAICT the latter copy is
spurious and my patch removes it.)
Back then I asked which kinds of operations can be done concurrently
on an EtherCAT master. Now I can finally answer my own question.
With my patches, I can do all of the following at the same time:
- Operations done by the master (bus scan, dictionary fetching, etc.)
- Access through the cdev (e.g. "ethercat upload")
- EoE access
- SDO transfer by a non-realtime kernel module
- SDO transfer by the cyclic task
- PDO transfer in the cyclic task
Dipl.-Math. Frank Heckenbach <f.heckenbach at fh-soft.de>
Stubenlohstr. 6, 91052 Erlangen, Germany, +49-9131-21359
Systems Programming, Software Development, IT Consulting
-------------- next part --------------
An attachment with binary data was scrubbed...
File name   : ethercat-1.5.0-patches.tar.bz2
File type   : application/octet-stream
File size   : 18789 bytes
Description : not available
URL         : <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20140605/cffe1d24/attachment-0001.obj>