[etherlab-dev] ethercat-1.5: Various issues

Jun Yuan j.yuan at rtleaders.com
Mon Jun 23 15:54:35 CEST 2014


Hi,

sorry for replying so late. I have just finished reading the first 7
patches, the ones at the beginning that are not related to the mailbox
problem.

2) I haven't used the e1000 driver yet. I think the first change should be
no problem. I'm not sure about the second one, but since Frank had a
problem with it and this patch fixed it (see
http://lists.etherlab.org/pipermail/etherlab-users/2011/001190.html),
I believe it should be OK.

3) A workaround to temporarily disable the debug interface:
http://lists.etherlab.org/pipermail/etherlab-users/2011/001205.html
Well, I'm not familiar with the debug interface, and the debug interface
(EC_DEBUG_IF, enabled with --enable-debug-if) is deactivated by default,
so I can't judge whether this workaround is necessary.

4) Quoting http://lists.etherlab.org/pipermail/etherlab-users/2011/001271.html:

"ethercat download" with a string type cuts the input string at
  the first space (but the size is given correctly, so for the rest
  of the string garbage is sent). This is due to the behaviour of
  ">>" and easily fixed using "read"

Bug confirmed. It should be merged into the repository.

5) "Scheduling while atomic" means that a thread has called schedule()
during an operation which is supposed to be atomic (ie uninterrupted). Use
spinlock instead of semaphore for EoE resolve this problem.
As Florian chose to use semaphore in the changeset 1489:f77a118 with the
comment "EoE processing with kthread", I think we need to listen what he
said about it. I don't know if anyone else has encountered the same problem.
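
For illustration, here is a minimal sketch (made-up names, not the actual
master code) of why a sleeping lock is a problem in an atomic context
while a spinlock is not:

    #include <linux/semaphore.h>
    #include <linux/spinlock.h>

    static struct semaphore eoe_sem;      /* sema_init(&eoe_sem, 1) at init time */
    static DEFINE_SPINLOCK(eoe_lock);

    /* Called from a context that must not sleep (e.g. while already
     * holding another spinlock, or from softirq context): */
    static void touch_eoe_state_bad(void)
    {
            down(&eoe_sem);        /* may sleep -> "scheduling while atomic" */
            /* ... access shared EoE data ... */
            up(&eoe_sem);
    }

    static void touch_eoe_state_ok(void)
    {
            spin_lock(&eoe_lock);  /* busy-waits, never calls schedule() */
            /* ... access shared EoE data ... */
            spin_unlock(&eoe_lock);
    }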

6) Has anyone else had problems with the RTAI example? I have no
experience with RTAI, so I'm not sure whether the patch is necessary.

7) Apart from those changes which have already been added to 1.5.2, we
have:
a. "The "data" parameter to ecrt_master_sdo_download() should be const."
Though it would be nice to have it const, this changes the interface, and
old binaries linked against the old shared library would no longer work
with the new one. I'll leave it as it is.
b. "removes a duplicate block of code in CommandDownload.cpp". Confirmed;
it should be merged into the repository.

8) Good, it makes the mrproper target clean up more thoroughly. It works.


9), 10), 11)
I have a feeling it might be better to introduce a new struct
ec_mailbox_t in mailbox.h without touching ec_datagram_t, since not every
datagram is a mailbox datagram. I'll need time to dig into this further.
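
Just to sketch the idea (a rough draft only, all names are made up and it
assumes the existing ec_datagram_t and ec_slave_t types), something like:

    typedef enum {
            EC_MBOX_CHECK,
            EC_MBOX_FETCH,
            EC_MBOX_SEND
    } ec_mailbox_dgram_kind_t;

    typedef struct {
            ec_datagram_t datagram;        /* plain datagram, left untouched */
            uint8_t expected_protocol;     /* CoE, EoE, FoE, ... */
            ec_mailbox_dgram_kind_t kind;  /* check, fetch or send */
            ec_slave_t *slave;             /* the responsible slave */
    } ec_mailbox_t;

    /* Mailbox users would then pass an ec_mailbox_t to the mailbox helpers,
       while all other datagrams keep using ec_datagram_t as before. */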




On 5 June 2014 16:23, Frank Heckenbach <f.heckenbach at fh-soft.de> wrote:

> Hi,
>
> a long time ago I reported some bugs
> (http://lists.etherlab.org/pipermail/etherlab-users/2011/001271.html and
> http://lists.etherlab.org/pipermail/etherlab-users/2011/001272.html).
> Meanwhile I've fixed all the problems I've encountered during my
> project.
>
> Unfortunately, this mail is quite late. This is because I did the
> respective work as an external developer for a company that uses the
> EtherCAT master, and since the project was already running late (not
> only because of these problems described here ;), I didn't have time
> at the end to write it all up properly etc. Now, while preparing to
> get back to this project with some updates, I can finally finish
> these patches too. (On the positive side, the project has now been
> running for 2-3 years without finding new EtherCAT related errors.)
>
> I attach my complete set of patches, including the patches I've sent
> in previous mails (02-* to 08-*, slightly adjusted;
> 01-ethercat-1.5-header.patch was applied in your code already).
>
> My patches are against 1.5.0 (which was current at the time I did
> the project), to be applied in file name order. I have not tried
> 1.5.1 or 1.5.2 yet. From the ChangeLog it looks like the changes
> between those versions should only have small overlap with mine, but
> of course, there may be some conflicts in the changes or their
> context, since some of my patches are quite substantial, so applying
> them to a newer version might not be trivial. If anything is
> unclear, feel free to ask me.
>
> The patches in detail (unless described already in the mails linked
> above):
>
> - As I described in my previous mails, the main problems I had were
>   about several simultaneous mailbox users, e.g. using SDOs while
>   EoE is active.
>
>   Received mailbox datagrams are not properly dispatched to the
>   correct handler (CoE, EoE, ...) which inevitably leads to problems
>   when several of them run at the same time.
>
>   A proper dispatcher or mailbox state-machine seems quite difficult
>   to fit into the current code, so I solved it (or worked around it,
>   you may say) at the lower level. Here's what I did:
>
>   - Mailbox datagram structures are tagged with some additional
>     fields in ec_datagram_t, so the low-level routines, in
>     particular ec_master_send_datagrams() and
>     ec_master_receive_datagrams(), can recognize them and handle
>     them specially, see below. These new fields contain the expected
>     mailbox protocol, the kind of datagram (check, fetch or send) as
>     well as a pointer to the responsible slave.
>
>   - When a fetch reply is received and the actual mailbox protocol
>     (as read from the datagram contents) doesn't match the
>     expected protocol, the datagram is put into an internal buffer,
>     and another datagram from the buffer which matches the expected
>     protocol is put in its place if there is one -- otherwise an
>     error datagram is returned[1]. (Even if the protocol matches, it
>     may need to be swapped with a buffered one, so datagrams are
>     returned in the correct order.)
>
>   - If a check reply is received, its answer is modified according
>     to whether there is something in the buffer:
>
>     If the check says "yes", but there's nothing in the buffer for
>     the expected protocol, the answer is changed to "no" to avoid
>     case "[1]" because we cannot know at this point if the data in
>     the slave's mailbox are for the expected protocol. Furthermore,
>     a fetch datagram is sent directly to actually fetch the mailbox,
>     and its reply will be stored in the buffer, so if the client
>     asks the next time (with another check datagram), it will get
>     the data (if it was the correct protocol). So to the client it
>     just looks like the slave took a bit longer to fill its mailbox.
>
>     If the check says "no", but we have something buffered, the
>     answer is changed to "yes". Since the slave will now send a
>     fetch datagram, ec_master_send_datagrams() has to catch it and
>     mark it as received with data from the buffer without ever
>     sending it out.
>
>   - The details are a little more complicated than that (e.g. it
>     turned out that a single buffer per protocol isn't always
>     enough, so I made a ring buffer of (currently) 0x10 datagrams
>     per protocol. Also, it is necessary to time-out queued and not
>     yet sent datagrams; and some book-keeping is required for the
>     additional data structures), but that can all be seen in the
>     patches.
>
>     (09-ethercat-1.5-mailbox-tag.patch and
>     10-ethercat-1.5-mailbox-allocate-buffer.patch contain the boring
>     parts (preparations, new data structures) that shouldn't change
>     the behaviour. The main change is in
>     11-ethercat-1.5-mailbox-buffer.patch.)
>
>   - Another problem was, when a fetch datagram is followed by a
>     check datagram (from another user) in the same frame, the check
>     will still get a "yes" answer even though the mailbox was
>     emptied by the fetch. This might depend on the slave devices --
>     I didn't find a definitive statement about this case in the
>     standard, but with our devices this is the observed behaviour.
>     To avoid it, I now make sure that a new frame is started for a
>     check datagram after a fetch datagram, even if the frame size
>     would not require it. Except for a few more bytes on the line
>     due to the new frame header, this should be harmless.
>
>     (12-ethercat-1.5-fetch-check.patch)
>
>   - Now about sending: When several sources (again e.g. SDO and EoE)
>     try to send to the mailbox of the same slave simultaneously (or
>     shortly after each other), only one of them will succeed. The
>     other datagrams are not processed by the slave which can be seen
>     by a working_counter which is still 0.
>
>     Normally, I suppose each user should retry the sending until it
>     succeeds. Some users do so (e.g. EoE in ec_eoe_state_tx_sent()),
>     but there are many places that send to the mailbox that don't
>     retry, so instead of fixing them all, I again did it at the
>     lower level and implemented a retry centrally in
>     ec_master_receive_datagrams().
>
>     (13-ethercat-1.5-send-retry.patch)
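>
>   Condensed, the receive-side handling described in the points above is
>   roughly the following (field and helper names are made up, this is not
>   the literal patch code):
>
>     /* in ec_master_receive_datagrams(), for tagged mailbox datagrams */
>     if (datagram->mbox_kind == EC_MBOX_FETCH) {
>         if (actual_protocol(datagram) != datagram->mbox_protocol) {
>             buffer_store(slave, actual_protocol(datagram), datagram);
>             if (!buffer_take(slave, datagram->mbox_protocol, datagram))
>                 mark_as_error(datagram);   /* case [1] */
>         }
>     } else if (datagram->mbox_kind == EC_MBOX_CHECK) {
>         if (check_says_full(datagram)
>             && buffer_empty(slave, datagram->mbox_protocol)) {
>             force_answer_no(datagram);     /* avoid case [1] */
>             issue_internal_fetch(slave);   /* reply lands in the buffer */
>         } else if (!check_says_full(datagram)
>                    && !buffer_empty(slave, datagram->mbox_protocol)) {
>             force_answer_yes(datagram);    /* the following fetch is served
>                                               from the buffer, not sent out */
>         }
>     }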
>
> - By the time a lost datagram times out, if the interface is busy,
>   the 8-bit datagram index may already have wrapped around and
>   another datagram with the same index been sent, causing confusion
>   in ec_master_receive_datagrams() when the latter one is received
>   if their type and size happen to match (which is not uncommon).
>   Therefore, I added a new check to avoid reusing an index until the
>   datagram is received or timed out.
>
>   (14-ethercat-1.5-index-reuse.patch)
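>
>   Schematically, the check looks roughly like this (simplified, not the
>   literal patch code; it uses the existing datagram_queue and
>   datagram_index fields):
>
>     /* don't hand out an index that is still in flight */
>     static int ec_index_in_flight(ec_master_t *master, uint8_t index)
>     {
>         ec_datagram_t *d;
>
>         list_for_each_entry(d, &master->datagram_queue, queue)
>             if (d->index == index
>                 && (d->state == EC_DATAGRAM_QUEUED
>                     || d->state == EC_DATAGRAM_SENT))
>                 return 1;
>         return 0;
>     }
>
>     /* when assigning indices in ec_master_send_datagrams(): */
>     while (ec_index_in_flight(master, master->datagram_index))
>         master->datagram_index++;     /* uint8_t, wraps around naturally */
>     datagram->index = master->datagram_index++;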
>
> - ec_eoe_run() seems to assume that at this point, the EoE datagram
>   cannot be in state EC_DATAGRAM_QUEUED. This assumption is wrong.
>   Even though my changes above make it more likely to happen, it
>   could happen before.
>
>   In fact, it could even be in state EC_DATAGRAM_INIT, e.g. when the
>   master lock was denied in the send attempt. This leads to an
>   invalid access to datagram_queue and a crash.
>
>   But we cannot check for EC_DATAGRAM_INIT here because it is also
>   set at the very beginning, so EoE processing would never start if
>   the function just returned in this state. So I introduced a new
>   state EC_DATAGRAM_PREQUEUED, set it in ec_eoe_queue() and check
>   for it in ec_eoe_run(). The "sth_to_send" check also tests for
>   this state, otherwise a pending datagram whose sending was once
>   denied would never be sent.
>
>   (15-ethercat-1.5-eoe-prequeue.patch)
>
> - Another major problem I had was frame corruption.
>   As master/device.h says:
>
>   * This memory ring is used to transmit frames. It is necessary to use
>   * different memory regions, because otherwise the network device DMA
>   * could send the same data twice, if it is called twice.
>
>   Indeed, that's what I saw happening, causing various errors in any
>   of the EtherCAT protocols. I found several questionable
>   assumptions in the code:
>
>   - EC_BYTE_TRANSMISSION_TIME_NS is set to 80, which is exactly the
>     best case time. If anything takes a little longer than best
>     case, on a busy EtherCAT interface it's only a matter of time
>     until the buffers overrun. I've added a little reserve in
>     ec_master_idle_thread() (just like ec_master_set_send_interval()
>     did already).
>
>   - The time calculation also didn't consider inter-frame gap and
>     frame preambles. I added just the minimum (20 bytes).
>
>   - ec_master_idle_thread() only considered the last frame sent to
>     calculate the waiting time. However, ec_master_send_datagrams()
>     may send several frames. Therefore, I now have
>     ec_master_send_datagrams() compute and return the total number
>     of bytes sent (including gaps).
>
>   - If ec_master_send_datagrams() sends more than EC_TX_RING_SIZE
>     frames, all waiting time is pointless since it will overwrite
>     its own data before there is a chance to sleep.
>
>     BTW, for debugging this problem, I used another PC with 2 NICs
>     bridged (using brctl). By running Wireshark on the bridge
>     interface, I could see damaged packets coming from the EtherCAT
>     master which actually contained copies of (the initial part of)
>     the frame after next (easy to identify by the datagram index,
>     but also the rest of the data matched) which to me clearly
>     confirms that the buffer was overwritten when 2 more frames were
>     queued before this one was sent out (with EC_TX_RING_SIZE == 2).
>
>     Therefore, I limit the number of frames it will send at once to
>     EC_TX_RING_SIZE.
>
>   - Even with all those changes, I still got corrupted frames
>     (though less often than before). So I just increased
>     EC_TX_RING_SIZE to 0x10. This is still heuristic, of course, but
>     at least I haven't seen any frame corruption since then.
>
>   (16-ethercat-1.5-frame-corruption.patch)
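>
>   In pseudo-code, the idle-thread timing then looks roughly like this
>   (a sketch only; the reserve factor is just an example, not the exact
>   value from the patch):
>
>     #define EC_BYTE_TRANSMISSION_TIME_NS 80  /* 100 Mbit/s, best case */
>     #define EC_FRAME_OVERHEAD_BYTES      20  /* preamble + inter-frame gap */
>
>     sent_bytes = ec_master_send_datagrams(master); /* per the patch: returns
>                                                        total bytes incl. the
>                                                        per-frame overhead */
>     wait_ns  = (u64) sent_bytes * EC_BYTE_TRANSMISSION_TIME_NS;
>     wait_ns += wait_ns / 10;                       /* reserve, since 80 ns/byte
>                                                       is only the best case */
>     /* ... sleep at least wait_ns before queueing the next frames ... */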
>
> - EoE: The TX frame was not properly cleaned up when the send
>   datagram was not received or got no response
>   (working_counter != 1). This caused unregister_netdev() to hang
>   with the following syslog message repeating forever, and with
>   rtnl_mutex held, so the whole networking subsystem remained locked
>   when trying to unload the EC module and the system became mostly
>   unusable till a reboot.
>
>     unregister_netdevice: waiting for eoe0s1 to become free. Usage count = 4
>
>   (17-ethercat-1.5-eoe-tx-cleanup.patch)
>
> - As mentioned in previous mails, some other serious problems I had
>   were about locking:
>
>   - I wonder what is meant to protect access to datagram_queue. The
>     comments are not quite clear, but according to master.h, io_sem
>     is the "Semaphore used in IDLE phase", and looking at the code I
>     figure it is meant to protect datagram_queue in idle phase,
>     whereas application-specific locking should do it during
>     operation phase.
>
>     However, ec_master_queue_external_datagram() uses io_sem and can
>     be called during operation phase, e.g.:
>       ec_master_operation_thread()
>     ->ec_fsm_slave_exec()
>     ->ec_fsm_slave_state_ready()
>     ->ec_fsm_slave_action_process_sdo()
>     ->ec_master_queue_external_datagram()
>
>     After I backported code from your repository to add locking in
>     ec_master_clear_slaves()
>     (18-ethercat-1.5-locking-fix-backport.patch), also the following
>     sequence became possible:
>       ec_master_operation_thread()
>     ->ec_fsm_master_exec()
>     ->ec_fsm_master_state_broadcast()
>     ->ec_master_clear_slaves()
>
>     Also, several places in cdev.c (lines 1840, 1859, 1924, 1943,
>     1962, 1983) use io_sem, and cdev can be used during operation
>     phase.
>
>     OTOH,
>       ec_cdev_ioctl_domain_queue()
>     ->ecrt_domain_queue()
>     ->ec_master_queue_datagram()
>     accesses datagram_queue without acquiring io_sem. It uses
>     master_sem instead which seems to be wrong in any case. Using
>     io_sem instead at least puts it on the same level as the other
>     cdev functions mentioned before.
>
>     (19-ethercat-1.5-io-sem.patch)
>
>     I see that in newer versions (e.g. commit 53b5128e1313), you
>     apparently reverted the callback mechanism from send/receive
>     callbacks back to lock/unlock callbacks as it was in 1.4. I also
>     prefer the latter since they can be used more generally.
>     Therefore I made the respective changes in my 1.5 copy too, but
>     a little differently. In particular, I use the callbacks also in
>     the cdev routines and just anywhere io_sem was used. (io_sem is
>     now only used in the default callbacks themselves.)
>
>     (20-ethercat-1.5-callbacks.patch)
>
>   - In examples/rtai you removed the t_critical check completely in
>     newer versions (the value is still computed, but never used), so
>     a non-RT access that happens at a bad time can now delay the
>     execution of the cyclic task. Is this a good idea? What I did
>     instead is to have the callbacks check t_critical (as before)
>     and sleep (schedule()) when too close. This way they will always
>     succeed (as in your version, no return code needed), but cannot
>     block the cyclic task (provided timings are computed correctly).
>
>     A fine point is that I now need a flag to tell when the cyclic
>     task was stopped. Otherwise, if cleanup took too long, it could
>     happen that e.g. stopping EoE would hang forever as it tried to
>     get the lock because the "critical" time was already reached and
>     the cyclic task would never run again and update the time.
>
>     Also, I think t_last_cycle must be volatile.
>
>     (21-ethercat-1.5-t-critical.patch)
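>
>     Roughly (a sketch of the idea only, function name made up, not the
>     patch code):
>
>       static volatile cycles_t t_last_cycle;
>       static volatile int cyclic_task_running = 1;
>
>       void lock_callback(void *data)
>       {
>           /* too close to the next cyclic invocation: wait instead of
>              denying access, unless the cyclic task has been stopped */
>           while (cyclic_task_running
>                  && get_cycles() - t_last_cycle > t_critical)
>               schedule();
>           /* ... then actually take the lock ... */
>       }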
>
>   - However, I still think (as discussed previously) that using RTAI
>     semaphores in non-RT tasks (i.e. the callbacks) is wrong.
>     According to the RTAI developer, it is necessary that the
>     current task is "RT hardened" in order to be able to use RTAI
>     semaphores. But since I use the callbacks also from the cdev
>     functions now, any Linux process can use them and there is no
>     way to ensure that the caller is RT hardened. OTOH, we can't use
>     normal kernel semaphores in RTAI code.
>
>     So I now use an atomic flag as a hand-made non-blocking
>     semaphore, and each user can wait for it in its own way
>     (rt_sleep() for the RTAI task, schedule() in the callbacks).
>
>     (22-ethercat-1.5-rtai-lock.patch)
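>
>     The flag is basically this (a sketch, not the literal patch code;
>     WAIT_NS is just some small delay):
>
>       static atomic_t master_access = ATOMIC_INIT(0);
>
>       static inline int master_trylock(void)
>       {
>           return atomic_cmpxchg(&master_access, 0, 1) == 0;
>       }
>
>       static inline void master_unlock(void)
>       {
>           atomic_set(&master_access, 0);
>       }
>
>       /* the RTAI task waits in its own way: */
>       while (!master_trylock())
>           rt_sleep(nano2count(WAIT_NS));
>
>       /* the Linux-side callbacks wait differently: */
>       while (!master_trylock())
>           schedule();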
>
>   - A minor point: ext_queue_sem is meant to protect
>     ext_datagram_queue. But it's not used in ecrt_master_send_ext()
>     where ext_datagram_queue is accessed. Instead it's used around
>     the external send callbacks, but it misses the call from
>     ec_master_internal_send_cb(). In the end, it currently doesn't
>     matter since both ec_eoe_queue() and send_cb() are only called
>     from ec_master_eoe_thread() and are therefore automatically
>     serialized, so the semaphore is actually pointless ATM. But if
>     it ever gets important, it should be acquired in
>     ecrt_master_send_ext() instead.
>
>     (23-ethercat-1.5-ext-queue-sem.patch)
>
> - Though I don't use FoE, I happened to notice the use of a wrong
>   wait queue there.
>
>   (24-ethercat-1.5-foe-queue.patch)
>
> - The idle thread doesn't call ec_master_output_stats() regularly,
>   so it's only called after a relevant problem, but only outputs
>   information once a second, so the remaining statistics are
>   swallowed.
>
>   (25-ethercat-1.5-output-stats.patch)
>
> - When a slave's mailbox contains some old data when the master is
>   restarted (this happens almost reproducibly when restarting the
>   master while it's reading the SDO dictionary), the first mailbox
>   response is misinterpreted which (in my case) typically results in
>   an error like this:
>
>   EtherCAT ERROR 0-0: Received unknown response while uploading SDO 0x1C12:00.
>   EtherCAT ERROR 0-0: Failed to read number of assigned PDOs for SM2.
>
>   Followed by:
>
>   EtherCAT ERROR 0-0: Received upload response for wrong SDO (0x1C12:00, requested: 0x1C13:00).
>
>   To avoid this, I fetch the mailbox once before using it for the
>   first time, ignoring any result, whether empty or not.
>
>   (26-ethercat-1.5-clear-mailbox.patch)
>
> - Even after the changes above, several simultaneous CoE (in
>   particular SDO) requests can still get mixed up, since they have
>   the same protocol number, so my mailbox "dispatcher" doesn't help.
>
>   I must say that I don't really understand the separation between
>   master and slave state machines, both of which have their own CoE
>   state machines, and the corresponding separation of "internal" and
>   "external" datagrams. For what I can tell, both are treated
>   completely differently most of the way, but in the end they do
>   exactly the same. (Of course, what else? All actual EtherCAT
>   communication is between a master and a slave, there is no
>   distinct "master CoE" and "slave CoE" protocol.) Problems occur
>   when both master and slave state machine try to do CoE operations
>   at the same time, because the mailbox responses get mixed up.
>
>   Since I didn't want to make larger changes to the code structure
>   now, such as merging those state machines, I changed the code so
>   whichever state machine starts a CoE operation has exclusive
>   access to CoE for this slave until the operation is finished or
>   timed out. (Basically like a semaphore, except the state machines
>   run in the same thread, so an actual semaphore would deadlock.)
>
>   The next problem then is that some code (e.g.
>   ec_fsm_master_exec()) just assumes that the FSM has a datagram to
>   send out in every state, so it always returns 1 unless it's
>   waiting for a reply. With my previous change, this isn't the case
>   anymore, and it cannot be -- unless I'd block the FSM completely
>   while another CoE operation is in progress. (I thought about it,
>   but it might degrade performance, if e.g. a longer-running SDO
>   transfer in the slave FSM could block unrelated operations in the
>   master FSM.) For this reason I introduced a new state
>   EC_DATAGRAM_INVALID, set it when the FSM is blocked and make
>   ec_fsm_master_exec() return 0 if so (to take care of the master
>   FSM), and ec_master_queue_external_datagrams() ignore it in this
>   case (to take care of the slave FSM). (No, I don't particularly
>   like this solution, but I don't see another way without
>   large-scale changes.)
>
>   (27-ethercat-1.5-coe-lock.patch)
>
> - There is no way AFAICS to find out when reading the slaves' SDO
>   dictionaries is finished. This not only affects reliable operation
>   when one wants to use the dictionary, but also performance, since
>   reading is a CoE operation that runs in the master state machine
>   and so, even with my previous patch, blocks other CoE (in
>   particular, SDO) operations, since
>   ec_fsm_master_action_process_sdo() is never reached while the
>   dictionary is being read, so it's not reasonable to start an
>   application task which uses SDOs in this situation. Therefore I
>   added the sdo_dictionary_fetched flag to the state returned by
>   ecrt_slave_config_state(), and don't set the flag until the
>   reading is finished -- rather than until started, as before, which
>   for the other purpose of this flag makes no difference. It's still
>   up to the application to request this state and react to it.
>
>   I also added this flag to ec_ioctl_slave_t and let "ethercat sdo"
>   report if it's not set. This avoids outputting an incomplete list
>   of SDOs, with (usually) a bogus error message
>     EtherCAT: ERROR 0-0: SDO entry 0xXXXX:YY does not exist!
>   for the entry currently being fetched, and it provides a reliable
>   way for start scripts to wait until fetching is completed. (It's
>   better for me to do it in a start script than in the application
>   module since the master enters operation phase as soon as it is
>   requested. Therefore I wait in my script using the output of
>   "ethercat sdo", and use the new flag in ecrt_slave_config_state()
>   only to let my application module verify that fetching was
>   completed.)
>
>   (28-ethercat-1.5-dictionary-fetched.patch)
>
> - The init script needs an "exit" in the "restart" case to avoid
>   hitting the error exit at the end.
>
>   (29-ethercat-1.5-init-restart.patch)
>
> - I see you mostly took my code for SDO up-/downloads I suggested
>   previously
>   (http://lists.etherlab.org/pipermail/etherlab-users/2011/001271.html),
>   though you take a slave_position instead of an ec_slave_config_t
>   parameter. Apparently you did this so you could have the same
>   interface in the kernel and user space (though I've always
>   wondered why you don't use alias and position in user-space too).
>   This raises the question how to get the absolute position, as
>   Graeme Foot asked in
>   http://lists.etherlab.org/pipermail/etherlab-users/2011/001412.html
>   (and got no answer). Well, I'm now using the same work-around
>   Graeme described in this mail. Since I only need the
>   ecrt_master_get_slave() calls during initialization, the overhead
>   doesn't matter much to me, though I don't think it's a really nice
>   interface, when used together with the other kernel functions.
>
>   However, I still had to make a few changes:
>
>   - In case of EINTR and also at the end of
>     ecrt_master_sdo_download(), I think you forgot to clear the
>     request (and thus free the allocated memory).
>
>   - The "data" parameter to ecrt_master_sdo_download() should be
>     const. (While doing related changes, I noticed a duplicate block
>     of code in CommandDownload.cpp; AFAICT the latter copy is
>     spurious and my patch removes it.)
>
>   (updated 07-ethercat-1.5-sdo-up-download.patch)
>
> Back then I asked which kinds of operations can be done concurrently
> on an EtherCAT master. Now I can finally answer my own question.
> With my patches, I can do all of the following at the same time:
>
> - Operations done by the master (bus scan, dictionary fetching, etc.)
> - Access through the cdev (e.g. "ethercat upload")
> - EoE access
> - SDO transfer by a non-realtime kernel module
> - SDO transfer by the cyclic task
> - PDO transfer in the cyclic task
>
> Regards,
> Frank
>
> --
> Dipl.-Math. Frank Heckenbach <f.heckenbach at fh-soft.de>
> Stubenlohstr. 6, 91052 Erlangen, Germany, +49-9131-21359
> Systems Programming, Software Development, IT Consulting
>


More information about the etherlab-dev mailing list