[etherlab-dev] ethercat-1.5: Various issues

Thu Jun 5 16:23:20 CEST 2014

Hi,

a long time ago I reported some bugs
(http://lists.etherlab.org/pipermail/etherlab-users/2011/001271.html and
http://lists.etherlab.org/pipermail/etherlab-users/2011/001272.html).
Meanwhile I've fixed all the problems I've encountered during my
project.

Unfortunately, this mail is quite late. This is because I did the
respective work as an external developer for a company that uses the
EtherCAT master, and since the project was already running late (not
only because of these problems described here ;), I didn't have time
at the end to write it all up properly etc. Now, while preparing to
get back to this project with some updates, I can finally finish
these patches too. (On the positive side, the project has now been
running for 2-3 years without finding new EtherCAT related errors.)

I attach my complete set of patches, including the patches I've sent
in previous mails (02-* to 08-*, slightly adjusted;
01-ethercat-1.5-header.patch was applied in your code already).

My patches are against 1.5.0 (which was current at the time I did
the project), to be applied in file name order. I have not tried
1.5.1 or 1.5.2 yet. From the ChangeLog it looks like the changes
between those versions should only have small overlap with mine, but
of course, there may be some conflicts in the changes or their
context, since some of my patches are quite substantial, so applying
them to a newer version might not be trivial. If anything is
unclear, feel free to ask me.

The patches in detail (unless described already in the mails linked
above):

- As I described in my previous mails, the main problems I had were
  about several simultaneous mailbox users, e.g. using SDOs while
  EoE is active.

  Received mailbox datagrams are not properly dispatched to the
  correct handler (CoE, EoE, ...) which inevitably leads to problems
  when several of them run at the same time.

  A proper dispatcher or mailbox state-machine seems quite difficult
  to fit into the current code, so I solved it (or worked around it,
  you may say) at the lower level. Here's what I did:

  - Mailbox datagram structures are tagged with some additional
    fields in ec_datagram_t, so the low-level routines, in
    particular ec_master_send_datagrams() and
    ec_master_receive_datagrams(), can recognize them and handle
    them specially, see below. These new fields contain the expected
    mailbox protocol, the kind of datagram (check, fetch or send) as
    well as a pointer to the responsible slave.

  - When a fetch reply is received and the actual mailbox protocol
    (as read from the datagram contents) doesn't match the
    expected protocol, the datagram is put into an internal buffer,
    and another datagram from the buffer which matches the expected
    protocol is put in its place if there is one -- otherwise an
    error datagram is returned[1]. (Even if the protocol matches, it
    may need to be swapped with a buffered one, so datagrams are
    returned in the correct order.)

  - If a check reply is received, its answer is modified according
    to whether there is something in the buffer:

    If the check says "yes", but there's nothing in the buffer for
    the expected protocol, the answer is changed to "no" to avoid
    case "[1]" because we cannot know at this point if the data in
    the slave's mailbox are for the expected protocol. Furthermore,
    a fetch datagram is sent directly to actually fetch the mailbox,
    and its reply will be stored in the buffer, so if the client
    asks the next time (with another check datagram), it will get
    the data (if it was the correct protocol). So to the client it
    just looks like the slave took a bit longer to fill its mailbox.

    If the check says "no", but we have something buffered, the
    answer is changed to "yes". Since the slave will now send a
    fetch datagram, ec_master_send_datagrams() has to catch it and
    mark it as received with data from the buffer without ever
    sending it out.

  - The details are a little more complicated than that (e.g. it
    turned out that a single buffer per protocol isn't always
    enough, so I made a ring buffer of (currently) 0x10 datagrams
    per protocol. Also, it is necessary to time-out queued and not
    yet sent datagrams; and some book-keeping is required for the
    additional data structures), but that can all be seen in the
    patches.

    (09-ethercat-1.5-mailbox-tag.patch and
    10-ethercat-1.5-mailbox-allocate-buffer.patch contain the boring
    parts (preparations, new data structures) that shouldn't change
    the behaviour. The main change is in
    11-ethercat-1.5-mailbox-buffer.patch.)

  - Another problem was, when a fetch datagram is followed by a
    check datagram (from another user) in the same frame, the check
    will still get a "yes" answer even though the mailbox was
    emptied by the fetch. This might depend on the slave devices --
    I didn't find a definitive statement about this case in the
    standard, but with our devices this is the observed behaviour.
    To avoid it, I now make sure that a new frame is started for a
    check datagram after a fetch datagram, even if the frame size
    would not require it. Except for a few more bytes on the line
    due to the new frame header, this should be harmless.

    (12-ethercat-1.5-fetch-check.patch)

  - Now about sending: When several sources (again e.g. SDO and EoE)
    try to send to the mailbox of the same slave simultaneously (or
    shortly after each other), only one of them will succeed. The
    other datagrams are not processed by the slave which can be seen
    by a working_counter which is still 0.

    Normally, I suppose each user should retry the sending until it
    succeeds. Some users do so (e.g. EoE in ec_eoe_state_tx_sent()),
    but there are many places that send to the mailbox that don't
    retry, so instead of fixing them all, I again did it at the
    lower level and implemented a retry centrally in
    ec_master_receive_datagrams().

    (13-ethercat-1.5-send-retry.patch)

- By the time a lost datagram times out, if the interface is busy,
  the 8-bit datagram index may already have wrapped around and
  another datagram with the same index been sent, causing confusion
  in ec_master_receive_datagrams() when the latter one is received
  if their type and size happens to match (which is not uncommon).
  Therefore, I added a new check to avoid reusing an index until the
  datagram is received or timed out.

  (14-ethercat-1.5-index-reuse.patch)

- ec_eoe_run() seems to assume that at this point, the EoE datagram
  cannot be in state EC_DATAGRAM_QUEUED. This assumption is wrong.
  Even though my changes above make it more likely to happen, it
  could happen before.

  In fact, it could even be in state EC_DATAGRAM_INIT, e.g. when the
  master lock was denied in the send attempt. This leads to an
  invalid access to datagram_queue and a crash.

  But we cannot check for EC_DATAGRAM_INIT here because it is also
  set at the very beginning, so EoE processing would never start if
  the function just returned in this state. So I introduced a new
  state EC_DATAGRAM_PREQUEUED, set it in ec_eoe_queue() and check
  for it in ec_eoe_run(). The "sth_to_send" check also tests for
  this state, otherwise a pending datagram whose sending was once
  denied would never be sent.

  (15-ethercat-1.5-eoe-prequeue.patch)

- Another major problem I had was frame corruption.
  As master/device.h says:

  * This memory ring is used to transmit frames. It is necessary to use
  * different memory regions, because otherwise the network device DMA could
  * send the same data twice, if it is called twice.

  Indeed, that's what I saw happening, causing various errors in any
  of the EtherCAT protocols. I found several questionable
  assumptions in the code:

  - EC_BYTE_TRANSMISSION_TIME_NS is set to 80, which is exactly the
    best case time. If anything takes a little longer than best
    case, on a busy EtherCAT interface it's only a matter of time
    until the buffers overrun. I've added a little reserve in
    ec_master_idle_thread() (just like ec_master_set_send_interval()
    did already).

  - The time calculation also didn't consider inter-frame gap and
    frame preambles. I added just the minimum (20 bytes).

  - ec_master_idle_thread() only considered the last frame sent to
    calculate the waiting time. However, ec_master_send_datagrams()
    may send several frames. Therefore, I now have
    ec_master_send_datagrams() compute and return the total number
    of bytes sent (including gaps).

  - If ec_master_send_datagrams() sends more than EC_TX_RING_SIZE
    frames, all waiting time is pointless since it will overwrite
    its own data before there is a chance to sleep.

    BTW, for debugging this problem, I used another PC with 2 NICs
    bridged (using brctl). By running Wireshark on the bridge
    interface, I could see damaged packets coming from the EtherCAT
    master which actually contained copies of (the initial part of)
    the frame after next (easy to identify by the datagram index,
    but also the rest of the data matched) which to me clearly
    confirms that the buffer was overwritten when 2 more frames were
    queued before this one was sent out (with EC_TX_RING_SIZE == 2).

    Therefore, I limit the number of frames it will send at once to
    EC_TX_RING_SIZE.

  - Even with all those changes, I still got corrupted frames
    (though less often than before). So I just increased
    EC_TX_RING_SIZE to 0x10. This is still heuristic, of course, but
    at least I haven't seen any frame corruption since then.

  (16-ethercat-1.5-frame-corruption.patch)

- EoE: The TX frame was not properly cleaned up when the send
  datagram was not received or got no response
  (working_counter != 1). This caused unregister_netdev() to hang
  with the following syslog message repeating forever, and with
  rtnl_mutex held, so the whole networking subsystem remained locked
  when trying to unload the EC module and the system became mostly
  unusable till a reboot.

    unregister_netdevice: waiting for eoe0s1 to become free. Usage count = 4

  (17-ethercat-1.5-eoe-tx-cleanup.patch)

- As mentioned in previous mails, some other serious problems I had
  were about locking:

  - I wonder what is meant to protect access to datagram_queue. The
    comments are not quite clear, but according to master.h, io_sem
    is the "Semaphore used in IDLE phase", and looking at the code I
    figure it is meant to protect datagram_queue in idle phase,
    whereas application-specific locking should do it during
    operation phase.

    However, ec_master_queue_external_datagram() uses io_sem and can
    be called during operation phase, e.g.:
      ec_master_operation_thread()
    ->ec_fsm_slave_exec()
    ->ec_fsm_slave_state_ready()
    ->ec_fsm_slave_action_process_sdo()
    ->ec_master_queue_external_datagram()

    After I backported code from your repository to add locking in
    ec_master_clear_slaves()
    (18-ethercat-1.5-locking-fix-backport.patch), also the following
    sequence became possible:
      ec_master_operation_thread()
    ->ec_fsm_master_exec()
    ->ec_fsm_master_state_broadcast()
    ->ec_master_clear_slaves()

    Also, several places in cdev.c (lines 1840, 1859, 1924, 1943,
    1962, 1983) use io_sem, and cdev can be used during operation
    phase.

    OTOH,
      ec_cdev_ioctl_domain_queue()
    ->ecrt_domain_queue()
    ->ec_master_queue_datagram()
    accesses datagram_queue without acquiring io_sem. It uses
    master_sem instead which seems to be wrong in any case. Using
    io_sem instead at least puts it on the same level as the other
    cdev functions mentioned before.

    (19-ethercat-1.5-io-sem.patch)

    I see that in newer versions (e.g. commit 53b5128e1313), you
    apparently reverted the callback mechanism from send/receive
    callbacks back to lock/unlock callbacks as it was in 1.4. I also
    prefer the latter since they can be used more generally.
    Therefore I made the respective changes in my 1.5 copy too, but
    a little differently. In particular, I use the callbacks also in
    the cdev routines and just anywhere io_sem was used. (io_sem is
    now only used in the default callbacks themselves.)

    (20-ethercat-1.5-callbacks.patch)

  - In examples/rtai you removed the t_critical check completely in
    newer versions (the value is still computed, but never used), so
    a non-RT access that happens at a bad time can now delay the
    execution of the cyclic task. Is this a good idea? What I did
    instead is to have the callbacks check t_critical (as before)
    and sleep (schedule()) when too close. This way they will always
    succeed (as in your version, no return code needed), but cannot
    block the cyclic task (provided timings are computed correctly).

    A fine point is that I now need a flag to tell when the cyclic
    task was stopped. Otherwise, if cleanup took too long, it could
    happen that e.g. stopping EoE would hang forever as it tried to
    get the lock because the "critical" time was already reached and
    the cyclic task would never run again and update the time.

    Also, I think t_last_cycle must be volatile.

    (21-ethercat-1.5-t-critical.patch)

  - However, I still think (as discussed previously) that using RTAI
    semaphores in non-RT tasks (i.e. the callbacks) is wrong.
    According to the RTAI developer, it is necessary that the
    current task is "RT hardened" in order to be able to use RTAI
    semaphores. But since I use the callbacks also from the cdev
    functions now, any Linux process can use them and there is no
    way to ensure that the caller is RT hardened. OTOH, we can't use
    normal kernel semaphores in RTAI code.

    So I now use an atomic flag as a hand-made non-blocking
    semaphore, and each user can wait for it in its own way
    (rt_sleep() for the RTAI task, schedule() in the callbacks).

    (22-ethercat-1.5-rtai-lock.patch)

  - A minor point: ext_queue_sem is meant to protect
    ext_datagram_queue. But it's not used in ecrt_master_send_ext()
    where ext_datagram_queue is accessed. Instead it's used around
    the external send callbacks, but it misses the call from
    ec_master_internal_send_cb(). In the end, it currently doesn't
    matter since both ec_eoe_queue() and send_cb() are only called
    from ec_master_eoe_thread() and are therefore automatically
    serialized, so the semaphore is actually pointless ATM. But if
    it ever gets important, it should be acquired in
    ecrt_master_send_ext() instead.

    (23-ethercat-1.5-ext-queue-sem.patch)

- Though I don't use FoE, I happened to notice the use of a wrong
  wait queue there.

  (24-ethercat-1.5-foe-queue.patch)

- The idle thread doesn't call ec_master_output_stats() regularly,
  so it's only called after a relevant problem, but only outputs
  information once a second, so the remaining statistics are
  swallowed.

  (25-ethercat-1.5-output-stats.patch)

- When a slave's mailbox contains some old data when the master is
  restarted (this happens almost reproducibly when restarting the
  master while it's reading the SDO dictionary), the first mailbox
  response is misinterpreted which (in my case) typically results in
  an error like this:

  EtherCAT ERROR 0-0: Received unknown response while uploading SDO 0x1C12:00.
  EtherCAT ERROR 0-0: Failed to read number of assigned PDOs for SM2.

  Followed by:

  EtherCAT ERROR 0-0: Received upload response for wrong SDO (0x1C12:00, requested: 0x1C13:00).

  To avoid this, I fetch the mailbox once before using it for the
  first time, ignoring any result, whether empty or not.

  (26-ethercat-1.5-clear-mailbox.patch)

- Even after the changes above, several simultaneous CoE (in
  particular SDO) requests can still get mixed up, since they have
  the same protocol number, so my mailbox "dispatcher" doesn't help.

  I must say that I don't really understand the separation between
  master and slave state machines, both of which have their own CoE
  state machines, and the corresponding separation of "internal" and
  "external" datagrams. For what I can tell, both are treated
  completely differently most of the way, but in the end they do
  exactly the same. (Of course, what else? All actual EtherCAT
  communication is between a master and a slave, there is no
  distinct "master CoE" and "slave CoE" protocol.) Problems occur
  when both master and slave state machine try to do CoE operations
  at the same time, because the mailbox responses get mixed up.

  Since I didn't want to make larger changes to the code structure
  now, such as merging those state machines, I changed the code so
  whichever state machine starts a CoE operation has exclusive
  access to CoE for this slave until the operation is finished or
  timed out. (Basically like a semaphore, except the state machines
  run in the same thread, so an actual semaphore would deadlock.)

  The next problem then is that some code (e.g.
  ec_fsm_master_exec()) just assumes that the FSM has a datagram to
  send out in every state, so it always returns 1 unless it's
  waiting for a reply. With my previous change, this isn't the case
  anymore, and it cannot be -- unless I'd block the FSM completely
  while another CoE operation is in progress. (I thought about it,
  but it might degrade performance, if e.g. a longer-running SDO
  transfer in the slave FSM could block unrelated operations in the
  master FSM.) For this reason I introduced a new state
  EC_DATAGRAM_INVALID, set it when the FSM is blocked and make
  ec_fsm_master_exec() return 0 if so (to take care of the master
  FSM), and ec_master_queue_external_datagrams() ignore it in this
  case (to take care of the slave FSM). (No, I don't particularly
  like this solution, but I don't see another way without
  large-scale changes.)

  (27-ethercat-1.5-coe-lock.patch)

- There is no way AFAICS to find out when reading the slaves' SDO
  dictionaries is finished. This not only affects reliable operation
  when one wants to use the dictionary, but also performance, since
  reading is a CoE operation that runs in the master state machine
  and so, even with my previous patch, blocks other CoE (in
  particular, SDO) operations, since
  ec_fsm_master_action_process_sdo() is never reached while the
  dictionary is being read, so it's not reasonable to start an
  application task which uses SDOs in this situation. Therefore I
  added the sdo_dictionary_fetched flag to the state returned by
  ecrt_slave_config_state(), and don't set the flag until the
  reading is finished -- rather than until started, as before, which
  for the other purpose of this flag makes no difference. It's still
  up to the application to request this state and react to it.

  I also added this flag to ec_ioctl_slave_t and let "ethercat sdo"
  report if it's not set. This avoids outputting an incomplete list
  of SDOs, with (usually) a bogus error message
    EtherCAT: ERROR 0-0: SDO entry 0xXXXX:YY does not exist!
  for the entry currently being fetched, and it provides a reliable
  way for start scripts to wait until fetching is completed. (It's
  better for me to do it in a start script than in the application
  module since the master enters operation phase as soon as it is
  requested. Therefore I wait in my script using the output of
  "ethercat sdo", and use the new flag in ecrt_slave_config_state()
  only to let my application module verify that fetching was
  completed.)

  (28-ethercat-1.5-dictionary-fetched.patch)

- The init script needs an "exit" in the "restart" case to avoid
  hitting the error exit at the end.

  (29-ethercat-1.5-init-restart.patch)

- I see you mostly took my code for SDO up-/downloads I suggested
  previously
  (http://lists.etherlab.org/pipermail/etherlab-users/2011/001271.html),
  though you take a slave_position instead of an ec_slave_config_t
  parameter. Apparently you did this so you could have the same
  interface in the kernel and user space (though I've always
  wondered why you don't use alias and position in user-space too).
  This raises the question how to get the absolute position, as
  Graeme Foot asked in
  http://lists.etherlab.org/pipermail/etherlab-users/2011/001412.html
  (and got no answer). Well, I'm now using the same work-around
  Graeme described in this mail. Since I only need the
  ecrt_master_get_slave() calls during initialization, the overhead
  doesn't matter much to me, though I don't think it's a really nice
  interface, when used together with the other kernel functions.

  However, I still had to make a few changes:

  - In case of EINTR and also at the end of
    ecrt_master_sdo_download(), I think you forget to clear the
    request (and thus free the allocated memory).

  - The "data" parameter to ecrt_master_sdo_download() should be
    const. (While doing related changes, I noticed a duplicate block
    of code in CommandDownload.cpp; AFAICT the latter copy is
    spurious and my patch removes it.)

  (updated 07-ethercat-1.5-sdo-up-download.patch)

Back then I asked which kinds of operations can be done concurrently
on an EtherCAT master. Now I can finally answer my own question.
With my patches, I can do all of the following at the same time:

- Operations done by the master (bus scan, dictionary fetching, etc.)
- Access through the cdev (e.g. "ethercat upload")
- EoE access
- SDO transfer by a non-realtime kernel module
- SDO transfer by the cyclic task
- PDO transfer in the cyclic task

Regards,
Frank

-- 
Dipl.-Math. Frank Heckenbach <f.heckenbach at fh-soft.de>
Stubenlohstr. 6, 91052 Erlangen, Germany, +49-9131-21359
Systems Programming, Software Development, IT Consulting
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ethercat-1.5.0-patches.tar.bz2
Type: application/octet-stream
Size: 18789 bytes
Desc: not available
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20140605/cffe1d24/attachment-0003.obj>