[etherlab-users] Error reassigning removed PDO

Thu May 29 11:24:27 CEST 2014

It’s mostly a master problem I think, although some of the worst misbehaviour requires particular functionality in the slave (which may be rarer).

The main problem that I’ve personally run into recently (and coded my own workaround for, just a few minutes ago) was from this scenario:

1. Master starts up, starts doing slave scanning.

2. Application starts up, calls ecrt_request_master, which waits for slave scanning to complete before returning.

3. Application sets up basic configuration and calls ecrt_master_activate.

4. Slaves wind their way up to OP.

5. Meanwhile in the background the master starts reading the CoE dictionary and getting entry descriptions to fill in the names.  (This takes quite a long time.)

6. Application decides something is screwy while this is still happening and calls ecrt_master_release and unloads the master module.

7. Since the master stops dead when this happens, occasionally it has just sent a CoE SDO Info request to a slave but abandoned waiting for the response.  The response is still sitting there in the slave’s mailbox.  The slaves have dropped back to SAFEOP+ERROR because they’re no longer receiving process data.

8. The master service and application are reloaded.

9. The initial scan sees the slaves at >= PREOP so merely acknowledges the error and leaves them at SAFEOP, then starts to read SM+PDOs.

10. When it gets to the slave that has a stale SDO Info response in its mailbox (which is still there, because the slave was never sent back to INIT), it sends an SDO upload request for 0x1C12 and then gets confused, because the first thing it reads back is the stale message rather than the SDO data response it was expecting; it aborts the request and assumes 0 PDOs in that SM.  Hilarity ensues, as I’ve already outlined below.

(This can also occur if the network is disconnected but not unpowered at any time during the CoE dictionary scan, then reconnected later.)

Note that it’s reasonable for the scan to not reset to INIT, because rescans can occur during operation (although having said that, I haven’t looked too closely at whether this disrupts anything).  But I think it’s definitely a master-side bug that it can’t cope with stale responses – that’s just something you always have to expect with mailboxes, especially when there are timeouts involved as well.

My workaround was to change the CoE FSM to check for and discard any stale data in the mailbox prior to beginning any CoE operation.  It seemed to resolve the above issue in a very basic test, but I’ll hopefully know more after a more thorough one tomorrow.
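A minimal sketch of the idea behind that workaround, not the actual patch to the CoE FSM.  It models the slave's mailbox as a small FIFO in plain C; the names sim_mailbox_t, mbox_msg_t, coe_discard_stale and read_pdo_assign_count are all hypothetical and do not exist in the master source.  Only the CoE service codes (0x3 for an SDO response, 0x8 for SDO Information) are taken from the protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CoE service codes as carried in the CoE header. */
enum { COE_SDO_RESPONSE = 0x3, COE_SDO_INFO = 0x8 };

/* Hypothetical, simplified view of one message waiting in a slave mailbox. */
typedef struct {
    uint8_t coe_service; /* which CoE service produced this message */
    uint16_t sdo_index;  /* object index the message refers to */
    uint8_t value;       /* payload byte, e.g. 0x1C12:0 = number of PDOs */
} mbox_msg_t;

#define MBOX_QUEUE_LEN 4

/* Simulated slave: a small FIFO of messages the master has not fetched yet. */
typedef struct {
    mbox_msg_t queue[MBOX_QUEUE_LEN];
    size_t head;
    size_t count;
} sim_mailbox_t;

static void mbox_push(sim_mailbox_t *mb, mbox_msg_t msg)
{
    assert(mb->count < MBOX_QUEUE_LEN);
    mb->queue[(mb->head + mb->count) % MBOX_QUEUE_LEN] = msg;
    mb->count++;
}

/* Returns 1 and fills *out if a message was pending, 0 if the mailbox is empty. */
static int mbox_pop(sim_mailbox_t *mb, mbox_msg_t *out)
{
    if (mb->count == 0)
        return 0;
    *out = mb->queue[mb->head];
    mb->head = (mb->head + 1) % MBOX_QUEUE_LEN;
    mb->count--;
    return 1;
}

/* The workaround: fetch and drop whatever is already waiting.
 * Returns the number of stale messages discarded. */
static unsigned coe_discard_stale(sim_mailbox_t *mb)
{
    mbox_msg_t dropped;
    unsigned n = 0;

    while (mbox_pop(mb, &dropped))
        n++;
    return n;
}

/* Read 0x1C12:0.  Returns the PDO count, or -1 if the first message fetched
 * was not the expected SDO response. */
static int read_pdo_assign_count(sim_mailbox_t *mb, uint8_t slave_pdo_count,
                                 int discard_first)
{
    mbox_msg_t reply;

    if (discard_first)
        coe_discard_stale(mb);

    /* Simulate sending the SDO upload request: the slave queues its answer
     * behind anything that is still pending. */
    mbox_push(mb, (mbox_msg_t){ COE_SDO_RESPONSE, 0x1C12, slave_pdo_count });

    if (!mbox_pop(mb, &reply))
        return -1;
    if (reply.coe_service != COE_SDO_RESPONSE || reply.sdo_index != 0x1C12)
        return -1; /* unexpected message: this is where the scan gives up */
    return reply.value;
}
```

With a stale SDO Info message queued, calling read_pdo_assign_count with discard_first set to 0 fails on the stale message, while setting it to 1 returns the slave's real count.  Note that discarding blindly would also throw away anything legitimate that happened to be waiting (an emergency, say), which is part of why this is only a stopgap.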

It’s not an ideal solution, of course; the underlying problem (which I hinted at below, and posted about in more detail several months ago) is that the Etherlab code assumes that only one thing is going on in the mailboxes at a time, so it only checks them when it’s expecting a response, and throws its virtual hands up when it finds something other than what it wanted.  This is particularly noticeable if a slave sends asynchronous notifications, or can process multiple mailbox protocols in parallel (both of which are allowed in the standards).  The most common examples of these are CoE emergencies and EoE.  And woe betide you if the master happens to be handling a FoE request when an emergency arrives, or a CoE request when an EoE packet arrives, etc.

Ideally the master should have some sort of central dispatcher which is constantly watching mailboxes and handing off incoming data to the protocol state machines as it arrives.  Often this can even be done for “free”: many slaves provide a dedicated “MBoxState” FMMU that can be used to watch for new mailbox messages as part of the regular process datagram, avoiding the need to individually poll the slaves.
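As a rough illustration of what the routing step of such a dispatcher might look like, here is a sketch in plain C.  It assumes a raw mailbox message with the usual 6-byte mailbox header (protocol type in the low nibble of byte 5) followed, for CoE, by a 16-bit little-endian CoE header whose top four bits hold the service code.  The mbox_route function and the ROUTE_* names are hypothetical; nothing like this exists in the current master.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mailbox header "type" values for the protocols discussed here. */
enum { MBOX_TYPE_EOE = 0x02, MBOX_TYPE_COE = 0x03, MBOX_TYPE_FOE = 0x04 };

/* CoE service code for an emergency message (upper 4 bits of CoE header). */
enum { COE_SERVICE_EMERGENCY = 0x1 };

#define MBOX_HEADER_SIZE 6

/* Hypothetical destinations: which handler should receive the message. */
typedef enum {
    ROUTE_INVALID,       /* too short to be a mailbox message */
    ROUTE_COE_EMERGENCY, /* asynchronous CoE emergency */
    ROUTE_COE_FSM,       /* any other CoE traffic, e.g. SDO responses */
    ROUTE_EOE_FSM,
    ROUTE_FOE_FSM,
    ROUTE_UNHANDLED      /* a protocol this sketch does not cover */
} mbox_route_t;

/* Decide where one fetched mailbox message belongs, independently of which
 * protocol state machine (if any) is currently waiting for a response. */
static mbox_route_t mbox_route(const uint8_t *msg, size_t len)
{
    uint8_t type;

    assert(msg != NULL);
    if (len < MBOX_HEADER_SIZE)
        return ROUTE_INVALID;

    type = msg[5] & 0x0F; /* low nibble of header byte 5 */

    switch (type) {
    case MBOX_TYPE_COE: {
        uint16_t coe_header;

        if (len < MBOX_HEADER_SIZE + 2)
            return ROUTE_INVALID;
        coe_header = (uint16_t)(msg[6] | (msg[7] << 8)); /* little endian */
        if ((coe_header >> 12) == COE_SERVICE_EMERGENCY)
            return ROUTE_COE_EMERGENCY;
        return ROUTE_COE_FSM;
    }
    case MBOX_TYPE_EOE:
        return ROUTE_EOE_FSM;
    case MBOX_TYPE_FOE:
        return ROUTE_FOE_FSM;
    default:
        return ROUTE_UNHANDLED;
    }
}
```

The point of the design is that the decision depends only on the message itself, so an emergency that turns up in the middle of a FoE transfer is handed to the emergency handler instead of being treated as a malformed FoE response.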

From: Jun Yuan [mailto:j.yuan at rtleaders.com] 
Sent: Thursday, 29 May 2014 20:40
To: Gavin Lambert
Cc: etherlab-users at etherlab.org
Subject: Re: [etherlab-users] Error reassigning removed PDO

Hello Gavin,

for that specific part of the CoE transfer problem you mentioned, I may have observed the same problem, and I did some analysis on it. This is actually a big problem, and it makes the master quite unreliable for me. I have a temporary fix for it. But I don't know who should be responsible for this CoE mailbox bug. Is it the master? Is it the slave? Or is it a design error in the EtherCAT standard for the mailbox? I'll write another email to elaborate on the problem with the flaky CoE mailbox.

Regards,
Jun

On 29 May 2014 09:37, Gavin Lambert <gavinl at compacsort.com> wrote:

Last month, I wrote:
> TLDR: when reassigning PDOs, why doesn't the master read mappings from
> the slave via CoE?
[...]
> Shouldn't this scenario work?  The PDO is always specified in the SII,
> even if not presently in PDO Assign, so the master ought to know that it
> exists.
> And failing that, it could just try to read the mappings directly from
> the slave (if CoE is available) when unable to load default mapping from
> its cache.  (I think part of the problem is that the CoE data appears to
> be replacing the SII data in the master's PDO cache.)
>
> I'm also a little puzzled as to why (if it wants to have a cache of PDO
> mappings) it seems to limit itself to reading only the currently
> assigned PDOs during the initial scan, instead of fetching all of them.
> They shouldn't be hard to find -- they can be identified purely by their
> index.

There's a further problem with this that I've since discovered: if, during
the master's scan of the PDO assignment registers, something goes wrong with
the CoE transfer of 0x1C1x:0, then the master will log an error but proceed
anyway under the assumption that the slave has 0 PDOs assigned in that SM.
If this is not contradicted by the application using ecrt_slave_config_pdos
(including both assigns and mappings, because it read no default mappings),
then the master will *write 0 back* to the PDO assignment register (if
writable) on activate.
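The sequence can be modelled in a few lines of plain C.  This is only a toy model of the behaviour described above, not code from the master: sim_slave_t, sim_master_sm_t, scan_pdo_assign and activate are hypothetical names, and a negative app_pdo_count stands for "the application did not call ecrt_slave_config_pdos for this SM".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical slave: only the PDO assign count (0x1C1x:0) is modelled. */
typedef struct {
    uint8_t pdo_assign_count;
} sim_slave_t;

/* Hypothetical master-side cache for one sync manager. */
typedef struct {
    uint8_t cached_pdo_count;
} sim_master_sm_t;

/* Scan: if the CoE transfer fails, the master logs an error and carries on
 * as if the slave had reported 0 assigned PDOs. */
static void scan_pdo_assign(sim_master_sm_t *sm, const sim_slave_t *slave,
                            int coe_transfer_ok)
{
    sm->cached_pdo_count = coe_transfer_ok ? slave->pdo_assign_count : 0;
}

/* Activate: an application-supplied assignment wins if there is one
 * (app_pdo_count >= 0); otherwise the cached value is written back. */
static void activate(const sim_master_sm_t *sm, sim_slave_t *slave,
                     int app_pdo_count)
{
    assert(app_pdo_count <= 255);
    if (app_pdo_count >= 0)
        slave->pdo_assign_count = (uint8_t)app_pdo_count;
    else
        slave->pdo_assign_count = sm->cached_pdo_count;
}
```

Starting from a slave with 2 assigned PDOs, a failed scan followed by activate with no application configuration leaves the slave with 0, and every later scan (even a successful one) then reads 0 back.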

This guarantees that the next scan will not find any PDOs, unless the slave
reloads the default assignments during INIT (and with my "slave author" hat
on, all advice I can find says that slaves should not do that, although I
couldn't find official word).

So basically it all seems to point to applications being unreliable (at
least for flexible-assignment slaves) unless they use ecrt_slave_config_pdos
to configure *everything* (including mappings, even for fixed-mapping
slaves).  Which makes me wonder why it bothers scanning for PDO assignments
at all.  Doesn't that just waste time if apps have to use
ecrt_slave_config_pdos anyway?
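For illustration, "configuring everything" means supplying tables that spell out both the assignment and the mapping for every sync manager.  The sketch below uses local stand-in types shaped like ec_pdo_entry_info_t, ec_pdo_info_t and ec_sync_info_t from ecrt.h (the watchdog field is left out) so that it compiles on its own; the slave, its PDO indices (0x1600, 0x1A00) and its entries are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the ecrt.h table types (assumption: same shape,
 * different names, so this compiles without the master headers). */
typedef struct {
    uint16_t index;
    uint8_t subindex;
    uint8_t bit_length;
} pdo_entry_t;

typedef struct {
    uint16_t index;
    unsigned int n_entries;
    const pdo_entry_t *entries;
} pdo_t;

typedef enum { DIR_OUTPUT, DIR_INPUT } dir_t;

typedef struct {
    uint8_t index; /* sync manager number */
    dir_t dir;
    unsigned int n_pdos;
    const pdo_t *pdos;
} sync_t;

/* Hypothetical slave: one RxPDO and one TxPDO, mappings given explicitly
 * even though a fixed-mapping slave would not strictly need them. */
static const pdo_entry_t rx_entries[] = {
    { 0x7000, 0x01, 16 },
    { 0x7000, 0x02, 8 },
};
static const pdo_entry_t tx_entries[] = {
    { 0x6000, 0x01, 16 },
    { 0x6000, 0x02, 16 },
};
static const pdo_t rx_pdos[] = { { 0x1600, 2, rx_entries } };
static const pdo_t tx_pdos[] = { { 0x1A00, 2, tx_entries } };

/* Assignment for both process-data sync managers. */
static const sync_t syncs[] = {
    { 2, DIR_OUTPUT, 1, rx_pdos },
    { 3, DIR_INPUT, 1, tx_pdos },
};
#define N_SYNCS (sizeof syncs / sizeof syncs[0])

/* Sanity helper: total number of bits mapped into one sync manager. */
static unsigned int sync_bit_length(const sync_t *s)
{
    unsigned int bits = 0;
    unsigned int p, e;

    assert(s != NULL);
    for (p = 0; p < s->n_pdos; p++)
        for (e = 0; e < s->pdos[p].n_entries; e++)
            bits += s->pdos[p].entries[e].bit_length;
    return bits;
}

/* With the real headers, tables of this shape (using the ecrt.h types)
 * would be handed to ecrt_slave_config_pdos(sc, n_syncs, syncs). */
```

Because the tables carry both the assignment and the mapping, the result no longer depends on whatever the scan did or did not manage to read from the slave.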

Given how flaky mailbox handling is in general (as previously mentioned),
I'm surprised this hasn't come up more often.
