[etherlab-dev] Master re-request race with slave mailboxes

Wed Aug 13 05:49:49 CEST 2014

Hi,

Is it expected that after ecrt_request_master(), all online slaves are in
PREOP (or possibly stuck in INIT with error_flag=1)?  Or is an application
expected to explicitly verify the state of all slaves before trying to do
anything?

In the continuing saga of fun with mailbox SDOs, I've found that even with
Frank's or Knud's patches to reduce mailbox contention, there are still some
issues that stem from the slave state not being as expected.

On startup, my application requests the master and then uses
ecrt_master_sdo_upload() to fetch certain information from slaves (e.g.
profile, version, etc.), both for diagnostics and to help ensure the config
is sane.  This normally works fine, but there can be problems if it happens
too soon after the master service is started or after the master was last
released.

In particular, when the master is deactivated or released it internally
schedules a transition back to PREOP for all slaves.  If the master is
re-requested too quickly, this transition may not even have started yet.
Since SDO requests are disallowed (and the request state machines are not
processed) during slave reconfiguration, the master can end up doing two
consecutive writes: first the upload request from the application, then a
retry or occasionally something involving 0x1C12 and 0x1C13.

This has two consequences.  Firstly, the second request can fail due to an
unexpected response, which in turn fails the entire slave configuration
(unless it is retried, as in Frank's patches).  Secondly, the application
request times out: the request state machine is paused just after sending
the request and, when resumed, assumes that it only needs to wait for the
reply, but in the meantime the mailbox has been reset out from under it.
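
To make the timeout case concrete, here is a toy model of that sequence in
C.  It is only a sketch and not the master's actual code: every name in it
(toy_mailbox_t, toy_request_t, toy_run and so on) is invented for the
illustration, and it assumes that returning a slave to PREOP simply discards
any reply still pending in the mailbox.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the EtherLab master. */
typedef enum { REQ_IDLE, REQ_WAIT_REPLY, REQ_DONE, REQ_TIMEOUT } req_state_t;

typedef struct {
    bool reply_pending;   /* the slave has a mailbox reply queued for us */
} toy_mailbox_t;

typedef struct {
    req_state_t state;
    int cycles_waited;
} toy_request_t;

#define TOY_TIMEOUT_CYCLES 5

/* Send the upload request; the slave queues its reply in the mailbox. */
static void toy_request_start(toy_request_t *req, toy_mailbox_t *mbox)
{
    mbox->reply_pending = true;
    req->state = REQ_WAIT_REPLY;
    req->cycles_waited = 0;
}

/* Assumption: reconfiguration resets the mailbox and drops pending replies. */
static void toy_reconfigure_slave(toy_mailbox_t *mbox)
{
    mbox->reply_pending = false;
}

/* One cycle of the request state machine: it only ever waits for a reply. */
static void toy_request_step(toy_request_t *req, toy_mailbox_t *mbox)
{
    assert(req != NULL && mbox != NULL);
    if (req->state != REQ_WAIT_REPLY)
        return;
    if (mbox->reply_pending) {
        mbox->reply_pending = false;
        req->state = REQ_DONE;
    } else if (++req->cycles_waited >= TOY_TIMEOUT_CYCLES) {
        req->state = REQ_TIMEOUT;
    }
}

/* Run one request, optionally reconfiguring the slave right after the send. */
req_state_t toy_run(bool reconfigure_in_between)
{
    toy_mailbox_t mbox = { false };
    toy_request_t req = { REQ_IDLE, 0 };
    int i;

    toy_request_start(&req, &mbox);
    if (reconfigure_in_between)
        toy_reconfigure_slave(&mbox);   /* request machine is paused here */
    for (i = 0; i < 2 * TOY_TIMEOUT_CYCLES; i++)
        toy_request_step(&req, &mbox);
    return req.state;
}
```

In this model toy_run(false) ends in REQ_DONE, while toy_run(true) ends in
REQ_TIMEOUT, because the resumed request machine keeps waiting for a reply
that was discarded by the reset.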

And of course this also means that currently, when ecrt_request_master()
returns, some slaves may still be in a non-PREOP state pending transition to
PREOP, so it is not possible to rely on accessing SDOs that are "PREOP
only".  This probably isn't a big problem, though, as most of those will be
used with ecrt_slave_config_sdo* instead, which is safer.

Another interesting quirk that I noticed along the way is that
ecrt_request_master() internally waits on master->config_busy.  However,
this flag is toggled (and the waitqueue released) in between each slave, so
even if slave configuration has started, ecrt_request_master() will block
only until the master finishes configuring the current slave.  It then
returns to the application while configuration continues in the background,
which seems of dubious usefulness to me.  ("Slave configuration" here refers
to returning the slaves to PREOP.)
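
The effect can be shown with a small single-threaded toy model in C.  Again
this is a sketch, not the master's code: toy_pending_at_wakeup() and its
toggle_per_slave parameter are invented for the illustration, the real
master uses a kernel wait queue, and the model assumes the waiter starts
blocking while the first slave is being configured.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model only.  Returns how many slaves are still unconfigured at the
 * moment a waiter blocked on "!config_busy" is first woken.
 *
 * toggle_per_slave = true  models the flag being set and cleared around
 *                          each slave;
 * toggle_per_slave = false models a flag held for the whole run.
 */
int toy_pending_at_wakeup(int n_slaves, bool toggle_per_slave)
{
    bool config_busy = false;
    int woke_with_pending = -1;   /* -1: the waiter has not been woken yet */
    int i;

    assert(n_slaves > 0);

    if (!toggle_per_slave)
        config_busy = true;       /* set once, before the first slave */

    for (i = 0; i < n_slaves; i++) {
        if (toggle_per_slave)
            config_busy = true;   /* set before each slave */

        /* ... slave i would be brought back to PREOP here ... */

        if (toggle_per_slave)
            config_busy = false;  /* cleared, waiters woken, after each slave */

        /* The waiter returns at the first moment the flag is clear. */
        if (!config_busy && woke_with_pending < 0)
            woke_with_pending = n_slaves - (i + 1);
    }

    if (!toggle_per_slave) {
        config_busy = false;      /* cleared once, after the last slave */
        if (woke_with_pending < 0)
            woke_with_pending = 0;
    }
    return woke_with_pending;
}
```

With four slaves, the per-slave toggle wakes the waiter with three slaves
still pending, whereas holding the flag for the whole run wakes it with none
pending.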

I'm happy to look at writing some patches to resolve this behaviour, but
before I do that it seemed like a good idea to ask which behaviour is more
correct (in the view of the community):

1. Everything is working as expected (no patches are required), and it's the
application's responsibility to wait for the slaves to return to PREOP
before using ecrt_master_sdo_{down,up}load.

2. ecrt_request_master() should block until all slaves finish returning to
PREOP, not just whichever one slave happens to be in progress at the time.
(Sub-decision: should it be the open or the reserve that blocks?  Currently
it's only the latter.)

3. ecrt_master_deactivate() (and consequently ecrt_release_master() too)
should block until all slaves finish returning to PREOP.  (This won't help
with initial startup happening too early.)

4. Don't allow configuration to start while a request is still in progress,
but then do the configuration before starting the *next* request.  (This
won't help with ensuring it's in PREOP before requesting, but will prevent
the mailbox mixup and timeout.)

5. Something else that I did not think of.

(Note that where I say "return to PREOP" above, this also applies to the
initial change to PREOP if the application is started too soon after the
master module is loaded.)
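
For reference, the application-side wait that option 1 implies could look
roughly like the following C sketch.  To keep it compilable without the
master headers, the slave state is read through a callback; in a real
application I would expect that callback to wrap ecrt_master_get_slave() and
return the al_state and error_flag fields of ec_slave_info_t.  The names
wait_all_preop, slave_query_fn, fake_bus_t and fake_query are all made up
here, and treating any set error_flag as a hard failure is an assumption,
not established policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_AL_STATE_INIT  1   /* EtherCAT AL state code for INIT */
#define TOY_AL_STATE_PREOP 2   /* EtherCAT AL state code for PREOP */

/* Hypothetical callback: report AL state and error flag of one slave.
 * Returns 0 on success, nonzero if the slave could not be queried. */
typedef int (*slave_query_fn)(void *ctx, uint16_t position,
                              uint8_t *al_state, uint8_t *error_flag);

typedef enum {
    WAIT_OK = 0, WAIT_TIMEOUT = -1, WAIT_SLAVE_ERROR = -2
} wait_result_t;

/* Poll until every slave reports PREOP, or give up after max_polls passes.
 * idle (may be NULL) is called between passes, e.g. to sleep. */
wait_result_t wait_all_preop(slave_query_fn query, void *ctx,
                             uint16_t n_slaves, unsigned max_polls,
                             void (*idle)(void))
{
    unsigned poll;

    assert(query != NULL);
    for (poll = 0; poll < max_polls; poll++) {
        bool all_preop = true;
        uint16_t pos;

        for (pos = 0; pos < n_slaves; pos++) {
            uint8_t al_state = 0, error_flag = 0;

            if (query(ctx, pos, &al_state, &error_flag) != 0)
                return WAIT_SLAVE_ERROR;
            if (error_flag)
                return WAIT_SLAVE_ERROR;  /* e.g. stuck in INIT with error */
            if (al_state != TOY_AL_STATE_PREOP)
                all_preop = false;
        }
        if (all_preop)
            return WAIT_OK;
        if (idle != NULL)
            idle();
    }
    return WAIT_TIMEOUT;
}

/* Demo stand-in for a bus: all slaves report INIT until slave 0 has been
 * queried more than polls_until_preop times, then all report PREOP. */
typedef struct {
    unsigned polls_seen;
    unsigned polls_until_preop;
    uint8_t error_flag;
} fake_bus_t;

int fake_query(void *ctx, uint16_t position,
               uint8_t *al_state, uint8_t *error_flag)
{
    fake_bus_t *bus = ctx;

    if (position == 0)
        bus->polls_seen++;
    *al_state = (bus->polls_seen > bus->polls_until_preop)
        ? TOY_AL_STATE_PREOP : TOY_AL_STATE_INIT;
    *error_flag = bus->error_flag;
    return 0;
}
```

An application would call wait_all_preop() right after
ecrt_request_master() and before its first ecrt_master_sdo_upload(), with an
idle function that sleeps for a few milliseconds between passes.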

Thoughts?

(Hopefully this doesn't bias the responses too much, but I'm slightly
leaning towards #4, as this would uniformly apply to all types of requests
from all sources [command-line, blocking API, asynch API], and is likely to
be a step closer to structural improvement of the state machines.  It's a
little weaker in not assuring PREOP, but SDOs are *usually* readable in any
state, and the write-in-PREOP-only SDOs should be handled via
ecrt_slave_config_sdo* as noted above.  The main problem with this [and why
one of the other options might be better] is that it could still try [and
fail] to transfer while the slave is in INIT, in the case when the app is
started too soon after the master, so #1 or #2 may be needed anyway.)
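
As a rough illustration of the ordering that #4 describes, here is a toy
arbitration function in C.  It is a sketch under my own assumptions and not
a patch: toy_slave_t, toy_next_action() and toy_step() are invented names,
and each action is assumed to complete within a single cycle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of option 4; hypothetical names, not the master's FSM code. */
typedef enum {
    ACT_NONE, ACT_CONTINUE_REQUEST, ACT_CONFIGURE, ACT_START_REQUEST
} toy_action_t;

typedef struct {
    bool request_in_flight;  /* request sent, reply not yet handled */
    bool config_pending;     /* slave needs (re)configuration */
    int  queued_requests;    /* requests waiting to be started */
} toy_slave_t;

/* Decide what this slave's state machine may do in the current cycle. */
toy_action_t toy_next_action(const toy_slave_t *s)
{
    assert(s != NULL);
    if (s->request_in_flight)
        return ACT_CONTINUE_REQUEST;  /* never reconfigure mid-request */
    if (s->config_pending)
        return ACT_CONFIGURE;         /* but do so before the next request */
    if (s->queued_requests > 0)
        return ACT_START_REQUEST;
    return ACT_NONE;
}

/* Apply the chosen action, assuming it completes in one cycle. */
toy_action_t toy_step(toy_slave_t *s)
{
    toy_action_t act = toy_next_action(s);

    switch (act) {
    case ACT_CONTINUE_REQUEST:
        s->request_in_flight = false;
        break;
    case ACT_CONFIGURE:
        s->config_pending = false;
        break;
    case ACT_START_REQUEST:
        s->queued_requests--;
        s->request_in_flight = true;
        break;
    case ACT_NONE:
        break;
    }
    return act;
}
```

Starting from a slave with a request in flight, configuration pending and
one request queued, successive toy_step() calls finish the request, then
configure, then start and finish the queued request.  The mailbox is never
reset under an active request, but nothing here guarantees PREOP before the
first request starts.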

Regards,
Gavin Lambert