[etherlab-dev] ec_lock_* vs. ec_ioctl_lock in master/ioctl.c

Mon Mar 5 10:58:00 CET 2018

Hi Graeme

Need to get one thing straight first.

RTDM user-space is not the same as "plain" Linux user-space.
I know that you are 100% aware that, but your replies to me seems less
clear about it.

Your 0002-eoe-via-rtdm.patch does not seem usable for "plain" user-space
applications, so does not apply to the applications we support.

>> > These 4 functions are special.  The master should be written (and 
>> > mostly seems to be) in a fashion that between a ecrt_master_activate() 
>> > and ecrt_master_deactivate() the above calls do not require any locks 
>> > to synchronize with the master code, except for the EOE thread which 
>> > uses callbacks.
>> 
>> Well, mostly is the same as not in my world.  If they mostly do not require
>> any locks, they do require locks.
>
> When I say mostly, I mean the vanilla Etherlab version is ok, but the patch
> that added ec_ioctl_lock_up() and ec_ioctl_lock_down() broke this.  I do not
> use this patch.

Do you mean that the patches (and I guess you mean
0017-Master-locks-to-avoid-corrupted-datagram-queue.patch and
0018-Use-call-back-functions.patch) introduces the need for locking?

That does not sound right to me.  As I read them (and your writing in
this thread), the need for locking is there with and without these
patches.  Without them, locking needs to be handled at application
level.  With the patches, the master tries to handle the locking.

And that is a clearly a design change, and one which you clearly does
not agree with.

But if you really mean that these patches broke etherlabmaster (and not
"just" changed the design), would you please explain it a bit more
detailed?

>> As for the EOE thread, I might be overlooking something obvious.  But how
>> are you supposed to use callbacks when using the library API?
>> 
>> As far as I read the code, if you are using EoE and not RTDM, you will
>> always use the standard ec_master_internal_send_cb() and
>> ec_master_internal_receive_cb() callbacks.  See ec_ioctl_activate().
>
> If you do not set EoE callbacks (with ecrt_master_callbacks()) there will be
> NO callbacks once the master goes active.  (It will not use
> ec_master_internal_send_cb() and ec_master_internal_receive_cb().  It will not
> start an EoE processing thread.  You will get a "No EoE processing because of
> missing callbacks!" message in your system log.)

In ec_ioctl_activate() there is:

#ifndef EC_IOCTL_RTDM
    ecrt_master_callbacks(master, ec_master_internal_send_cb,
            ec_master_internal_receive_cb, master);
#endif

So at least for (non RTDM) user-space applications, that activate the
master with ioctl, EoE callbacks are always set to
ec_master_internal_send_cb() and ec_master_internal_receive_cb().

> If your application is in kernel space then you can specify callback functions
> and then use application level locking.  If your application is in user space,
> then you cannot specify callback functions as the kernel cannot call back into
> user space.  This is why I created my latest patch "0002-eoe-via-rtdm.patch".
> This patch lets you create your own EoE processing thread in your own
> application.  This of course also lets you use application level locking.

But only for RTDM.  The functions added are guarded by

#if !defined(__KERNEL__) && defined(EC_RTDM) && (EC_EOE)

which means that they do not apply when using non-RTDM user-space.

>> And with them, I don't see how you are supposed to do locking at the
>> application level.  I will not be able to specify application level locks
>> that are used by both application and eoe, unless they are implemented
>> directly in the master (kernel).
>> 
>> And with ec_master_eoe_thread() running inside master, you really need to do
>> locking to synchronize access to master->datagram_queue (i.e. the io_sem
>> lock).  If not, you will have race conditions between eoe and 3 of the 4
>> functions:
>
> The io_sem lock is only used by the master when it is idle.  Once you activate
> the master it is up to your application to provide the locking.

Ok.  But I dare to say that it is an open question if that is an design
for etherlabmaster.  I believe it has both pros and cons.

Pro: Application developers can implement synchronization as they find
most optimal for them.

Cons: Application developers need to figure out and implement
synchronization without much (any) help from the etherlabmaster code.
And getting synchronization is IMHO something that is often tricky to
get right (at least for many developers).

>>    ecrt_master_send()
>>    ecrt_master_receive()
>>    ecrt_domain_queue()
>> 
>> So we really do not to do locking in at least these 3 when using EOE.
>> 
>> Or maybe the whole master->datagram_queue should be refactored to something
>> that can be synchronized safely in a lock-free way?
>
> The point of the callback functions is so that you can make sure your
> application does not call any of ecrt_master_receive(), ecrt_master_send(),
> ecrt_master_send_ext(), ecrt_domain_process(), ecrt_domain_queue() for a given
> master at the same time.  If you don't use EoE you don't need the callbacks,
> but you still need to protect ecrt_master_receive(),
> ecrt_master_send(),ecrt_domain_process(), ecrt_domain_queue().

Which I personally see as a trivial omission in the etherlabmaster
design (requiring all API users to spend time and code writing the same
code for protect the API from itself).

>> > And the reason it uses callbacks is because only your application 
>> > knows if it is appropriate to allow the EOE thread to call
>> > ecrt_master_recieve() and ecrt_master_send_ext() etc at any particular 
>> > time.
>> 
>> I see what you mean.  But I just don't see how it is possible to accomplish
>> this for user space applications.
>> 
>
> Exactly why I create my patch "0002-eoe-via-rtdm.patch". 

And again, which does not extend to non-RTDM user-space.

>> > However, if your application has multiple send/receive loops then they 
>> > need to be synchronized with each other (see next comment).  There are 
>> > a few more functions such as the distributed clock calls that are also 
>> > in the send/receive loops.  They also do not require ethercat level 
>> > locks as they should be safe to call between the activate and 
>> > deactivate.  They also do not need application level locking as they 
>> > should really only be called by one send/receive loop per master.  All 
>> > other ethercat function calls should not be in a hard realtime context.
>> 
>> Ok.  But you imply a struct rule for how to design applications doing
>> EtherCAT communication: "There can only be one send/receive loop per
>> master".
>> 
>> That might (is not in my case) accepted by all application developers.
>> I need to continue support for applications that use multiple send/receive
>> loops for multiple EtherCAT domains, each with different cycle time.
>> 
>> One solution could be to create a single send/receive loop per master, and
>> then let the multiple application loops communicate with that instead of
>> directly with the master.  That might actually be a really good idea.  But,
>> if that is done, I think it should fit very well as a general feature of the
>> EtherCAT master.  It should be possible to implement without any application
>> specific code.
>
> That's not what I said.  You can have multiple send/receive loops per master,
> but it is up to your application to make sure they don't call any of the ecrt
> functions I listed at the top at the same time (per master).  If your
> application is a single process with multiple threads you can use an
> application semaphore (for example), but if it is multiple processes you will
> need a named semaphore (or similar) that all of the processes share.

For the single process, multiple threads application, it is not that
bad.  It is requring all such application developers to do the same over
and over again, wasting time and introducing the same bugs again and
again.

The multiple process world can be different though.  You basically end
up with a new EtherCAT API.  A combination of the etherlabmaster API and
a custom named semaphore API.  Without this, applications will not work
properly together.  Why not include such a feature directly in
etherlabmaster?  Without it, I think we are making the user-space
applications (non-RTDM) into a second-class citizen.

It might not matter to you, but as this how we use etherlabmaster, it
matters to me.

>> The final "problem" is as you say a design question.  Should etherlabmaster
>> support synchronization of access from multiple threads/processes, or should
>> this be left entirely to the application developer.  But if
>> master->datagram_queue is changed to something lock-free, this "problem" 
>> master->will
>> likely be solved also.
>> 
>> >> > However, if they are being called from multiple processes (rather 
>> >> > than
>> >> > threads) then you need to use something like a named semaphore so 
>> >> > that all processes share the same lock.  Of course if you are using 
>> >> > callbacks (for EoE) you are probably doing that anyway.
>> >> 
>> >> You easily have We have multiple processes.  Even with just a single 
>> >> application, you have EtherCAT-OP, EtherCAT-EoE and the application.
>> >> All using some of the same shared data structures.  Throwing more 
>> >> multiple application just adds to that, but I think it is critical 
>> >> enough with just a single application.
>> >
>> > Even if you have multiple processes (rather than threads), it is your 
>> > design decision as to which process takes priority and whether a 
>> > particular send/receive loop should wait a bit even if it could get 
>> > the lock now.  You can only do that if your application controls the
>> > locking of these functions.
>> 
>> While that sounds reasonable, I believe it is not entirely true.  You could
>> write a layer responsible for handling this, which gets information from the
>> application(s) for how to deal with this.  Think of it like a kind of
>> scheduler.
>> 
>> Example from an application I have seen.  You have two applications
>> (processes), each driving it's own EtherCAT domain.  One application/domain
>> is running every 2 ms, the other application/domain every 10 ms.  In
>> addition, EoE slaves are also used.  For each application, you could specify
>> the priority, the cycle time and the start time.  EoE should also be given a
>> priority.  This EtherCAT "scheduler" would then be able to decide which
>> EtherCAT communication to do and when.
>> 
>> So while I agree that it might be most obvious to implement the decission of
>> when to send/receive directly in the application, I think it could also be
>> implemented in a more generic way.
>
> You don't need the EtherCAT master to do that scheduling.  That is already
> available in Linux / RTAI or whatever you use.

What do you mean?

> But once one of the above ecrt function calls is in progress you need
> to ensure no other process / thread will call one at the same time.

Yes, that is your preferred design (that **I** need to ensure that).  I
prefer a design where the master will do that for me (actually the
application developers using our platform).

>> >> > They should probably also be turned off for RTAI / Xenomi in 
>> >> > general and as I said above use application level locking.
>> >> >
>> >> > You can pass --enable-rtdm when compiling to enable RTDM (and 
>> >> > compile rtdm-ioctl.o).
>> >> 
>> >> Passing --enable-rtdm to ./configure will define EC_RTDM macro and 
>> >> enable the automake conditional ENABLE_RTDM.
>> >> 
>> >> This will trigger Kbuild to build rtdm-ioctl.o (from rtdm-ioctl.c), 
>> >> and this will be done with EC_IOCTL_RTDM macro defined.
>> >> 
>> >> But ioctl.c will be compiled as always, and that is without 
>> >> EC_IOCTL_RTDM defined.  Was it supposed to be defined?  If so, it 
>> >> should be easy to fix, but someone should definitely do some real testing
>> >> of it.
>> >
>> > I think the locks should be disabled in both "rtdm-ioctl.c" and 
>> > "ioctl.c" if using RTAI, but I use RTDM so haven't confirmed this.
>> 
>> I don't see how it could ever be disabled for ioctl.c with the current code.
>
> By not using the patch that added them.

Of-course.  But back to the same discussion again.

>> > Further to that I don't think they should be there at all.  Simple 
>> > applications have one send/receive loop so don't need locks.
>> > Applications with multiple send/receive loops or EOE need to control 
>> > their own locking for optimal results anyway.
>> 
>> Again, I don't fully agree with you on that.
>> 
>> But more importantly, it is not possible for user-space applications in
>> combination with EoE, due to inability to set application callbacks.
>> 
>> > Also, having these lock functions use master_sem the send/receive 
>> > functions block unnecessarily with non-realtime ethercat function 
>> > calls.  At a minimum they should be locking on their own semaphore.
>> 
>> True.  I have patches for fixing that.  The master_sem is definitely a no-go
>> for real-time.
>
> So, I believe the following two patches are the problem:
> - base/0017-Master-locks-to-avoid-corrupted-datagram-queue.patch
> - base/0018-Use-call-back-functions.patch

Well, that depends on how you look at it.  Removing those patches
removes the problem from etherlabmaster.  But then I just have to do
exactly the same thing on top of etherlabmaster, and I will be back to
square one.

> My new EoE patch rolls back the changes from
> "base/0018-Use-call-back-functions.patch" and I use EC_IOCTL_RTDM so ignore
> the changes in
> "base/0017-Master-locks-to-avoid-corrupted-datagram-queue.patch".

Ok.  I think you could have been a bit more explicit about that
roll-back.

/Esben