[etherlab-users] etherlab dc sync check

Thu Jan 2 17:15:45 CET 2014

Hi,

I've prepared a patch for the calculation of the system time offset,
based on the EtherLab master stable version 1.5.2 (2526:2eff7c993a63).
Could you try the patch to see if it makes your DC sync calibration
faster?
Thanks.

Jun

On Wed, Jan 1, 2014 at 10:19 PM, Jun Yuan <j.yuan at rtleaders.com> wrote:
> Hi Raz,
>
> many people have raised the same kind of question as you did. Some of
> them asked on the mailing list, some wrote to me directly, worried
> about warnings such as a slave not syncing after 5 seconds. For the
> past two years I kept answering that I didn't know much about the DC
> sync mechanism; that by examining register 0x092c it can be confirmed
> that the DCs get perfectly synchronized in the end anyway; that my
> customers could get used to my rule that they must wait several
> minutes, doing nothing, until the DCs on the EtherCAT bus have
> synchronized/converged; and that maybe it is the slaves' fault that
> their DCs converge so slowly.
>
> Frankly speaking, I hate my answers; they sound like excuses. So I
> decided to fight back, and spent the last two days digging into this
> problem.
>
> The first thing to do was to learn how the DC sync mechanism works. I
> don't have any official EtherCAT documents, and would appreciate it if
> anyone could send me some of the EtherCAT specifications. On the
> internet I did find a paper, "On the Accuracy of the Distributed Clock
> Mechanism in EtherCAT", and a PPT, "Accurate Synchronization of
> EtherCAT Systems Using Distributed Clocks" by Joseph E Stubbs. Those
> two files helped me a lot.
>
> The other obstacle is that I don't have any EtherCAT slave devices at
> hand. Occasionally I receive a project to develop an interface for a
> new sort of slave using the EtherLab master. Those slaves usually stay
> with me for about two to three weeks, and after that they are shipped
> with my software to our customers. The chance of having a slave in my
> office is 1/12, not to mention the deadline pressure from those
> projects. I remember I still owe Florian an apology: he once asked me
> to test a new feature of the master, but I haven't given him a reply
> since, because I kept waiting for the next opportunity to have a
> slave, and it didn't come. So I lack a testing environment, which may
> make my view of EtherCAT quite narrow, and I can't test my ideas
> myself.
>
> Alright, here is something I would like to share.
>
> I. The problem with "No app_time received up to now, but master already active."
> I have always had this warning if I don't call
> ecrt_master_application_time() before my realtime cycle loop. I've
> also tried passing a garbage value to a first call of this function
> outside my loop, and it didn't hurt my system at all. I described this
> in my earlier mails to the mailing list, and Florian's reply was that
> I shouldn't do that. Well, he is right: in the first call, the
> app_time is saved as app_start_time and is then used to calculate the
> "remainder" correction to the DC start time. By calling
> ecrt_master_application_time() prior to the cycle loop, we give the
> slave a wrong starting point for DC cyclic operation. I think the net
> effect is similar to playing with sync0->shift_time, that is, setting
> a shift time for the DC sync0. Although this won't hurt us most of the
> time, it is not the right way to do it.
>
> Where does this warning come from?
> When a master application is running, there are two threads in the
> system. One is the user realtime cycle loop, the other is the
> EtherCAT-OP thread. These two threads, however, are not synchronized
> with each other.
>
> After ecrt_master_activate() is called, the master runs
> ec_master_operation_thread, which repeatedly executes the master FSM
> (finite state machine). The cycle time of the EtherCAT-OP thread on my
> machine is 4 ms, as my Linux kernel runs at 250 Hz. The function
> ec_fsm_master_enter_write_system_times will be called after several
> ms, somewhere around 4 to 8 ms, I guess.
>
> If ecrt_master_application_time() is not called within that time, the
> master has no app_time when it needs one, and the "No app_time"
> warning occurs.
>
> In my case, my realtime thread happens to have a cycle time of 4 ms,
> and my loop looks like this:
>
> // first do some initialization work, which costs 10 ms
> while (1) {
>     wait_for_4_ms();
>     master_receive();
>     ...
>     master_application_time();
>     master_send();
> }
>
> This means that after ecrt_master_activate(), at least 14 ms pass
> before the first master_application_time() in my loop gets called. The
> chance of getting a "No app_time" warning is therefore quite high.
>
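The timing argument above can be checked with a few lines of C. This is only a sketch: first_app_time_ms() and expect_no_app_time_warning() are hypothetical helper names, and the FSM deadline passed in is the guessed 4 to 8 ms from the text, not a measured value.

```c
#include <assert.h>
#include <stdbool.h>

/* Time (ms after ecrt_master_activate()) at which the loop shown above
 * first reaches master_application_time(): the initialization time plus
 * one full wait at the top of the loop. */
unsigned first_app_time_ms(unsigned init_ms, unsigned cycle_ms)
{
    assert(cycle_ms > 0);   /* a cyclic loop needs a nonzero period */
    return init_ms + cycle_ms;
}

/* True if the warning is to be expected, i.e. the FSM reaches
 * ec_fsm_master_enter_write_system_times() before any app_time exists.
 * fsm_deadline_ms is an assumed value, not taken from the master. */
bool expect_no_app_time_warning(unsigned init_ms, unsigned cycle_ms,
                                unsigned fsm_deadline_ms)
{
    return first_app_time_ms(init_ms, cycle_ms) > fsm_deadline_ms;
}
```

With 10 ms of initialization and a 4 ms cycle the first app_time arrives after 14 ms, which is later than even the 8 ms guess.
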
> To resolve this problem properly, I can offer two options:
>
> The first option is to change your code: Reduce the initialization
> time, making the time interval between master_activate() and your
> cycle loop as small as possible.
>
> But what if we have a large cycle time, say 16 ms? Our cycle loop will
> wait 16 ms anyway before the first master_application_time() gets
> called, which could be too late for the EtherCAT-OP thread. So my
> second option is to change the code of the EtherCAT master. The
> simplest way to do so is to add a "return;" after the line
>             EC_MASTER_WARN(master, "No app_time received up to now,"
>                     " but master already active.\n");
> in master/fsm_master.c. This would force the master FSM to wait until
> it has got an app_time.
>
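The intended effect of the extra "return;" can be shown with a toy model. The struct and function below are stand-ins invented for illustration, not the real ec_master_t or FSM code, and the sketch assumes the state is entered again on a later FSM cycle, which is not verified here against the real source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the few master fields this state looks at.
 * These are NOT the real ec_master_t / ec_fsm_master_t definitions. */
struct toy_master {
    bool has_app_time;     /* set once the application has passed a time */
    bool active;           /* set after activation */
    bool offsets_written;  /* set when the time offsets were handled */
};

/* Models the proposed behaviour: without an app_time the state returns
 * early, so the offset step is retried later instead of being skipped.
 * Returns true if the state ran to completion. */
bool toy_enter_write_system_times(struct toy_master *m)
{
    assert(m != NULL);
    if (!m->has_app_time) {
        /* the real code prints the "No app_time received" warning here */
        return false;            /* the proposed "return;": wait, retry */
    }
    m->offsets_written = true;   /* placeholder for the offset datagrams */
    return true;
}
```

Calling it before and after has_app_time is set shows the offsets being written only on the second attempt.
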
> Note that I have no way to test this myself. So please change your
> EtherLab master code, try it on your system, and let everybody know
> whether it works.
>
>
> II. The problem with "Slave did not sync after 5000 ms"
> This is a little more complicated. In short, IMHO, it is the master
> that should take responsibility for this problem.
>
> DC sync has 3 phases.
> Phase 1. Measure the transmission delay t_delay to each slave.
> Phase 2. Calculate the system time offset t_offset for each slave.
> Phase 3. Drift compensation, where each slave adjusts its local DC so
> that dt = (t_local + t_offset - t_delay) - t_received_system_time
> goes to 0.
>
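The phase 3 quantity dt can be written as a small C function. The function name and the choice of nanoseconds as the unit are assumptions made for this sketch; only the formula itself comes from the list above.

```c
#include <stdint.h>

/* dt = (t_local + t_offset - t_delay) - t_received_system_time, in ns.
 * t_offset is signed because the local clock may be ahead of or behind
 * the system time. */
int64_t dc_drift_error_ns(uint64_t t_local, int64_t t_offset,
                          uint64_t t_delay, uint64_t t_received_system_time)
{
    int64_t slave_system_time =
        (int64_t)t_local + t_offset - (int64_t)t_delay;
    return slave_system_time - (int64_t)t_received_system_time;
}
```

A wrong t_offset from phase 2 shows up directly as a nonzero dt that the drift compensation then has to remove.
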
> The first phase is executed during bus scanning in the call chain
> ec_fsm_master_state_scan_slave() -> ec_master_calc_dc() ->
> ec_master_calc_transmission_delays() -> ec_slave_calc_port_delays().
> It seems that the EtherLab master measures this only once. One could
> argue that measuring the transmission delay several times and taking
> the average would give a better estimate. So far, my experience is
> that these values don't vary much, and the EtherLab master seems to be
> doing well here. But I would appreciate it if anyone could run a "bus
> rescan" many times on the same EtherCAT bus and check whether the
> delay_to_next_dc of the slaves changes much from scan to scan. If it
> does, the EtherLab master source must be changed to take several
> measurements instead of only one.
>
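For anyone running the rescan experiment, the comparison could be done with helpers like these. It is a sketch only: the function names are made up, and the delay values are assumed to have been collected by hand as nanoseconds, one per bus scan, for a single slave.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Spread (max - min) in ns of one slave's delay over n bus scans. */
uint32_t delay_spread_ns(const uint32_t *delay_ns, size_t n)
{
    assert(delay_ns != NULL && n > 0);
    uint32_t min = delay_ns[0], max = delay_ns[0];
    for (size_t i = 1; i < n; i++) {
        if (delay_ns[i] < min) min = delay_ns[i];
        if (delay_ns[i] > max) max = delay_ns[i];
    }
    return max - min;
}

/* Mean delay in ns (rounded down), the estimate one would use if a
 * single measurement turns out to be too noisy. */
uint32_t delay_mean_ns(const uint32_t *delay_ns, size_t n)
{
    assert(delay_ns != NULL && n > 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += delay_ns[i];
    return (uint32_t)(sum / n);
}
```

A small spread relative to the cycle time would support keeping the single measurement; a large one would argue for averaging.
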
> At the beginning of 2013 I encountered a phenomenon which I described
> in earlier emails; I tried to correct it but failed in the end. What I
> observed a year ago was that, after all the DCs on the bus had reached
> a stable state, a restart of the master application caused a wrong
> change of approx. 4 ms to the system_time_offset of the ref clock, and
> later ec_fsm_slave_config_state_dc_sync_check() on the ref slave
> showed an error of around 4 ms between the master clock and the slave
> clock at the beginning. This certainly demonstrates a weakness of the
> current EtherLab master in the second phase: the calculation of
> t_offset is not right.
>
> Since the master gives the slaves a wrong t_offset, the difference
> dt = (t_local + t_offset - t_delay) - t_received_system_time that the
> drift compensation starts from is too large. In my humble opinion, the
> EtherLab master may be abusing the drift compensation mechanism to
> make up for its failure to calculate the system time offset t_offset
> accurately.
>
> What is the matter with the time offset?
> Let's have a look at the procedure of the time offset calculation:
> 1. The master FSM prepares an ec_datagram_fprd(fsm->datagram,
> fsm->slave->station_address, 0x0910, 24) to read out the system time
> of the slave.
> 2. The user realtime cycle loop sends out the datagram when calling
> ecrt_master_send.
> 3. The next ecrt_master_receive fetches the answer.
> 4. The master FSM reads the datagram and calculates the time offset.
>
> As an example, take a master FSM EtherCAT-OP thread running in a loop
> of 4 ms and a user realtime application thread running at 1 ms. Let
> the time at which Step 1 happens be x ms, and let the user loop run
> 0.5 ms after the EtherCAT-OP thread.
>
> The following would happen:
> Time     : Event
> x     ms : Step 1, FSM prepares an FPRD datagram to 0x0910
> x+0.5 ms : Step 2, user loop sets a new app_time; the FPRD datagram
>            gets sent out, the sending timestamp (jiffies) is stored
>            in datagram->jiffies_sent
> x+1.5 ms : Step 3, user loop sets a new app_time; the datagram is
>            received, the receiving timestamp (jiffies) is stored in
>            datagram->jiffies_received
> x+2.5 ms : user loop sets a new app_time
> x+3.5 ms : user loop sets a new app_time
> x+4   ms : Step 4, FSM calculates the time offset
>
> And here is the source code in ec_fsm_master_dc_offset64()
>
>     // correct read system time by elapsed time since read operation
>     correction = (u64) (jiffies_since_read * 1000 / HZ) * 1000000;
>     system_time += correction;
>     time_diff = fsm->slave->master->app_time - system_time;
>
> jiffies is a counter in the Linux kernel which is incremented by 1 at
> a frequency defined by HZ. I have a 250 Hz Linux system, so 1 jiffy
> means 4 ms. jiffies_sent was taken when the master clock was at
> x+0.5 ms, and the current jiffies value is taken at x+4 ms. So there
> is a probability of 0.5/4 = 12.5% that jiffies does not increase
> during those 3.5 ms, and a probability of 87.5% that it increases by
> 1. This means "correction" has a typical value of 4000000 ns, and is
> occasionally 0 ns.
>
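The granularity of the correction is easy to reproduce outside the kernel. The sketch below uses the same arithmetic as the "correction" line quoted above; the function name is made up, and HZ is passed in as a parameter (an assumption of this sketch) so that different kernel configurations can be compared.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the quoted "correction" line, in user space:
 * whole jiffies are converted to ms first, then to ns, so the result
 * can only move in steps of one jiffy. */
uint64_t jiffies_correction_ns(unsigned long jiffies_since_read,
                               unsigned long hz)
{
    assert(hz > 0);
    return (uint64_t)(jiffies_since_read * 1000 / hz) * 1000000;
}
```

At 250 Hz the only possible results for an interval of under two ticks are 0 ns and 4000000 ns, which is why the correction cannot represent the 3.5 ms that actually elapsed.
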
> Let's assume that the slave DC has been perfectly synchronized with
> the master app time. Then the system_time read from the slave equals
> x+0.5 ms (the time the FPRD datagram was sent). With the correction
> added, system_time = x+4.5 ms or x+0.5 ms.
>
> The app_time is x+3.5 ms at the time of Step 4.
>
> time_diff = app_time - system_time = -1000000 ns most of the time, and
> around 3000000 ns occasionally, depending on the correction.
>
> See, the time_diff should actually be 0, not -1 ms or 3 ms, since, as
> we said, the slave DC is perfectly synchronized with the master app
> time.
>
> You may argue that an error of -1 ms isn't that much, but this error
> will typically grow to around -4 ms if the user realtime cycle loop
> runs every 4 ms, as in my case one year ago.
>
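The worked example can be replayed with the timeline's numbers. This sketch follows the three quoted source lines; the function name and the plain nanosecond parameters are assumptions made for illustration.

```c
#include <stdint.h>

/* time_diff as in the quoted source: app_time minus the slave's system
 * time after the jiffies-based correction was added. All values in ns. */
int64_t time_diff_ns(uint64_t app_time, uint64_t slave_system_time,
                     uint64_t correction)
{
    return (int64_t)app_time - (int64_t)(slave_system_time + correction);
}
```

With a slave time read at x+0.5 ms, an app_time of x+3.5 ms and a correction of 4000000 ns, this returns -1000000 ns even though the clocks were assumed to be perfectly in sync; the zero-correction case can be checked the same way.
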
> Where does the error in the calculation come from?
> Two reasons:
> 1. jiffies has a coarse resolution of 4 ms on a 250 Hz Linux system.
> 2. app_time is not the time at which Step 4 is executed.
>
> While using get_cycles() instead of jiffies could improve the accuracy
> of the correction, the fact that app_time is not the current master
> system time would still drag errors into the time offset.
>
> Why do we need "correction" here at all? Because the app_time in Step
> 4 is not the app_time at which the slave system time was read.
>
> The key is to have the correct app_time at which the FPRD datagram to
> 0x0910 was sent, and to use that app_time to calculate the time_diff,
> without any correction, of course.
>
> I know, it is easier said than done. Right now I have two ideas for
> the master.
> The first idea: add a new variable app_time_sent to the ec_datagram_t
> struct and record the app_time when each datagram gets sent. Then
> time_diff = datagram->app_time_sent - system_time(0x0910);
>
> The second idea is a little tricky: trigger the calculation from the
> user realtime cycle loop, i.e. check the fsm_datagram in
> ecrt_master_receive(), or even in ecrt_master_application_time(),
> while the last app_time is still there. If it turns out to be an FPRD
> 0x0910 datagram, we do the calculation right away using the old
> app_time.
>
> I think the first idea would be easier to implement.
>
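The first idea could look roughly like this. It is a toy model, not a patch: struct toy_datagram and both functions are invented for illustration, and the point at which the real send path would latch the app_time is an assumption.

```c
#include <stdint.h>

/* Toy datagram with only the fields needed here. The real
 * ec_datagram_t is not reproduced; app_time_sent is the proposed
 * new member. */
struct toy_datagram {
    uint64_t app_time_sent;  /* app_time latched when the frame is sent */
    uint64_t system_time;    /* slave system time read from 0x0910 */
};

/* Would be called from the send path (hypothetical hook). */
void toy_datagram_mark_sent(struct toy_datagram *d, uint64_t app_time)
{
    d->app_time_sent = app_time;
}

/* Offset error computed without any jiffies-based correction. */
int64_t toy_time_diff_ns(const struct toy_datagram *d)
{
    return (int64_t)d->app_time_sent - (int64_t)d->system_time;
}
```

Because both timestamps refer to the same moment, a perfectly synchronized slave gives a difference of 0 regardless of when the FSM gets around to evaluating the datagram.
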
>
> Besides the inaccurate calculation of the time offset, the other issue
> in the EtherLab master that bothers me is that drift compensation
> seems to be running at the same time as the new system time offset is
> calculated and sent to the slaves, since the drift compensation is in
> the user realtime cycle loop and the t_offset calculation is in the
> EtherCAT-OP thread. Shouldn't the offset calculation be done first,
> before sending ref_sync_datagram to the ref clock and sync_datagram to
> the other slaves? Won't the drift compensation algorithm of a slave
> affect its local DC time (by slowing down or speeding up the clock),
> which in turn affects the t_offset calculation? Since phases 2 and 3
> happen simultaneously, won't a sudden change of t_offset (which causes
> a sudden change of dt) disturb the drift compensation algorithm on the
> slave?
>
> I think we may need a boolean, set by the FSM, to tell the user thread
> whether phase 2 is done; the user thread would then only call
> ecrt_master_sync_reference_clock(master) and
> ecrt_master_sync_slave_clocks(master) once the correct system time
> offset has been sent to each slave.
>
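The gating flag could be sketched as follows. This is a user-space toy using C11 atomics, which kernel code would not use as-is; the flag and both function names are hypothetical, and where exactly the FSM would set the flag is left open.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag: set by the FSM side once the system time offsets
 * have been written to every slave (end of phase 2). A static atomic
 * object is zero-initialized, so it starts out false. */
static atomic_bool dc_offsets_done;

/* Would be called by the FSM when phase 2 has finished. */
void toy_fsm_offsets_written(void)
{
    atomic_store(&dc_offsets_done, true);
}

/* Called once per realtime cycle. Returns true when the caller should
 * go on to call ecrt_master_sync_reference_clock() and
 * ecrt_master_sync_slave_clocks(). */
bool toy_should_send_dc_sync(void)
{
    return atomic_load(&dc_offsets_done);
}
```

In the cycle loop the two sync calls would simply be wrapped in an if on toy_should_send_dc_sync(), so that drift compensation only starts after the offsets are in place.
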
>
>
> Sorry for writing such a long email; I hope I've made my thoughts
> clear. I could be wrong in many places, and I'd be very happy if
> somebody could change the EtherLab master code the way I described and
> test it for me.
>
>
> Wish all of you a Happy New Year!
>
> Jun
>
> On Mon, Dec 30, 2013 at 2:32 PM, Raz <raziebe at gmail.com> wrote:
>> Hey
>>
>> At the moment it takes a long time to calibrate the DC, approx. 5
>> seconds for each slave. I am setting up a system which is supposed to
>> control over 12 axes, and the calibration duration reaches a minute.
>>
>> Is it possible to reduce this time?
>>
>>
>> --
>> https://sites.google.com/site/ironspeedlinux/
>>
>
>
>
> --
> Jun Yuan
> [Pronunciation: Djün Üän]
>
> Robotics Technology Leaders GmbH
> Am Loferfeld 58, D-81249 München
> Tel: +49 89 189 0465 24
> Mobile: +49 176 2176 5238
> Fax: +49 89 189 0465 11
> mailto: j.yuan at rtleaders.com
>
> Umlaut rule in the Chinese romanization Pinyin: after the initials
> y, j, q, and x, u is pronounced as ü, e.g. yu => ü,  ju => dschü,
>  qu => tschü,  xu => schü.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: time_offset_2526.patch
Type: text/x-patch
Size: 5955 bytes
Desc: not available
URL: <http://lists.etherlab.org/pipermail/etherlab-users/attachments/20140102/cc532f19/attachment-0005.bin>