[etherlab-users] etherlab dc sync check

Wed Jan 1 22:19:48 CET 2014

Hi Raz,

many people have raised the same kind of question as you did. Some of
them asked on the mailing list, some wrote to me directly, worried
about warnings like "slave didn't sync after 5 seconds". For the past
two years I kept answering that I didn't know much about the DC sync
mechanism; that by examining register 0x092c one can confirm that the
DCs do get perfectly synchronized in the end anyway; that my customers
could get used to obeying my rule that they must wait several minutes,
doing nothing, until the DCs on the EtherCAT bus have
synchronized/converged; and that maybe it is the slaves' fault that
their DCs converge so slowly.

Frankly speaking, I hate my answers; they sound like excuses. So I
decided to fight back, and spent the last two days digging into this
problem.

The first thing to do was to learn how the DC sync mechanism works. I
don't have any official EtherCAT documents, and would appreciate it if
anyone could send me some of the EtherCAT specifications. On the
internet I did find a paper, "On the Accuracy of the Distributed Clock
Mechanism in EtherCAT", and a presentation, "Accurate Synchronization
of EtherCAT Systems Using Distributed Clocks" by Joseph E Stubbs.
Those two files helped me a lot.

The other obstacle is that I don't have any EtherCAT slave devices at
hand. Occasionally I receive a project to develop an interface for a
new sort of slave using the EtherLab master. Those slaves usually stay
with me for about two to three weeks, and after that they are shipped
with my software to our customers. The chance of having a slave in my
office is 1/12, not to mention the deadline pressure from those
projects. I remember I still owe Florian an apology: he once asked me
to test a new feature of the master, but I never gave him a reply,
because I was waiting for a slave, expecting that the next
opportunity would come soon, and it didn't. So I lack a testing
environment, which could make my view of EtherCAT quite narrow, and I
can't test my ideas myself.

Alright, here is something I would like to share.

I. The problem with "No app_time received up to now, but master already active."
I always get this warning if I don't call
ecrt_master_application_time() before my realtime cycle loop. I've
also tried passing a garbage value in a first call to this function
outside my loop, and it didn't hurt my system at all. I reported this
in my earlier mails to the mailing list, and Florian's reply was that
I shouldn't do that. Well, he is right: in the first call, the
app_time is saved as app_start_time and is then used to calculate the
"remainder" correction to the DC start time. By calling
ecrt_master_application_time() prior to the cycle loop, we give the
slave a wrong starting point for DC cyclic operation. I think the end
effect is similar to playing with sync0->shift_time, that is, setting
a shift time for the DC sync0. Although this won't hurt us most of the
time, it is not the right way to do it.

Where does this warning come from?
When a master application is running, there are two threads in the
system. One is the user realtime cycle loop, the other is the
EtherCAT-OP thread. These two threads, however, are not synchronized
with each other.

After ecrt_master_activate() has been called, the master enters
ec_master_operation_thread, which repeatedly executes the FSM (finite
state machine) of the master. The cycle time of the EtherCAT-OP thread
on my machine is 4 ms, as my Linux kernel is running at 250 Hz. The
function ec_fsm_master_enter_write_system_times gets called after
several ms, which could be something around 4 to 8 ms, I guess.

If ecrt_master_application_time() is not called within that time, the
master has no app_time when it needs one, and the "No app_time"
warning occurs.

In my case, my realtime thread happens to have a cycle time of 4 ms,
and my loop looks like this:

// first some initialization work, which costs about 10 ms
while (1) {
    wait_for_4_ms();
    master_receive();
    ...
    master_application_time();
    master_send();
}

This means that after ecrt_master_activate(), at least 14 ms pass
before the first master_application_time() in my loop gets called. So
the chance of getting a "No app_time" warning is quite high.
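
The arithmetic behind this estimate can be sketched as follows. The
helper names are hypothetical and not part of the EtherLab master, and
the FSM deadline is only my rough guess of 4 to 8 ms, passed in as a
parameter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: time between ecrt_master_activate() and the first
 * ecrt_master_application_time() call, for a loop that initializes first
 * and then waits one full cycle before doing any work. */
static unsigned first_app_time_delay_ms(unsigned init_ms, unsigned cycle_ms)
{
    return init_ms + cycle_ms;
}

/* True if the first app_time would arrive after the master FSM has
 * already reached the step that needs it. fsm_deadline_ms is an
 * estimate, not a value taken from the master source. */
static bool app_time_too_late(unsigned init_ms, unsigned cycle_ms,
                              unsigned fsm_deadline_ms)
{
    assert(cycle_ms > 0);
    return first_app_time_delay_ms(init_ms, cycle_ms) > fsm_deadline_ms;
}
```

With 10 ms of initialization and a 4 ms cycle the first app_time comes
after 14 ms, which is later than a deadline of 8 ms. Without any
initialization the same 4 ms loop would make it in time.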

To resolve this problem properly, I can offer two options:

The first option is to change your code: Reduce the initialization
time, making the time interval between master_activate() and your
cycle loop as small as possible.

But what if we have a large cycle time, say 16 ms? Our cycle loop will
wait 16 ms anyway before the first master_application_time() gets
called, which could be too late for the EtherCAT-OP thread. So my
second option is to change the code of the EtherCAT master. The
simplest way to do so is to add a "return;" after the line
            EC_MASTER_WARN(master, "No app_time received up to now,"
                    " but master already active.\n");
in master/fsm_master.c. This would force the master FSM to wait until
it has got an app_time.

Note that I have no way of testing this myself. So please change your
EtherLab master code, try it out on your system, and report back to
the list whether it works.

II. The problem with "Slave did not sync after 5000 ms"
This is a little bit more complicated. In short, IMHO, it is the
master that is responsible for this problem.

DC sync has 3 phases.
Phase 1. Measure the transmission delay t_delay to each slave.
Phase 2. Calculate the system time offset t_offset for each slave.
Phase 3. Drift compensation, where each slave adjusts its local DC so
that dt = (t_local + t_offset - t_delay) - t_received_system_time
goes to 0.
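
The error term of phase 3 can be written down as a small function.
This is only a sketch of the formula above; the function name and the
choice of integer types are my assumptions, not code from the master
or from a slave controller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the error dt that drift compensation tries to
 * drive to zero. All values are in nanoseconds. */
static int64_t dc_drift_error_ns(uint64_t t_local, int64_t t_offset,
                                 uint32_t t_delay,
                                 uint64_t t_received_system_time)
{
    /* keep the signed arithmetic below free of overflow */
    assert(t_local <= (uint64_t)INT64_MAX / 2);
    assert(t_received_system_time <= (uint64_t)INT64_MAX / 2);

    /* the slave's own idea of the system time */
    int64_t slave_system_time =
        (int64_t)t_local + t_offset - (int64_t)t_delay;

    return slave_system_time - (int64_t)t_received_system_time;
}
```

If t_offset is right, dt starts out near 0 and only the clock drift is
left to compensate. If t_offset is wrong by 4 ms, dt starts out at
about 4000000 ns.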

The first phase is executed during bus scanning, in the call chain
ec_fsm_master_state_scan_slave() -> ec_master_calc_dc() ->
ec_master_calc_transmission_delays() -> ec_slave_calc_port_delays().
It seems that the EtherLab master measures this only once. One could
argue that measuring the transmission delay several times and taking
the average would give a better estimate. So far, my experience tells
me these values don't vary much, and the EtherLab master seems to be
doing fine. But I would appreciate it if anyone could do a "bus
rescan" many times on the same EtherCAT bus and check whether the
delay_to_next_dc of the slaves changes much from one bus scan to the
next. If it does, the EtherLab master source must be changed to take
several measurements instead of only one.
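
To judge such a series of rescans, the delay values noted down for one
slave could be summarized like this. It is a minimal sketch; the type
and function names are hypothetical, and how the samples are collected
is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of repeated delay measurements for one slave. */
typedef struct {
    uint32_t mean_ns;   /* average delay, rounded down */
    uint32_t spread_ns; /* largest sample minus smallest sample */
} delay_stats_t;

static delay_stats_t delay_stats(const uint32_t *samples_ns, size_t count)
{
    assert(samples_ns != NULL && count > 0);

    uint64_t sum = 0;
    uint32_t min = samples_ns[0];
    uint32_t max = samples_ns[0];

    for (size_t i = 0; i < count; i++) {
        sum += samples_ns[i];
        if (samples_ns[i] < min)
            min = samples_ns[i];
        if (samples_ns[i] > max)
            max = samples_ns[i];
    }

    delay_stats_t stats = { (uint32_t)(sum / count), max - min };
    return stats;
}
```

A small spread compared to the mean would support measuring only once;
a large spread would be an argument for averaging.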

At the beginning of 2013 I encountered a phenomenon which I described
in my earlier emails, and which I tried to correct but failed in the
end. What I observed a year ago was this: after the bus had reached a
stable state for all the DCs, a restart of the master application
caused a wrong change of approx. 4 ms to the system_time_offset of the
ref clock, and later the ec_fsm_slave_config_state_dc_sync_check() of
the ref slave showed an initial error of around 4 ms between the
master clock and the slave clock. This certainly demonstrates a
weakness of the current EtherLab master in the second phase: the
calculation of t_offset is not right.

Since the master gives the slaves a wrong t_offset, the difference
dt = (t_local + t_offset - t_delay) - t_received_system_time that the
drift compensation has to remove is too large at the beginning. In my
humble opinion, the EtherLab master may be abusing the drift
compensation mechanism to make up for its failure to calculate the
system time offset t_offset accurately.

What is the matter with the time offset?
Let's have a look at the procedure of the time offset calculation:
1. The master FSM prepares an ec_datagram_fprd(fsm->datagram,
fsm->slave->station_address, 0x0910, 24) to read out the system time
of the slave.
2. The user realtime cycle loop sends out the datagram when calling
ecrt_master_send.
3. The next ecrt_master_receive fetches the answer.
4. The master FSM reads the datagram and calculates the time offset.

As an example, take a master FSM (EtherCAT-OP thread) running in a
loop of 4 ms, and a user realtime application thread running at 1 ms.
Let x ms be the time at which Step 1 happens, and let the user loop
run 0.5 ms after the EtherCAT-OP thread.

The following would happen:
Time     : Event
x     ms : Step 1, FSM prepares an FPRD datagram to 0x0910.
x+0.5 ms : Step 2, user loop sets a new app_time; the FPRD datagram
           gets sent out, and the sending timestamp (jiffies) is
           stored in datagram->jiffies_sent.
x+1.5 ms : Step 3, user loop sets a new app_time; the datagram is
           received, and the receiving timestamp (jiffies) is stored
           in datagram->jiffies_received.
x+2.5 ms : user loop sets a new app_time.
x+3.5 ms : user loop sets a new app_time.
x+4   ms : Step 4, FSM calculates the time offset.

And here is the source code in ec_fsm_master_dc_offset64()

    // correct read system time by elapsed time since read operation
    correction = (u64) (jiffies_since_read * 1000 / HZ) * 1000000;
    system_time += correction;
    time_diff = fsm->slave->master->app_time - system_time;

jiffies is a counter in the Linux kernel that is incremented by 1 at
the frequency defined by HZ. I have a 250 Hz Linux system, so 1 jiffy
means 4 ms. jiffies_sent was taken when the master clock was at
x+0.5 ms, and the current jiffies value is taken at x+4 ms. So there
is a probability of 0.5/4 = 12.5% that jiffies does not increase
during those 3.5 ms, and a probability of 87.5% that it has increased
by 1. This means "correction" has a typical value of 4000000 ns and
is occasionally 0 ns.
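
These numbers can be checked outside the kernel with a small sketch.
The first function repeats the arithmetic of the correction line
quoted above, with HZ as a parameter; the second assumes that the
interval starts at a uniformly random phase of the jiffies tick. Both
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the correction in ec_fsm_master_dc_offset64(),
 * with hz passed in instead of the kernel's HZ constant. */
static uint64_t jiffies_correction_ns(unsigned long jiffies_since_read,
                                      unsigned long hz)
{
    assert(hz > 0);
    /* jiffies -> whole milliseconds (integer division), then ms -> ns */
    return (uint64_t)(jiffies_since_read * 1000 / hz) * 1000000;
}

/* Probability that no jiffies tick falls into an interval of the given
 * length (in microseconds), assuming a uniformly random start phase. */
static double prob_no_tick(unsigned interval_us, unsigned long hz)
{
    assert(hz > 0);
    double period_us = 1e6 / (double)hz;

    if ((double)interval_us >= period_us)
        return 0.0; /* at least one tick is guaranteed */
    return (period_us - (double)interval_us) / period_us;
}
```

For hz = 250 the correction can only take the values 0, 4000000,
8000000 ns and so on, so anything shorter than one jiffy is either
ignored or rounded up to a full 4 ms.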

Let's assume that the slave DC has been perfectly synchronized with
the master app time. Then the system_time read from the slave equals
x+0.5 ms (the time the FPRD datagram was sent). With the correction
added, system_time = x+4.5 ms or x+0.5 ms.

The app_time at the time of Step 4 is x+3.5 ms.

So time_diff = app_time - system_time = -1000000 ns most of the time,
and 3000000 ns occasionally, depending on the correction.

See, the time_diff should actually be 0, not -1 ms or 3 ms, since, as
we said, the slave DC is perfectly synchronized with the master app
time.
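
The two cases can be reproduced with a few lines of code. This is a
sketch of the arithmetic only, with x taken as 0 and a hypothetical
function name; the inputs are the app_time at Step 4 (x+3.5 ms), the
slave time at the read (x+0.5 ms) and the jiffies-based correction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: time_diff as computed in Step 4.
 * All arguments are in nanoseconds, relative to x. */
static int64_t time_diff_ns(uint64_t app_time_at_step4,
                            uint64_t slave_time_at_read,
                            uint64_t correction)
{
    /* keep the signed subtraction below free of overflow */
    assert(app_time_at_step4 <= (uint64_t)INT64_MAX);
    assert(slave_time_at_read <= (uint64_t)INT64_MAX / 2);
    assert(correction <= (uint64_t)INT64_MAX / 2);

    /* slave time, moved forward by the estimated elapsed time */
    uint64_t system_time = slave_time_at_read + correction;

    return (int64_t)app_time_at_step4 - (int64_t)system_time;
}
```

Calling time_diff_ns(3500000, 500000, 4000000) covers the typical case
of one jiffies tick, and time_diff_ns(3500000, 500000, 0) the case
without a tick. Neither returns 0, although the clocks were assumed to
be in sync.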

You may argue that an error of -1 ms isn't that much, but this error
typically grows to around -4 ms if the user realtime cycle loop runs
every 4 ms, as in my case one year ago.

Where does the error in the calculation come from?
Two reasons:
1. jiffies has a coarse resolution of 4 ms on a 250 Hz Linux system.
2. app_time is not the time at which Step 4 is executed.

Using get_cycles() instead of jiffies could improve the accuracy of
the correction, but the fact that app_time is not the current master
system time would still drag errors into the time offset.

Why do we need the "correction" here at all? Because the app_time in
Step 4 is not the app_time at which the slave system time was read.

The key is to know the app_time at which the FPRD datagram to 0x0910
was sent, and to use that app_time to calculate the time_diff, without
any correction of course.

I know, it is easier said than done. Right now I have two ideas for
the master.
The first idea: add a new variable app_time_sent to the ec_datagram_t
struct and write down the app_time whenever a datagram gets sent. Then
time_diff = datagram->app_time_sent - system_time(0x0910);

The second idea is a little bit tricky: let the user realtime cycle
loop trigger the calculation, i.e. we check the fsm_datagram in
ecrt_master_receive(), or even in ecrt_master_application_time(),
while the last app_time is still there. If we find that it is an FPRD
0x0910 datagram, we do the calculation right away using the old
app_time.

I think the first idea would be easier to implement.
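
A minimal sketch of the first idea, untested against the real master:
datagram_model_t is a hypothetical stand-in for ec_datagram_t that
contains only the proposed app_time_sent field, and both function
names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for ec_datagram_t. */
typedef struct {
    uint64_t app_time_sent; /* app_time that was current at send time */
} datagram_model_t;

/* Send path: remember which app_time was current for this datagram. */
static void datagram_mark_sent(datagram_model_t *d, uint64_t current_app_time)
{
    assert(d != NULL);
    d->app_time_sent = current_app_time;
}

/* Offset calculation without any jiffies-based correction.
 * system_time is the value read from register 0x0910. */
static int64_t time_diff_from_sent(const datagram_model_t *d,
                                   uint64_t system_time)
{
    assert(d != NULL);
    return (int64_t)d->app_time_sent - (int64_t)system_time;
}
```

With the numbers from the example (datagram sent at x+0.5 ms, slave
time x+0.5 ms) this gives a time_diff of 0, no matter how much later
the FSM evaluates the datagram. The sketch ignores the transit time of
the frame between sending and the read in the slave.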

Besides the inaccurate calculation of the time offset, the other
issue in the EtherLab master that bothers me is this: it seems to me
that the drift compensation is already running while the new system
time offset is being calculated and sent to the slaves, since the
drift compensation lives in the user realtime cycle loop and the
t_offset calculation in the EtherCAT-OP thread. Shouldn't the offset
calculation be finished first, before ref_sync_datagram is sent to the
ref clock and sync_datagram to the other slaves? Doesn't the drift
compensation algorithm of a slave affect its local DC time (by slowing
down or speeding up the clock), which in turn affects the t_offset
calculation? And since phases 2 and 3 happen simultaneously, won't the
sudden change of t_offset (which causes a sudden change of dt) disturb
the drift compensation algorithm on the slave?

I think we may need a boolean, set by the FSM, that tells the user
thread whether phase 2 is done. The user thread would then call
ecrt_master_sync_reference_clock(master) and
ecrt_master_sync_slave_clocks(master) only after the correct system
time offset has been sent to every slave.
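
As a rough model of this gating: the flag, the struct and the function
names below are hypothetical and do not exist in the EtherLab master,
and the counter merely stands in for the two sync calls. I use C11
atomics here because the flag is written by one thread and read by
another; kernel code would use its own primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the proposed gate. */
typedef struct {
    atomic_bool offsets_written; /* set by the FSM once phase 2 is done */
    unsigned sync_calls;         /* stands in for the two sync calls */
} dc_gate_t;

static void dc_gate_init(dc_gate_t *g)
{
    assert(g != NULL);
    atomic_init(&g->offsets_written, false);
    g->sync_calls = 0;
}

/* FSM side: called after the last slave has received its offset. */
static void fsm_offsets_done(dc_gate_t *g)
{
    assert(g != NULL);
    atomic_store(&g->offsets_written, true);
}

/* User cycle side: start drift compensation only after phase 2. */
static void user_cycle_sync(dc_gate_t *g)
{
    assert(g != NULL);
    if (atomic_load(&g->offsets_written)) {
        /* here the application would call
         * ecrt_master_sync_reference_clock() and
         * ecrt_master_sync_slave_clocks() */
        g->sync_calls++;
    }
}
```

Until fsm_offsets_done() has run, user_cycle_sync() does nothing, so
the slaves would not see any sync datagrams while their offsets are
still being written.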

Sorry to have written such a long email; I hope I've made my thoughts
clear. I could be wrong in many places, and I'll be very happy if
somebody could change the EtherLab master code in the way I described
and test it for me.

Wish all of you a Happy New Year!

Jun

On Mon, Dec 30, 2013 at 2:32 PM, Raz <raziebe at gmail.com> wrote:
> Hey
>
> At the moment it takes a long time to calibrate the dc. aprox 5 seconds
> for each slave.  I am setting up a system which is supposed to control
> over 12 axes and the calibration duration reaches a minute.
>
> Is it possible to reduce this time ?
>
>
> --
> https://sites.google.com/site/ironspeedlinux/

-- 
Jun Yuan
[Pronunciation: Djün Üän]

Robotics Technology Leaders GmbH
Am Loferfeld 58, D-81249 München
Tel: +49 89 189 0465 24
Mobile: +49 176 2176 5238
Fax: +49 89 189 0465 11
mailto: j.yuan at rtleaders.com

Umlaut rule in the Chinese Pinyin romanization: after the initials
y, j, q, and x, u is pronounced as ü, e.g. yu => ü,  ju => dschü,
 qu => tschü,  xu => schü.