[etherlab-users] etherlab dc sync check
Jun Yuan
j.yuan at rtleaders.com
Thu Jan 2 23:22:02 CET 2014
I'm glad to hear from someone who has a deep understanding of the
phase-locked loop (PLL) on the slave side :)
I agree that the convergence time should not depend on the number of
slaves. Maybe Raz can give us some log files showing the details of
the sync convergence on each slave.
If I understand it correctly,
the FPWR to 0x0910 is used to synchronize the ref clock with the
master system clock, and
the FRMW to 0x0910 is used to synchronize the other slave clocks with
the ref clock.
Yet the FPWR command doesn't have to be sent to the ref clock at the
same frequency as the FRMW; they may not always be in the same
EtherCAT frame.
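
In the EtherLab user API these two datagrams correspond to the
following calls in the application cycle (the ecrt_* functions are the
real API; the ordering shown is only schematic):

ecrt_master_application_time(master, app_time); // the master system time
ecrt_master_sync_reference_clock(master); // FPWR to 0x0910 of the ref clock
ecrt_master_sync_slave_clocks(master);    // FRMW copy of 0x0910 to the rest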
WRT the update rate, what do you think would be the minimum rate that
keeps the PLL filter stable? I suppose sending the FRMW every 4 ms
alone won't make the PLL unstable, will it?
I agree that the jitter from the master should not propagate through
the whole network, since the reference clock works there like a
wonderful filter. My point is: if the system time offset given to the
reference clock is not accurate, then when the FPWR arrives, the
reference clock will see a large difference between its local time
and the master time. The ref clock will then need a long time, using
its drift compensation mechanism, to adjust its local clock until that
difference converges to zero. I suppose the curve of the difference
during the adjustment would look like the step response of a
second-order system: overshoot followed by ringing. And since the
difference was large at the beginning, the ref clock would make large
adjustments, with a large overshoot and a long settling time.
And during this self-adjustment, before the ref clock finally settles
into a relatively stable state, the reference time in the FRMW
datagrams would look strange to the other slaves: they would see a ref
clock running very fast at one moment and very slow the next, due to
the ringing of the ref clock adjustment. Besides that, since their own
time offsets to the ref clock are not accurate either, they also get a
large difference to compensate. Well, I don't know how long it will
take for them to finally settle.
On Thu, Jan 2, 2014 at 5:37 PM, Jeroen Van den Keybus
<jeroen.vandenkeybus at gmail.com> wrote:
> I'll have a look at this too, but, at the moment, I find it strange that the
> convergence time (PLL lock time) would depend on the number of slaves.
>
> Normally, the master sends its 64-bit system time on a regular basis to the
> first slave in the chain using FPWR to register 0x0910. This slave's PLL
> compares the received time with its local time and adjusts its rate entirely
> on its own (not the offset, that's a one-time job for the master).
>
> In the same EtherCAT frame, the master also sends the FRMW command to the
> first slave to have that slave broadcast/copy its time from 0x0910 to all
> other slaves on 0x0910. They will therefore synchronize to slave number 1,
> just as slave number 1 synchronizes to the master (PC). This way, the jitter
> from the master (Ethernet frame does not leave the PC's NIC exactly on the
> time stored in the frame, but the frame leaving slave 1 does) does not
> propagate throughout the whole system. So I would not expect convergence
> time to scale with the number of slaves at all, but rather remain constant.
>
> WRT the update rate, you must update at a minimum rate or the PLL loop
> filter will be unstable. If you update faster, you can tighten the PLL loop
> parameters to reduce lock time, but the minimum update rate for stability
> will also increase.
>
> I have been working with my own implementation of a PLL, which
> requires 30 s to fully lock. I'll check with a 'regular' terminal to
> verify.
>
>
> J.
>
>
> 2014/1/2 Raz <raziebe at gmail.com>
>>
>> Hey Jun
>>
>> If you look deep into the documentation, you notice that they say
>> that DC sync is influenced by the number of packets sent. Here are my
>> benchmarks for a 6-slave system when powering it up:
>>
>> 1. 1 ms send_interval for op_thread with 4 ms transmit interval: 33 seconds
>> 2. 500 us send_interval for op_thread with 500 us transmit interval: 6 seconds
>> 3. 100 us send_interval for op_thread with 100 us transmit interval: 2.5 seconds
>>
>> I believe the reason is the accuracy, which is better at shorter
>> intervals.
>>
>> On Wed, Jan 1, 2014 at 11:19 PM, Jun Yuan <j.yuan at rtleaders.com> wrote:
>>>
>>> Hi Raz,
>>>
>>> Many people have raised the same kind of questions as you did. Some
>>> asked on the mailing list, some wrote to me directly, worrying about
>>> warnings like "slave didn't sync after 5 seconds". For the past two
>>> years I kept answering: that I didn't know much about the DC sync
>>> mechanism; that by examining the register 0x092c it can be confirmed
>>> that the DCs get perfectly synchronized in the end anyway; that my
>>> customers could get used to obeying my rule that they must wait
>>> several minutes doing nothing until the DCs on the EtherCAT bus are
>>> synchronized/converged; that maybe it is the slaves' fault that their
>>> DC convergence is so slow.
>>>
>>> Frankly speaking, I hate my answers; they sound like excuses. So I
>>> decided to fight back, and spent the last two days digging into this
>>> problem.
>>>
>>> The first thing to do was to learn how the DC sync mechanism works. I
>>> don't have any official EtherCAT documents, and I would appreciate it
>>> if anyone could send me some of the EtherCAT specifications. On the
>>> internet I did find the paper "On the Accuracy of the Distributed
>>> Clock Mechanism in EtherCAT" and the slides "Accurate Synchronization
>>> of EtherCAT Systems Using Distributed Clocks" by Joseph E. Stubbs.
>>> Those two files helped me a lot.
>>>
>>> The other obstacle is that I don't have any EtherCAT slave devices at
>>> hand. Occasionally I receive a project to develop an interface for a
>>> new sort of slave using the EtherLab master. Those slaves usually stay
>>> with me for about two to three weeks, and after that they are shipped
>>> with my software to our customers. The chance of having a slave in my
>>> office is 1/12, not to mention the deadline pressure from those
>>> projects. I remember I still owe Florian an apology: he once asked me
>>> to test a new feature of the master, but I haven't replied since,
>>> because I've been waiting for a slave, expecting the next opportunity
>>> to come soon, but it didn't happen. So I lack a testing environment,
>>> which narrows my view of EtherCAT, and I can't test my thoughts
>>> myself.
>>>
>>> Alright, here is something I would like to share.
>>>
>>> I. The problem with "No app_time received up to now, but master already
>>> active."
>>> I have always gotten this error when I don't call
>>> ecrt_master_application_time() before my realtime cycle loop. I've
>>> also tried giving a garbage value to the first call of this function
>>> outside my loop, and it didn't hurt my system at all. This phenomenon
>>> was recorded in my last mails to the mailing list, and the reply from
>>> Florian was that I shouldn't do that. Well, he is right, because in
>>> the first call the app_time is saved as app_start_time and then used
>>> to calculate the "remainder" correction to the DC start time. By
>>> calling ecrt_master_application_time() with a bogus value prior to the
>>> cycle loop, we give the slave a wrong starting point for DC cyclic
>>> operation. I think the end effect is something like playing with
>>> sync0->shift_time, i.e. setting a shift time for the DC sync0.
>>> Although this won't hurt us most of the time, it is not the right way
>>> to do it.
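>>>
>>> So the first call should carry the real start time of cyclic
>>> operation. A minimal sketch of what I mean (the TIMESPEC2NS macro is
>>> the pattern from the master's dc_user example, if I remember
>>> correctly; which clock to use depends on your setup):
>>>
>>> #define TIMESPEC2NS(T) ((uint64_t)(T).tv_sec * 1000000000ULL + (T).tv_nsec)
>>>
>>> struct timespec t;
>>> clock_gettime(CLOCK_REALTIME, &t);
>>> ecrt_master_application_time(master, TIMESPEC2NS(t)); // first call, inside the loop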
>>>
>>> Where does this warning come from?
>>> When a master application is running, there are two threads in the
>>> system: the user realtime cycle loop and the EtherCAT-OP thread.
>>> These two threads, however, are not synchronized with each other.
>>>
>>> After calling ecrt_master_activate(), the master goes into
>>> ec_master_operation_thread(), which repeatedly executes the FSM
>>> (finite state machine) of the master. The cycle time of the
>>> EtherCAT-OP thread on my machine is 4 ms, as my Linux kernel runs at
>>> 250 Hz. The function ec_fsm_master_enter_write_system_times() then
>>> gets called after several ms, which I guess could be somewhere around
>>> 4 to 8 ms.
>>>
>>> If ecrt_master_application_time() is not called within that time, the
>>> master fails to get an app_time in time, and the "No app_time" error
>>> occurs.
>>>
>>> In my case, my realtime thread happens to have a cycle time of 4 ms,
>>> and my loop looks like this:
>>>
>>> // first do some initialization work, which takes about 10 ms
>>> while (running) {
>>>     wait_for_4_ms();
>>>     ecrt_master_receive(master);
>>>     ...
>>>     ecrt_master_application_time(master, app_time);
>>>     ecrt_master_send(master);
>>> }
>>>
>>> This means that after ecrt_master_activate(), at least 14 ms pass
>>> before the first ecrt_master_application_time() in my loop gets
>>> called. The chance of getting the "No app_time" warning is therefore
>>> quite high for me.
>>>
>>> To resolve this problem properly, I can offer two options:
>>>
>>> The first option is to change your code: reduce the initialization
>>> time, making the interval between ecrt_master_activate() and your
>>> cycle loop as small as possible.
>>>
>>> But what if we have a large cycle time, say 16 ms? Our cycle loop will
>>> wait 16 ms anyway before the first ecrt_master_application_time() gets
>>> called, which could be too late for the EtherCAT-OP thread. So my
>>> second option is to change the code of the EtherCAT master. The
>>> simplest way I see is to add a "return;" after the lines
>>> EC_MASTER_WARN(master, "No app_time received up to now,"
>>> " but master already active.\n");
>>> in master/fsm_master.c. This forces the master FSM to wait until it
>>> has got an app_time.
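>>>
>>> A minimal sketch of that change (the surrounding context is from my
>>> memory of fsm_master.c and may differ slightly; only the warning text
>>> is verbatim):
>>>
>>> if (!master->has_app_time) {
>>>     EC_MASTER_WARN(master, "No app_time received up to now,"
>>>             " but master already active.\n");
>>>     return; // proposed: let the FSM retry in its next cycle
>>> }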
>>>
>>> Note that I don't have the possibility to test this myself. So please
>>> change your EtherLab master code, check it out on your system, and
>>> give everybody feedback on whether it works.
>>>
>>>
>>> II. The problem with "Slave did not sync after 5000 ms"
>>> This is a little more complicated. In short, IMHO, it is the master
>>> that should take responsibility for this problem.
>>>
>>> Concerning DC sync, there are 3 phases:
>>> Phase 1. Measure the transmission delay t_delay to each slave.
>>> Phase 2. Calculate the system time offset t_offset for each slave.
>>> Phase 3. Drift compensation, where each slave adjusts its local DC to
>>> drive dt = (t_local + t_offset - t_delay) - t_received_system_time
>>> to 0.
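>>>
>>> In pseudo-C, the quantity each slave drives to zero in phase 3 would
>>> be (the variable names are mine, not the ESC register names):
>>>
>>> // all times in ns; t_offset and t_delay were written by the master
>>> int64_t dt = ((int64_t)t_local + t_offset - t_delay)
>>>         - (int64_t)t_received_system_time;
>>> // dt > 0: the local clock runs ahead, so slow it down a little;
>>> // dt < 0: it lags behind, so speed it up a little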
>>>
>>> The first phase is executed during the bus scan, in
>>> ec_fsm_master_state_scan_slave() -> ec_master_calc_dc() ->
>>> ec_master_calc_transmission_delays() -> ec_slave_calc_port_delays().
>>> It seems that the EtherLab master measures this only once. One could
>>> argue that measuring the transmission delay several times and taking
>>> the average would give a better estimate. So far my experience tells
>>> me these values don't vary much, so the EtherLab master seems to be
>>> doing fine here. But I would appreciate it if anyone would do the
>>> "bus rescan" thing many times on the same EtherCAT bus and check
>>> whether the delay_to_next_dc of all the slaves changes much between
>>> scans. If it does, the EtherLab master source must be changed to take
>>> several measurements instead of only one.
>>>
>>> At the beginning of 2013 I encountered a phenomenon, described in my
>>> earlier emails, which I tried to correct but failed in the end. My
>>> observation one year ago was that, after the bus had reached a stable
>>> state for all the DCs, a restart of the master application would cause
>>> a wrong change of approx. 4 ms to the system_time_offset of the ref
>>> clock, and ec_fsm_slave_config_state_dc_sync_check() of the ref slave
>>> would later show an error of around 4 ms between the master clock and
>>> the slave clock at the beginning. This clearly demonstrates a weakness
>>> of the current EtherLab master in the second phase: the calculation of
>>> t_offset is not right.
>>>
>>> Since t_offset is given wrongly to the slaves by the master, the
>>> difference dt = (t_local + t_offset - t_delay) - t_received_system_time
>>> for the drift compensation starts out far too large. In my humble
>>> opinion, the EtherLab master may be abusing the drift compensation
>>> mechanism to compensate for its failure to calculate the system time
>>> offset t_offset accurately.
>>>
>>> What is the matter with the time offset?
>>> Let's have a look at the procedure of the time offset calculation:
>>> 1. The master FSM prepares an ec_datagram_fprd(fsm->datagram,
>>> fsm->slave->station_address, 0x0910, 24) to read out the system time
>>> of the slave.
>>> 2. The user realtime cycle loop sends out the datagram when calling
>>> ecrt_master_send().
>>> 3. The next ecrt_master_receive() fetches the answer.
>>> 4. The master FSM reads the datagram and calculates the time offset.
>>>
>>> As an example, take a master FSM EtherCAT-OP thread running in a loop
>>> of 4 ms and a user realtime application thread running at 1 ms. Let's
>>> say step 1 happens at time x ms, and the user loop runs 0.5 ms after
>>> the EtherCAT-OP thread.
>>>
>>> The following would happen:
>>> Time: Event
>>> x ms: Step 1, FSM prepares an FPRD datagram to 0x0910
>>> x+0.5 ms: Step 2, user loop sets a new app_time; the FPRD datagram
>>> gets sent out, and the sending timestamp is stored in
>>> datagram->jiffies_sent
>>> x+1.5 ms: Step 3, user loop sets a new app_time; the datagram is
>>> received, and the receiving timestamp is stored in
>>> datagram->jiffies_received
>>> x+2.5 ms: user loop sets a new app_time
>>> x+3.5 ms: user loop sets a new app_time
>>> x+4 ms: Step 4, FSM calculates the time offset
>>>
>>> And here is the source code in ec_fsm_master_dc_offset64():
>>>
>>> // correct read system time by elapsed time since read operation
>>> correction = (u64) (jiffies_since_read * 1000 / HZ) * 1000000;
>>> system_time += correction;
>>> time_diff = fsm->slave->master->app_time - system_time;
>>>
>>> jiffies is a counter in the Linux kernel which is incremented at a
>>> frequency defined by HZ. I have a 250 Hz Linux system, so 1 jiffy
>>> means 4 ms. jiffies_sent was taken when the master clock was at
>>> x+0.5 ms, and the current jiffies value is taken at x+4 ms. So there
>>> is a 0.5/4 = 12.5% probability that jiffies does not increase during
>>> those 3.5 ms, and an 87.5% probability that it has increased by 1.
>>> This means the value "correction" typically is 4000000 ns, and
>>> occasionally 0 ns.
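>>>
>>> Plugging the numbers into the correction line above (HZ = 250):
>>>
>>> // jiffies_since_read == 1 (87.5% of the time):
>>> correction = (u64) (1 * 1000 / 250) * 1000000; // = 4 * 1000000 = 4000000 ns
>>> // jiffies_since_read == 0 (12.5% of the time):
>>> correction = (u64) (0 * 1000 / 250) * 1000000; // = 0 ns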
>>>
>>> Let's assume that the slave DC is perfectly synchronized with the
>>> master app time, so the system_time read from the slave equals
>>> x+0.5 ms (the time the FPRD datagram was sent). With the correction
>>> added, system_time = x+4.5 ms typically, or x+0.5 ms occasionally.
>>>
>>> The app_time at the time of step 4 is x+3.5 ms.
>>>
>>> So time_diff = app_time - system_time = -1000000 ns most of the time,
>>> and around +3000000 ns occasionally, depending on the correction.
>>>
>>> See, the time_diff should actually be 0, not -1 ms or +3 ms, since we
>>> said the slave DC is perfectly synchronized with the master app time.
>>>
>>> You may argue that the -1 ms error isn't that much, but this error
>>> typically grows to around -4 ms if the user realtime cycle loop runs
>>> every 4 ms, as in my case one year ago.
>>>
>>> Where does the error in the calculation come from?
>>> Two reasons:
>>> 1. jiffies has a bad resolution of 4 ms on a 250 Hz Linux system.
>>> 2. app_time is not the time at which step 4 is executed.
>>>
>>> While using get_cycles() instead of jiffies could improve the accuracy
>>> of the correction, the fact that app_time is not the current master
>>> system time would still drag errors into the time offset.
>>>
>>> Why do we need the "correction" here at all? Because the app_time in
>>> step 4 is not the app_time at which the slave system time was read.
>>>
>>> The key is to record the app_time at which the FPRD datagram to 0x0910
>>> is sent, and to use that app_time to calculate the time_diff, with no
>>> further correction needed, of course.
>>>
>>> I know, that is easier said than done. Right now I have two ideas for
>>> the master.
>>> The first idea: add a new member app_time_sent to the ec_datagram_t
>>> struct, record the app_time when each datagram gets sent, and compute
>>> time_diff = datagram->app_time_sent - system_time(0x0910).
>>>
>>> The second solution is a bit tricky: trigger the calculation from the
>>> user realtime cycle loop. That is, we could check the fsm_datagram in
>>> ecrt_master_receive(), or even in ecrt_master_application_time() while
>>> the last app_time is still there. If we find that it is an FPRD 0x0910
>>> datagram, we do the calculation right away using the old app_time.
>>>
>>> I think the first idea would be easier to implement; a sketch follows.
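>>>
>>> A minimal sketch of the first idea (app_time_sent is my proposed new
>>> member and does not exist in the master today; the other names are
>>> the existing ones as far as I know):
>>>
>>> // in master/datagram.h, new member in ec_datagram_t:
>>> u64 app_time_sent; /* app_time at the moment the datagram is sent */
>>>
>>> // in ecrt_master_send(), for every datagram that goes out:
>>> datagram->app_time_sent = master->app_time;
>>>
>>> // in ec_fsm_master_dc_offset64(), replacing the jiffies correction:
>>> time_diff = fsm->datagram->app_time_sent - system_time;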
>>>
>>>
>>> Besides the inaccurate calculation of the time offset, the other issue
>>> in the EtherLab master that bothers me is that the drift compensation
>>> seems to run at the same time as the new system time offset is
>>> calculated and sent to the slaves, since the drift compensation runs
>>> in the user realtime cycle loop while the t_offset calculation runs in
>>> EtherCAT-OP. Shouldn't the offset calculation be finished first,
>>> before sending the ref_sync_datagram to the ref clock and the
>>> sync_datagram to the other slaves? Won't the drift compensation on the
>>> slaves affect their local DC time (by slowing down or speeding up the
>>> clock), which in turn affects the t_offset calculation? And since
>>> phases 2 and 3 happen simultaneously, won't the sudden change of
>>> t_offset (which causes a sudden change of dt) cause some sort of
>>> disturbance to the drift compensation algorithm on the slave?
>>>
>>> I think we may need a boolean, set by the FSM, telling the user thread
>>> whether phase 2 is done; the user thread should only call
>>> ecrt_master_sync_reference_clock(master) and
>>> ecrt_master_sync_slave_clocks(master) once the correct system time
>>> offsets have been sent to the slaves.
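>>>
>>> In the user cycle this could look like the following sketch
>>> (dc_offset_done is the hypothetical flag set by the master FSM; the
>>> two ecrt_* calls are the real API):
>>>
>>> if (dc_offset_done) {
>>>     // start drift compensation only after phase 2 has finished
>>>     ecrt_master_sync_reference_clock(master);
>>>     ecrt_master_sync_slave_clocks(master);
>>> }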
>>>
>>>
>>>
>>> Sorry for writing such a long email; I hope I've made my thoughts
>>> clear. I could be wrong in many places, and I'll be very happy if
>>> somebody could change the EtherLab master code the way I described
>>> and test it for me.
>>>
>>>
>>> Wish all of you a Happy New Year!
>>>
>>> Jun
>>>
>>> On Mon, Dec 30, 2013 at 2:32 PM, Raz <raziebe at gmail.com> wrote:
>>> > Hey
>>> >
>>> > At the moment it takes a long time to calibrate the DC, approx. 5
>>> > seconds for each slave. I am setting up a system which is supposed
>>> > to control over 12 axes, and the calibration duration reaches a
>>> > minute.
>>> >
>>> > Is it possible to reduce this time?
>>> >
>>> >
>>> > --
>>> > https://sites.google.com/site/ironspeedlinux/
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Jun Yuan
>>> [Pronunciation: Djün Üän]
>>>
>>> Robotics Technology Leaders GmbH
>>> Am Loferfeld 58, D-81249 München
>>> Tel: +49 89 189 0465 24
>>> Mobile: +49 176 2176 5238
>>> Fax: +49 89 189 0465 11
>>> mailto: j.yuan at rtleaders.com
>>>
>>> Umlaut rule in the Chinese phonetic alphabet Pinyin: after the
>>> initials y, j, q, and x, u is pronounced as ü, e.g. yu => ü,
>>> ju => dschü, qu => tschü, xu => schü.
>>
>> --
>> https://sites.google.com/site/ironspeedlinux/
>>
>>
>
--
Jun Yuan
[Pronunciation: Djün Üän]
Robotics Technology Leaders GmbH
Am Loferfeld 58, D-81249 München
Tel: +49 89 189 0465 24
Mobile: +49 176 2176 5238
Fax: +49 89 189 0465 11
mailto: j.yuan at rtleaders.com
Umlaut rule in the Chinese phonetic alphabet Pinyin: after the
initials y, j, q, and x, u is pronounced as ü, e.g. yu => ü,
ju => dschü, qu => tschü, xu => schü.