[etherlab-dev] Hot plugged modules failing to read DC register

Thu Sep 26 03:13:16 CEST 2019

I'm not talking about a reg_read request.

When the slave first goes into SAFEOP+ERROR, the master should detect this during its regular slave scan cycle (fsm_master_state_read_al_status).  It then calls ec_fsm_change_ack to acknowledge the ERROR state - as it does this, it should report something like this to the syslog (even at debug 0 level):

kernel: [49369.927687] EtherCAT ERROR 0-main-2: AL status message 0x001B: "Sync manager watchdog".
kernel: [49369.953541] EtherCAT 0-main-2: Acknowledged state SAFEOP.

This is the message that tells you why the slave was unhappy and went to the ERROR state in the first place.  After this, the slave should be in SAFEOP (no ERROR), and the master will usually then bring it back to INIT and do a full reconfigure and back to OP (if the master app is still running).

If you have the quick-op patch applied, and the AL status was 0x001B and not some other error (and the master hasn't restarted since the slave entered the error state), then instead of going back to INIT in the above case it will try to go straight back from SAFEOP to OP to reduce the downtime.  As mentioned in the patch notes, this is fine in most cases but it can upset some DC slaves (as it may not give them enough time to resync), so you may want to disable this feature via configure.  (You can clearly see in the syslogs when it's attempting a quick return to OP.)

The slave should always accept acknowledging the ERROR state and returning to INIT.

The "slave has error_flag set" thing is different; that's referring to slave->error_flag, which is only set when the slave fails to respond correctly to requests (such as refusing to perform AL state changes or not responding to internal register requests), despite still responding to the broadcast AL state request (used to determine the number of active slaves).  The only way to clear this flag is to rescan, either via "ethercat rescan" or by the automatic rescan that it performs when the number of slaves changes (as it normally does during a genuine comms interruption).

So have a look further back in the logs to see why the error_flag is being set in the first place.

Though also make sure that you're not using any extra patches which disable the automatic rescan.  I recall someone on this list a while ago suggesting this was a good idea; it isn't.

Gavin Lambert
Senior Software Developer

[cid:logo_compac_5dcf97ef-52f5-498c-8b9b-728410ddffaf.png]
[cid:compacicon_82e8a8c7-154a-4a32-9720-a5badb6258e0.png]<http://www.compacsort.com> [cid:facebook_fa85b924-53b9-45cc-8162-0564f64ec3a3.png] <https://www.facebook.com/Compacsort>  [cid:linkedin_4ec016ad-84fa-443c-85a3-b9615a4ccef8.png] <https://www.linkedin.com/company/compac-sorting-equipment/>  [cid:youtube_32142163-fc27-4aed-b14d-e8a377f98a6d.png] <https://vimeo.com/compacsort>  [cid:twitter_d89338d8-98c8-4b65-9a9e-7b1333160b0d.png] <https://twitter.com/compacsort>  [cid:insta2_1cd85de9-b3a2-4971-9904-52b2481a7c82.png] <https://www.instagram.com/compacsort/>

COMPAC SORTING EQUIPMENT LTD | 4 Henderson Pl | Onehunga | Auckland 1061 | New Zealand
Switchboard: +64 96 34 00 88 | tomra.com<http://www.tomra.com>

The information contained in this communication and any attachment is confidential and may be legally privileged. It should only be read by the person(s) to whom it is addressed. If you have received this communication in error, please notify the sender and delete the communication.

From: Graeme Foot <Graeme.Foot at touchcut.com>
Sent: Thursday, 26 September 2019 12:32
To: Gavin Lambert <gavin.lambert at tomra.com>; etherlab-dev at etherlab.org
Subject: RE: Hot plugged modules failing to read DC register

Hi Gavin,

When 0x0910 returns a working counter of 0 the master assumes the slave only supports delay measurement and the slave->has_dc_system_time flag remains zero.  If the flag is zero a few DC configuration steps are missed.  Subsequently, when the slave DC Assign Active value is set the slave has not been set up to use DC correctly and fails to sync.

So without the patch (extracts from attached "DC Sync Error.txt"):
[  270.369125] EtherCAT DEBUG 0-main-3: Ready for requests.
[  270.376908] EtherCAT DEBUG 0-main-3: Slave has no System Time register; delay measurement only.
[  270.393234] EtherCAT WARNING 0-main-3: SII Read Error, EEPROM not loaded.  Retrying...
[  270.476850] EtherCAT 0-main-3: SII EEPROM loaded.  Continuing.
[  270.916606] EtherCAT DEBUG 0-main-3: Unknown category type 0x003C.
[  270.916612] EtherCAT DEBUG 0-main-3: Slave is not in the state to do mailbox com (INIT), setting to PREOP.
[  270.916616] EtherCAT DEBUG 0-main-3: Configuring...
[  270.920595] EtherCAT DEBUG 0-main-3: Now in INIT.
[  270.920598] EtherCAT DEBUG 0-main-3: Clearing FMMU configurations...
[  270.920601] EtherCAT DEBUG 0-main-3: Clearing mailbox check flag...
[  270.922603] EtherCAT DEBUG 0-main-3: Clearing sync manager configurations...
[  270.924595] EtherCAT DEBUG 0-main-3: Configuring mailbox sync managers...
[  270.924600] EtherCAT DEBUG 0-main-3: SM0: Addr 0x1000, Size 128, Ctrl 0x26, En 1
[  270.924605] EtherCAT DEBUG 0-main-3: SM1: Addr 0x1080, Size 128, Ctrl 0x22, En 1
[  270.930589] EtherCAT DEBUG 0-main-3: Now in PREOP.
[  270.930592] EtherCAT DEBUG 0-main-3: Finished configuration.

[  273.639104] EtherCAT WARNING 0-main-3: Slave seems not to support distributed clocks!
[  273.647686] EtherCAT DEBUG 0-main-3: Setting DC cycle times to 1000000 / 0.
[  273.649100] EtherCAT DEBUG 0-main-3: Checking for synchrony.
[  273.743070] EtherCAT DEBUG 0-main-3: Sync after  100 ms: 1828300146 ns
[  273.842996] EtherCAT DEBUG 0-main-3: Sync after  200 ms: 1828303744 ns
...
[  278.440471] EtherCAT DEBUG 0-main-3: Sync after 4800 ms: 1830172951 ns
[  278.540418] EtherCAT DEBUG 0-main-3: Sync after 4900 ms: 1830251113 ns
[  278.640369] EtherCAT WARNING 0-main-3: Slave did not sync after 5000 ms.
[  278.648810] EtherCAT DEBUG 0-main-3: app_start_time=0
[  278.648814] EtherCAT DEBUG 0-main-3:       app_time=289356014683
[  278.648817] EtherCAT DEBUG 0-main-3:     start_time=289456014683
[  278.648820] EtherCAT DEBUG 0-main-3:          cycle=1000000
[  278.648824] EtherCAT DEBUG 0-main-3:     shift_time=500000
[  278.648827] EtherCAT DEBUG 0-main-3:      remainder=14683
[  278.648830] EtherCAT DEBUG 0-main-3:          start=289457500000
[  278.648850] EtherCAT DEBUG 0-main-3: Setting DC cyclic operation start time to 289457500000.
[  278.650379] EtherCAT DEBUG 0-main-3: Setting DC AssignActivate to 0x0300.

[  278.695998] EtherCAT DEBUG 0-main-3: Now in SAFEOP.
[  283.688588] EtherCAT ERROR 0-main-3: Timeout while setting state OP.
[  283.726198] EtherCAT 0: Slave states on main device: PREOP, SAFEOP, OP + ERROR.
[  283.766130] EtherCAT DEBUG 0-main-3: SAFEOP -> SAFEOP + ERROR.

Comparing that to with the patch (extracts from the attached "DC With Patch.txt"):
[   47.873112] EtherCAT DEBUG 0-main-3: Ready for requests.
[   47.873140] EtherCAT DEBUG 0-main-3: Scanning slave 3 on main link.
[   47.881120] EtherCAT WARNING 0-main-3: Slave did not respond to System Time register request, retrying...
[   47.956443] EtherCAT 0-main-3: Slave has the System Time register.
[   48.021036] EtherCAT DEBUG 0-main-3: SII firmware file not found; reading SII data from slave.
[   48.400841] EtherCAT DEBUG 0-main-3: Unknown category type 0x003C.
[   48.400848] EtherCAT DEBUG 0-main-3: Slave is not in the state to do mailbox com (INIT), setting to PREOP.
[   48.400852] EtherCAT DEBUG 0-main-3: Configuring...
[   48.404827] EtherCAT DEBUG 0-main-3: Now in INIT.
[   48.404831] EtherCAT DEBUG 0-main-3: Clearing FMMU configurations...
[   48.404835] EtherCAT DEBUG 0-main-3: Clearing mailbox check flag...
[   48.406831] EtherCAT DEBUG 0-main-3: Clearing sync manager configurations...
[   48.408828] EtherCAT DEBUG 0-main-3: Clearing DC assignment...
[   48.410832] EtherCAT DEBUG 0-main-3: Configuring mailbox sync managers...
[   48.410839] EtherCAT DEBUG 0-main-3: SM0: Addr 0x1000, Size 128, Ctrl 0x26, En 1
[   48.410843] EtherCAT DEBUG 0-main-3: SM1: Addr 0x1080, Size 128, Ctrl 0x22, En 1
[   48.416831] EtherCAT DEBUG 0-main-3: Now in PREOP.
[   48.416836] EtherCAT DEBUG 0-main-3: Finished configuration.

[   49.991966] EtherCAT DEBUG 0-main-3: Checking system time offset.
[   49.993418] EtherCAT DEBUG 0-main-3: DC 64 bit system time offset calculation: system_time=2217603868, app_time=59887013582, diff=57669409714
[   49.993423] EtherCAT DEBUG 0-main-3: Setting time offset to 57669409714 (was 0)

[   51.152316] EtherCAT DEBUG 0-main-3: Setting DC cycle times to 1000000 / 0.
[   51.154320] EtherCAT DEBUG 0-main-3: Checking for synchrony.
[   51.256259] EtherCAT DEBUG 0-main-3: Sync after  100 ms:      14466 ns
[   51.354208] EtherCAT DEBUG 0-main-3: Sync after  200 ms:      14695 ns
...
[   51.955878] EtherCAT DEBUG 0-main-3: Sync after  800 ms:      12253 ns
[   52.053823] EtherCAT DEBUG 0-main-3: Sync after  900 ms:      11113 ns
[   52.133801] EtherCAT DEBUG 0-main-3: 10000 ns difference after 970 ms.
[   52.133806] EtherCAT DEBUG 0-main-3: app_start_time=0
[   52.133810] EtherCAT DEBUG 0-main-3:       app_time=62029021245
[   52.133813] EtherCAT DEBUG 0-main-3:     start_time=62129021245
[   52.133816] EtherCAT DEBUG 0-main-3:          cycle=1000000
[   52.133820] EtherCAT DEBUG 0-main-3:     shift_time=500000
[   52.133823] EtherCAT DEBUG 0-main-3:      remainder=21245
[   52.133826] EtherCAT DEBUG 0-main-3:          start=62130500000
[   52.133829] EtherCAT DEBUG 0-main-3: Setting DC cyclic operation start time to 62130500000.
[   52.135789] EtherCAT DEBUG 0-main-3: Setting DC AssignActivate to 0x0300.

[   52.184538] EtherCAT DEBUG 0-main-3: Now in SAFEOP.
[   52.749441] EtherCAT DEBUG 0-main-3: Now in OP. Finished configuration.

Without the patch there is no AL Error code, just a failure to DC Sync.  Without the slave->has_dc_system_time flag being set the DC time offset is not applied so the slave has no chance to sync.

Without the patch it took the slave approx. 108ms for the SII loading to be complete.  With the patch it took approx. 83ms for the 0x0910 register to respond.  They are generally a similar timeframe so I suspect the slave does not reliably respond while it is still initialising.

Also, restarting the application or manually changing the modules state to INIT or PREOP and back to OP doesn't sort out the flag.  If there has been enough time for the modules DC clock to eventually sync it will reach OP, but if not it will go back to SAFEOP + ERROR.  It takes a "rescan" command to the master to sort out the flag so that the offset calculation can be performed.

I've also found that when the slave is in SAFEOP + ERROR reg_read requests (and pretty much all communications to the slave) fail due to "Aborting register request, slave has error flag set.".  So it's a little hard to manually check if there's any error status set.

Graeme.

From: Gavin Lambert <gavin.lambert at tomra.com<mailto:gavin.lambert at tomra.com>>
Sent: Wednesday, 25 September 2019 5:06 PM
To: Graeme Foot <Graeme.Foot at touchcut.com<mailto:Graeme.Foot at touchcut.com>>; etherlab-dev at etherlab.org<mailto:etherlab-dev at etherlab.org>
Subject: RE: Hot plugged modules failing to read DC register

When the slave goes to safeop+error it should also output an AL error code which might give a hint as to why.  This should be logged to the syslog when the master acknowledges the error.

AL error 0x001B, for example, indicates that the slave stopped receiving SM frames (typical of a comms interruption) - and features/quick-op in the patchset tries to do a quicker recovery for this case by trying to go straight back to OP instead of going through a full PREOP reconfiguration.  It's possible that some slaves may need the full reconfigure, so you can disable this behaviour at configure time.

Other AL error codes mean other things, such as your DC cycle being poorly synced and frames not occurring in a strict SYNC0-SM-SYNC0-SM ordering.

But I wouldn't normally expect any standard registers to fail WC when this occurs, unless perhaps the slave was performing a full power reset (or otherwise holding the slave's ESC in reset).  Though this would interrupt comms to any downstream slaves as well, so it's not something that slaves are supposed to do of their own accord.  (And it shouldn't stay in safeop+error in that case, it should revert to Init, although that's up to the slave implementation.)

90ms seems a bit slow for just an ESC power-on SII read, although it's possible that it's doing something more complicated.

I'm not really familiar with those modules, however; you're probably best off asking Beckhoff directly.

Gavin Lambert
Senior Software Developer

[cid:image001.png at 01D57467.308BA7D0]
[cid:image002.png at 01D57467.308BA7D0]<https://eur02.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.compacsort.com&data=02%7C01%7Cgavin.lambert%40tomra.com%7C1ff312cb3ca744c8d84608d74219029f%7C4308d118edd143008a37cfeba8ad5898%7C0%7C0%7C637050547545822065&sdata=4v%2FLiit48W4T0CTGsry1ht9XN4rQFjF2sd9oTr0qYko%3D&reserved=0>[cid:image003.png at 01D57467.308BA7D0]<https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.facebook.com%2FCompacsort&data=02%7C01%7Cgavin.lambert%40tomra.com%7C1ff312cb3ca744c8d84608d74219029f%7C4308d118edd143008a37cfeba8ad5898%7C0%7C0%7C637050547545822065&sdata=54WxDplleMlI8eNdg0Ww44QP9ruo7WMzPU59LtZkYSE%3D&reserved=0>[cid:image004.png at 01D57467.308BA7D0]<https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fcompac-sorting-equipment%2F&data=02%7C01%7Cgavin.lambert%40tomra.com%7C1ff312cb3ca744c8d84608d74219029f%7C4308d118edd143008a37cfeba8ad5898%7C0%7C0%7C637050547545832054&sdata=16LH%2FgzUujQGM73f0tA99iUTZKQqO0lDQ41bVXl440E%3D&reserved=0>[cid:image005.png at 01D57467.308BA7D0]<https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fvimeo.com%2Fcompacsort&data=02%7C01%7Cgavin.lambert%40tomra.com%7C1ff312cb3ca744c8d84608d74219029f%7C4308d118edd143008a37cfeba8ad5898%7C0%7C0%7C637050547545832054&sdata=qjMN%2B8ZoxIl3as6CW7Ro4IdZUUA1Uro02ONcEKSTgSM%3D&reserved=0>[cid:image006.png at 01D57467.308BA7D0]<https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2Fcompacsort&data=02%7C01%7Cgavin.lambert%40tomra.com%7C1ff312cb3ca744c8d84608d74219029f%7C4308d118edd143008a37cfeba8ad5898%7C0%7C0%7C637050547545842052&sdata=oCwVN5X9I%2F6T4PCqVUpqFDN9%2B0HhzjlA%2BoA2euzikAQ%3D&reserved=0>[cid:image007.png at 01D57467.308BA7D0]<https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.instagram.com%2Fcompacsort%2F&data=02%7C01%7Cgavin.lambert%40tomra.com%7C1ff312cb3ca744c8d84608d74219029f%7C4308d118edd143008a37cfeba8ad5898%7C0%7C0%7C637050547545842052&sdata=V6jixq%2BG%2FgvXrZoR0gCRO8em6kDbusZ5cNo56Zld%2BNQ%3D&reserved=0>

COMPAC SORTING EQUIPMENT LTD | 4 Henderson Pl | Onehunga | Auckland 1061 | New Zealand
Switchboard: +64 96 34 00 88 | tomra.com<https://eur02.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.tomra.com&data=02%7C01%7Cgavin.lambert%40tomra.com%7C1ff312cb3ca744c8d84608d74219029f%7C4308d118edd143008a37cfeba8ad5898%7C0%7C0%7C637050547545852042&sdata=rWnlGtoCRWYNL8TRJLa400PjLG8n8fwsXOSzJN0Q28U%3D&reserved=0>

The information contained in this communication and any attachment is confidential and may be legally privileged. It should only be read by the person(s) to whom it is addressed. If you have received this communication in error, please notify the sender and delete the communication.

From: Graeme Foot
Sent: Tuesday, 24 September 2019 17:20
To: etherlab-dev at etherlab.org<mailto:etherlab-dev at etherlab.org>
Subject: [etherlab-dev] Hot plugged modules failing to read DC register

Hi,

I've had occasional issues with EL7332 and EL7342 modules where they will go to SafeOp + Error if you try and use them in DC mode.  I've finally had some time to look into it a little further.

When the modules go to SafeOp + Error the master outputs the message "Slave has no System Time register; delay measurement only." (with debug level 1).  This occurs due to the datagram reading register 0x0910 returning a working counter of zero.

I created a quick hack to retry reading the register up to 100 times before failing.  After approx. 90ms the EL7342 module I'm testing with successfully returned the datagram and the slave entered Op state successfully.

In my test setup I also have an EL5101 module that was doing the exact same thing (and taking around the same time), but I've never really had issues with them before.  I suspect the difference is that if you have incorrect settings on the EL7342 module and try to run a motor it can error out and reset itself, causing a situation equivalent to a hot plug.

Without my hack both modules need to wait for the SII read to complete for a similar length of time, so it looks like the slaves do not respond to the 0x0910 register request until the EEPROM read is complete.  Does anyone know if this is expected behaviour, or know of a better solution than to retry reading the register (up to 200ms ???)?

Regards,
Graeme.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20190926/5963038d/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 11438 bytes
Desc: image001.png
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20190926/5963038d/attachment-0014.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.png
Type: image/png
Size: 1629 bytes
Desc: image002.png
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20190926/5963038d/attachment-0015.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image003.png
Type: image/png
Size: 1750 bytes
Desc: image003.png
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20190926/5963038d/attachment-0016.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image004.png
Type: image/png
Size: 1855 bytes
Desc: image004.png
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20190926/5963038d/attachment-0017.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image005.png
Type: image/png
Size: 1970 bytes
Desc: image005.png
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20190926/5963038d/attachment-0018.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image006.png
Type: image/png
Size: 20278 bytes
Desc: image006.png
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20190926/5963038d/attachment-0019.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image007.png
Type: image/png
Size: 1506 bytes
Desc: image007.png
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20190926/5963038d/attachment-0020.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: logo_compac_5dcf97ef-52f5-498c-8b9b-728410ddffaf.png
Type: image/png
Size: 11438 bytes
Desc: logo_compac_5dcf97ef-52f5-498c-8b9b-728410ddffaf.png
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20190926/5963038d/attachment-0021.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: compacicon_82e8a8c7-154a-4a32-9720-a5badb6258e0.png
Type: image/png
Size: 1629 bytes
Desc: compacicon_82e8a8c7-154a-4a32-9720-a5badb6258e0.png
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20190926/5963038d/attachment-0022.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: facebook_fa85b924-53b9-45cc-8162-0564f64ec3a3.png
Type: image/png
Size: 1750 bytes
Desc: facebook_fa85b924-53b9-45cc-8162-0564f64ec3a3.png
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20190926/5963038d/attachment-0023.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: linkedin_4ec016ad-84fa-443c-85a3-b9615a4ccef8.png
Type: image/png
Size: 1855 bytes
Desc: linkedin_4ec016ad-84fa-443c-85a3-b9615a4ccef8.png
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20190926/5963038d/attachment-0024.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: youtube_32142163-fc27-4aed-b14d-e8a377f98a6d.png
Type: image/png
Size: 1970 bytes
Desc: youtube_32142163-fc27-4aed-b14d-e8a377f98a6d.png
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20190926/5963038d/attachment-0025.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: twitter_d89338d8-98c8-4b65-9a9e-7b1333160b0d.png
Type: image/png
Size: 20278 bytes
Desc: twitter_d89338d8-98c8-4b65-9a9e-7b1333160b0d.png
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20190926/5963038d/attachment-0026.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: insta2_1cd85de9-b3a2-4971-9904-52b2481a7c82.png
Type: image/png
Size: 1506 bytes
Desc: insta2_1cd85de9-b3a2-4971-9904-52b2481a7c82.png
URL: <http://lists.etherlab.org/pipermail/etherlab-dev/attachments/20190926/5963038d/attachment-0027.png>