[etherlab-users] r8169 patch - packet timeout boot failures

Tue Dec 3 13:00:39 CET 2013

The driver never fails to load, because I am getting timeout errors.  I did
not get a crash/panic, but some sort of system lockup (various user-space net
daemons stopped the boot process, probably because the interface bring-up
was in some error state; I think some netlink socket was hanging). That is
when I added the spin locks, which appear to stop the hang.
I think that once I patch the e1000e, we might have some more knowledge of
why this is happening. Please note that this problem happens on my various
Intel PC boards and is not bound to a single type of board.

On Tue, Dec 3, 2013 at 1:50 PM, Jeroen Van den Keybus <
jeroen.vandenkeybus at gmail.com> wrote:

> Just a thought since you mention booting: is it possible your driver is
> sometimes simply loaded before the master is (and fails to register) ? You
> mention that you crashed upon boot without the spinlocks and the only way
> to do that should be that you run as a regular netdev device (line 4170)
> incl. irqs. Could also explain why the e1000 has the problem.
>
> I suspect that adding the link status check merely causes an extra delay
> which could lead to the master being loaded earlier.
>
> J.
>
>
> 2013/12/3 Raz <raziebe at gmail.com>
>
>> All I am doing is more or less trial and error. I do not know the Realtek
>> driver at all.
>> The spinlocks are needed because this call is protected by them in the
>> original driver code flow. I had a boot lockup in one of my trials without
>> them. This patch does not eliminate the problem entirely, but going from
>> 100% failures (10 trials with 6 drives) to 1 failure out of 10, I believe
>> it is important enough to mail to the community. As for e1000e, I do not
>> know what the problem is; I need to check it and email you.
>>
>>
>>
>> On Tue, Dec 3, 2013 at 1:16 PM, Jeroen Van den Keybus <
>> jeroen.vandenkeybus at gmail.com> wrote:
>>
>>> Why the spinlock ? This driver instance shouldn't ever be reentering.
>>>
>>> I'm a bit worried that it would complicate the use of e.g. RTAI and
>>> Xenomai.
>>>
>>> How come the e1000 has the same issue ?
>>>
>>> J.
>>>
>>>
>>>
>>> 2013/12/3 Raz <raziebe at gmail.com>
>>>
>>>> The patch below seemed to eliminate the problem. I believe the problem
>>>> relates to resetting some registers when link up is detected.
>>>>
>>>> diff --git a/local_src/r8169-3.2/r8169.c b/local_src/r8169-3.2/r8169.c
>>>> index 6df1793..a483fb5 100644
>>>> --- a/local_src/r8169-3.2/r8169.c
>>>> +++ b/local_src/r8169-3.2/r8169.c
>>>> @@ -1290,6 +1290,9 @@ static void __rtl8169_check_link_status(struct net_device *dev,
>>>>
>>>>         if (tp->ecdev) {
>>>>                 ecdev_set_link(tp->ecdev, tp->link_ok(ioaddr) ? 1 : 0);
>>>> +               spin_lock_irqsave(&tp->lock, flags);
>>>> +               rtl_link_chg_patch(tp);
>>>> +               spin_unlock_irqrestore(&tp->lock, flags);
>>>>                 return;
>>>>         }
>>>>
>>>>
>>>>
>>>> On Tue, Dec 3, 2013 at 11:56 AM, Jeroen Van den Keybus <
>>>> jeroen.vandenkeybus at gmail.com> wrote:
>>>>
>>>>> Perhaps try hooking up a normal eth interface to the drive and see
>>>>> what the autoneg comes up with using ethtool. In the past, I have had
>>>>> trouble interfacing an FPGA IP core to a PC Ethernet card when the core was
>>>>> hard wired to 100M FD instead of advertising this using autoneg. The PC
>>>>> card tried to autoneg and then fell back to 100M HD.
>>>>>
>>>>> You could try testing with an EK1100 in between the PC and the drive.
>>>>>
>>>>> J.
>>>>>
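The autoneg fallback described above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions: the type
and function names (port_cfg, resolve_duplex, duplex_mismatch) are
hypothetical and come from neither ethtool nor any driver, and both ends are
assumed to run at the same speed. It encodes the usual rule that an
autonegotiating port facing a hard-wired partner can detect the speed but
not the duplex, and so settles on half duplex.

```c
#include <assert.h>
#include <stdbool.h>

enum duplex { DUPLEX_HALF, DUPLEX_FULL };

/* Hypothetical description of one end of the link. */
struct port_cfg {
    bool autoneg;        /* true: the port advertises its abilities */
    bool can_full;       /* with autoneg: full duplex is advertised */
    enum duplex forced;  /* used only when autoneg is false */
};

/* Duplex that `self` ends up using when linked to `peer`. */
static enum duplex resolve_duplex(struct port_cfg self, struct port_cfg peer)
{
    assert(self.forced == DUPLEX_HALF || self.forced == DUPLEX_FULL);

    if (!self.autoneg)
        return self.forced;   /* hard-wired port keeps its setting */
    if (!peer.autoneg)
        return DUPLEX_HALF;   /* partner is silent: fall back to half duplex */
    return (self.can_full && peer.can_full) ? DUPLEX_FULL : DUPLEX_HALF;
}

/* True when the two ends disagree about the duplex mode. */
static bool duplex_mismatch(struct port_cfg a, struct port_cfg b)
{
    return resolve_duplex(a, b) != resolve_duplex(b, a);
}
```

With a PC port that autonegotiates and a core hard-wired to full duplex,
duplex_mismatch() reports true; once both ends autonegotiate and advertise
full duplex, it reports false.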
>>>>>
>>>>> 2013/12/3 Raz <raziebe at gmail.com>
>>>>>
>>>>>> I do not have ethtool on the EtherCAT device, as it is removed. How
>>>>>> can I tell ? eth0 is 100Mbps but it is my public interface. eth1 is my
>>>>>> EtherCAT interface.
>>>>>>
>>>>>> There is always a link. The first slave is a drive, not an I/O device.
>>>>>> This drive is running Xilinx with the port stack and IP core of Beckhoff.
>>>>>> I am now trying to debug the Realtek driver, let's see...
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 3, 2013 at 11:36 AM, Jeroen Van den Keybus <
>>>>>> jeroen.vandenkeybus at gmail.com> wrote:
>>>>>>
>>>>>>> It would be very useful to know whether e.g. the interfaces ended up
>>>>>>> in 100M half duplex or so. Is there a link in those cases ? What's the
>>>>>>> first EtherCAT station ? Maybe it doesn't handle autoneg properly during
>>>>>>> its reset phase ?
>>>>>>>
>>>>>>> J.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2013/12/3 Raz <raziebe at gmail.com>
>>>>>>>
>>>>>>>> Hey,
>>>>>>>> The problem happens with the Intel e1000e as well as the Realtek. One
>>>>>>>> way to bypass it is to boot the master while the Ethernet-EtherCAT
>>>>>>>> cable is disconnected, and once the master claims the interface,
>>>>>>>> connect the cable. This appears to work.
>>>>>>>> So there is some sort of initialisation error.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Dec 2, 2013 at 11:32 AM, Raz <raziebe at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I still do not have a scenario. It "sometimes" happens. The
>>>>>>>>> -DRTL8169_DEBUG is something I did not know, so I will check and see. Thx
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Dec 2, 2013 at 11:27 AM, Jeroen Van den Keybus <
>>>>>>>>> jeroen.vandenkeybus at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Is there a difference between cold and warm boot ? Does unloading
>>>>>>>>>> the ec driver, loading/unloading the stock r8169 driver and then reloading
>>>>>>>>>> the ec driver work better ? Same scenario but with Realtek drivers (r8168)
>>>>>>>>>> ? Also perhaps compile with -DRTL8169_DEBUG ?
>>>>>>>>>>
>>>>>>>>>> Just some thoughts.
>>>>>>>>>>
>>>>>>>>>> J.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2013/12/2 Raz <raziebe at gmail.com>
>>>>>>>>>>
>>>>>>>>>>> The timeouts happen after the system boots, and not while the slaves
>>>>>>>>>>> are in OP mode. So my transmit is irrelevant here, even though a
>>>>>>>>>>> transmit happens only from a single thread or through an ioctl (SDO
>>>>>>>>>>> reads and so on...)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Dec 2, 2013 at 11:01 AM, Jeroen Van den Keybus <
>>>>>>>>>>> jeroen.vandenkeybus at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Why do you disable the rtl8169_phy_timer timer ?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The rtl8169_phy_timer is regularly polled in ec_poll instead.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> 2. In rtl_hw_start_8168: why do you disable RTL_W16(IntrMask,
>>>>>>>>>>>>> tp->intr_event); ?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> The drivers are all non-blocking and interrupt-free. All work
>>>>>>>>>>>> that interrupt handlers normally do is done in ec_poll instead.
>>>>>>>>>>>>
>>>>>>>>>>>> If you cannot send packets anymore, I suspect that you may have
>>>>>>>>>>>> overrun the tx queue, i.e. sent a packet before the previous one has been
>>>>>>>>>>>> completed. You're also not calling the ethercat transmission functions from
>>>>>>>>>>>> different threads, right ?
>>>>>>>>>>>>
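The tx queue overrun suspected above can be pictured with a small user-space
model of a polled transmit ring. This is a sketch only, not code from r8169
or the EtherCAT master: the ring size, the slot states and the helpers
tx_send, hw_finish_oldest and tx_poll are all hypothetical. It assumes that,
with interrupts disabled, slots are reclaimed only when the poll function
runs, so sending faster than polling eventually finds no free slot.

```c
#include <assert.h>
#include <stdbool.h>

#define TX_RING_SIZE 4u  /* hypothetical; kept tiny for illustration */

enum slot_state { SLOT_FREE, SLOT_HW_OWNED, SLOT_DONE };

struct tx_ring {
    enum slot_state slot[TX_RING_SIZE];
    unsigned cur_tx;    /* count of frames queued so far */
    unsigned dirty_tx;  /* count of frames reclaimed so far */
};

/* Queue one frame; refuse rather than overwrite an unreclaimed slot. */
static bool tx_send(struct tx_ring *r)
{
    if (r->cur_tx - r->dirty_tx >= TX_RING_SIZE)
        return false;  /* ring full: sending now would overrun it */

    unsigned i = r->cur_tx % TX_RING_SIZE;
    assert(r->slot[i] == SLOT_FREE);
    r->slot[i] = SLOT_HW_OWNED;
    r->cur_tx++;
    return true;
}

/* Simulated NIC: finish transmitting the oldest frame it still owns. */
static void hw_finish_oldest(struct tx_ring *r)
{
    for (unsigned n = r->dirty_tx; n != r->cur_tx; n++) {
        unsigned i = n % TX_RING_SIZE;
        if (r->slot[i] == SLOT_HW_OWNED) {
            r->slot[i] = SLOT_DONE;
            return;
        }
    }
}

/* Polled cleanup: does the work a tx interrupt handler would normally do. */
static unsigned tx_poll(struct tx_ring *r)
{
    unsigned reclaimed = 0;

    while (r->dirty_tx != r->cur_tx) {
        unsigned i = r->dirty_tx % TX_RING_SIZE;
        if (r->slot[i] != SLOT_DONE)
            break;  /* hardware has not finished this frame yet */
        r->slot[i] = SLOT_FREE;
        r->dirty_tx++;
        reclaimed++;
    }
    return reclaimed;
}
```

In this model a slot becomes usable again only after tx_poll() has seen it
complete, so a frame the simulated hardware has already finished still
blocks tx_send() until the next poll.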
>>>>>>>>>>>>
>>>>>>>>>>>>> thank you
>>>>>>>>>>>>> raz
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> https://sites.google.com/site/ironspeedlinux/
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> etherlab-users mailing list
>>>>>>>>>>>>> etherlab-users at etherlab.org
>>>>>>>>>>>>> http://lists.etherlab.org/mailman/listinfo/etherlab-users
>>>>>>>>>>>>>

-- 
https://sites.google.com/site/ironspeedlinux/