Ethernet LAN CRC Errors on GigE.

Posted by nkarkhan on December 10, 2006, 3:28 am


Hello,
I am hoping somebody with more GigE signalling knowledge can help me
out here.

I have a hard real-time system that is built up of a distributed
network of processors. I chose Ethernet as my backbone, connected via
an Ethernet switch. I cannot afford any packet drops (no time for
retransmits or anything).
Now, I know that layer 3 protocols are designed to withstand packet
drops and have retx mechanisms for a reason, but in practice, with good
Ethernet switches, network cards and cables, systems go on forever
without dropping packets.
In a custom system with < 50 foot cables, good connectors and good
shielding, it should be possible to expect long periods of time (24
hrs?) without any drops.

Well, on my production systems I see a drop now and then. Since the
architecture worked fine for a year in the lab, my first idea was that
we had bad cables, connectors, switch or NIC. Replacing the
switch seemed to make a difference. (It was a Dell; I have since
recommended Cisco.) No errors were seen when the Dell switch was
replaced with a Cisco.
Hooking up the Dell (same ports) and testing with a Spirent didn't show
any errors. Same for the cables and the NIC.
So it must be a combination of the Dell with the cabling and the NIC.
What kind of equipment do I need to test the signalling quality of the
Dell, cable and the NIC?
I need to install some tests that could be done in manufacturing to
avoid such problems.

Looking forward to your ideas,
Nitin.


Posted by stephen on December 10, 2006, 6:27 am


> Hello,
> I am hoping somebody with more GigE signalling knowledge can help me
> out here.
>
> I have a hard real-time system that is built up of a distributed
> network of processors. I chose Ethernet as my backbone, connected via
> an Ethernet switch. I cannot afford any packet drops (no time for
> retransmits or anything).
> Now, I know that layer 3 protocols are designed to withstand packet
> drops and have retx mechanisms for a reason, but in practice, with good
> Ethernet switches, network cards and cables, systems go on forever
> without dropping packets.
> In a custom system with < 50 foot cables, good connectors and good
> shielding, it should be possible to expect long periods of time (24
> hrs?) without any drops.

If you want more reliability (and fewer constraints on cable length) - try
looking for fibre-based NICs.

However - fibre-based NICs and switches are going to be more expensive.

As an aside - I would still want a "hard real time system" to do lots of
error checks in case something does break, even if it doesn't know how to
recover.

>
> Well, on my production systems I see a drop now and then. Since the
> architecture worked fine for a year in the lab, my first idea was that
> we had bad cables, connectors, switch or NIC. Replacing the
> switch seemed to make a difference. (It was a Dell; I have since
> recommended Cisco.) No errors were seen when the Dell switch was
> replaced with a Cisco.

Most switches are contended in some way - so you want to look for a switch
that doesn't have internal bottlenecks given the attached devices, port
speeds and quantities.

> Hooking up the Dell (same ports) and testing with a Spirent didn't show
> any errors. Same for the cables and the NIC.
> So it must be a combination of the Dell with the cabling and the NIC.

No, because you are assuming that if individual links are error free, then
there will not be any drops.

The missing issue here is that the traffic patterns may produce contention -
there are 2 constraints.

If 10 of your devices send packets at 1 other, then as long as the average
rate on the destination port is below 1 Gbps - everything is fine - on
average.

However, if the traffic is bursty - then the switch may still need to buffer
some packets. If it runs out of packet buffers during a burst, then packets
will get dropped.
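
The average-versus-burst point can be illustrated with a toy model of one output port. This is only a sketch: the buffer size, tick length and traffic patterns are made-up numbers, and a real switch shares buffers between ports in ways this ignores.

```python
def simulate_output_port(arrivals, buffer_frames, drain_per_tick=1):
    """Simulate one switch output port with a finite FIFO buffer.

    arrivals: frames arriving for this port in each time tick.
    buffer_frames: frames the port can hold (hypothetical figure).
    drain_per_tick: frames the port can transmit per tick (line rate).
    Returns (forwarded, dropped, max_queue_depth).
    """
    queue = 0
    forwarded = dropped = max_depth = 0
    for arriving in arrivals:
        # Frames that do not fit in the buffer are dropped on arrival.
        accepted = min(arriving, buffer_frames - queue)
        dropped += arriving - accepted
        queue += accepted
        max_depth = max(max_depth, queue)
        # The port then transmits at line rate.
        sent = min(queue, drain_per_tick)
        queue -= sent
        forwarded += sent
    # Whatever is still queued after the last tick is eventually sent.
    forwarded += queue
    return forwarded, dropped, max_depth


# Same average load (10 frames in 20 ticks = 50% of line rate), two patterns.
smooth = [1, 0] * 10
bursty = [10] + [0] * 19

print(simulate_output_port(smooth, buffer_frames=4))  # (10, 0, 1)
print(simulate_output_port(bursty, buffer_frames=4))  # (4, 6, 4)
```

Both patterns offer the same average load, but only the bursty one overflows the (assumed) 4-frame buffer.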

Finally - you mentioned the app cannot handle delayed traffic and resends.

You need to think about buffering - since that means delaying packets. There
is a balance here between potential burst size, load and potential delay
through the switch. A good switch for your requirement would let you tune those
tradeoffs if needed (Cat 650x with Sup 720-3B would be my starting point if
you need lots of ports).

The only practical way around this is to have lots of "headroom" - i.e. if you
have 1 Gbps ports, then the app is probably OK if it isn't going to generate
more than 200 - 300 Mbps PEAK traffic.


> What kind of equipment do I need to test the signalling quality of the
> Dell, cable and the NIC?

In the lab at work we use an Agilent N2X? - but your Spirent should be able
to do it (depending on which unit you have).

The big issue with this type of tester is that designing tests and using the
gear, and then making sure the results mean what you think, is still a bit of
a black art - assume you are going to need a week or 2 to get the testing
procedures tied down.

None of this stuff is cheap though - the flip side is that if you can afford
the tester, then paying for a good switch isn't going to be a problem.

> I need to install some tests that could be done in manufacturing to
> avoid such problems.

I suspect that only testing a couple of interfaces is the issue, since just
about any switch should be able to run a couple of ports at wire speed - try
scaling up the tests to more ports.
>
> Looking forward to your ideas,
> Nitin.
>
--
Regards

stephen_hope@xyzworld.com - replace xyz with ntl



Posted by nkarkhan on December 10, 2006, 1:23 pm


Thank you Stephen,
I should have mentioned a few more things.
We use GigE because the time on the wire for a 500 byte packet is low
(4 usecs?) vs. 100 Mb (40 usec).
Fibre is going to be a tough sell; we have 10s of thousands invested in
custom cables that have Cat 6 bundled in them.
If we do have contention and the switch drops packets, I would expect
it to show up as some counter on the Dell. The Dell statistics are
pristine. The drops are correlated with the CRC errors seen. The CRC errors
are seen on the NIC side.
We only send a packet every millisecond in a SW 1/2 duplex fashion from
3 NICs (which will be scaled up), so I think we have more than enough
bandwidth.
If the switch gets contention at any time, it will only have, worst case,
2 packets going to 1 output port to deal with. Any switch worth its
salt should support a queue size of 2.
(My switch requirements are very simple: 8 ports (but we use 24), 2
VLANs, GigE capability, very low latency (I am assuming all switches
these days are cut-thru and not store and forward) and most of all no
errors.)
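
The wire-time and bandwidth figures above can be checked with a short calculation. This is a sketch under assumptions: it treats the "500 byte packet" as the whole frame, it reads the traffic description as 3 NICs each sending one frame per millisecond, and the optional 20 bytes of overhead (preamble, start-of-frame delimiter and minimum inter-frame gap) is added only to show how little it changes the picture.

```python
def wire_time_us(frame_bytes, link_bps, overhead_bytes=0):
    """Serialization time in microseconds for one frame on a link."""
    bits = (frame_bytes + overhead_bytes) * 8
    return bits * 1e6 / link_bps


def offered_load_fraction(frames_per_sec, frame_bytes, link_bps):
    """Fraction of link capacity used by a steady stream of frames."""
    return frames_per_sec * frame_bytes * 8 / link_bps


GIGE = 1e9      # 1 Gbps
FAST_E = 100e6  # 100 Mbps

print(wire_time_us(500, GIGE))    # 4.0
print(wire_time_us(500, FAST_E))  # 40.0

# With preamble + SFD (8 bytes) and inter-frame gap (12 bytes) included.
print(wire_time_us(500, GIGE, overhead_bytes=20))

# Assumed traffic: 3 NICs x 1000 frames/s converging on one GigE port.
print(offered_load_fraction(3 * 1000, 500, GIGE))
```

Under those assumptions the offered load is a small fraction of a gigabit port, which supports the view that bandwidth is not the limiting factor here.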

Hence my earlier conclusion that the errors are caused by a combination
of the Dell, cables, interconnects and the NIC. Replacing the switch
with the Cisco makes the problem go away. I didn't see any Spirent tests
to check signal quality... or maybe there are?

Nitin.


Posted by stephen on December 10, 2006, 3:01 pm


> Thank you Stephen,
> I should have mentioned a few more things.
> We use GigE because the time on the wire for a 500 byte packet is low
> (4 usecs?) vs. 100 Mb (40 usec).

Agreed it is faster - but does it matter for what you are doing with the
packet timing?

FWIW 100 Mbps would be more tolerant of cabling issues....

> Fibre is going to be a tough sell; we have 10s of thousands invested in
> custom cables that have Cat 6 bundled in them.

OK - however next time it would be a good idea to go for flexible cabling :)

> If we do have contention and the switch drops packets, I would expect
> it to show up as some counter on the Dell. The Dell statistics are
> pristine. The drops are correlated with the CRC errors seen. The CRC errors
> are seen on the NIC side.

Nope - that isn't a drop in the switch. It sounds more like the switch
forwards the packet, but it then arrives at the NIC with an error - this
could be noise, or a switch problem of some sort.
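
To make the "arrives at the NIC with an error" case concrete, here is a small sketch of how a frame check sequence catches a corrupted bit. It uses Python's `zlib.crc32`, which implements the same CRC-32 that Ethernet uses for its FCS; the 64-byte dummy frame and the helper names are made up for illustration, and a real NIC does this in hardware and simply increments a CRC error counter.

```python
import struct
import zlib


def append_fcs(frame: bytes) -> bytes:
    """Append a 4-byte CRC-32 check value, least significant byte first."""
    return frame + struct.pack("<I", zlib.crc32(frame) & 0xFFFFFFFF)


def fcs_ok(frame_with_fcs: bytes) -> bool:
    """Recompute the CRC over the body and compare with the trailer."""
    body, fcs = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF) == fcs


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (simulated line noise)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


frame = append_fcs(bytes(range(64)))  # dummy 64-byte frame body
print(fcs_ok(frame))                  # True
print(fcs_ok(flip_bit(frame, 100)))   # False: the receiver discards it
```

The switch counters stay clean in this scenario because the frame was forwarded; the damage is only visible at the receiving end.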

> We only send a packet every millisecond in a SW 1/2 duplex fashion from
> 3 NICs (which will be scaled up), so I think we have more than enough
> bandwidth.
> If the switch gets contention at any time, it will only have, worst case,
> 2 packets going to 1 output port to deal with. Any switch worth its
> salt should support a queue size of 2.
> (My switch requirements are very simple: 8 ports (but we use 24), 2
> VLANs, GigE capability, very low latency (I am assuming all switches
> these days are cut-thru and not store and forward) and most of all no
> errors.)

yes - any switch that can deal with different speed ports more or less has
to be store and forward.
>
> Hence my earlier conclusion that the errors are caused by a combination
> of the Dell, cables, interconnects and the NIC. Replacing the switch
> with the Cisco makes the problem go away. I didn't see any Spirent tests
> to check signal quality... or maybe there are?

AFAIR - dont know :) -

With all of these analysers you do get inbound and outbound packet counts
(otherwise they couldn't do the basic job of checking drop rates for a test
device) - so those give you an idea of how many packets don't arrive - which
is really all that matters.

However - you should get error rates for the ports on the analyser, although
you probably have to tweak the test config to see them.
Try running a long-term soak test - continuous overnight or over a weekend -
and see what you get?
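
When planning a soak test it helps to know roughly how many errored frames a given bit error rate would produce. The sketch below is an estimate under stated assumptions: independent bit errors, 500-byte frames, 3000 frames per second in total, and a BER of 1e-10, which is the commonly quoted worst-case objective for 1000BASE-T (real links are usually much better than that).

```python
import math


def frame_error_probability(frame_bits, ber):
    """Probability that at least one bit in a frame is corrupted."""
    # 1 - (1 - ber)**frame_bits, written to stay accurate for tiny BERs.
    return -math.expm1(frame_bits * math.log1p(-ber))


def expected_errored_frames(frames_per_sec, duration_s, frame_bits, ber):
    """Expected number of frames failing the CRC check during the test."""
    frames = frames_per_sec * duration_s
    return frames * frame_error_probability(frame_bits, ber)


def probability_clean_run(frames_per_sec, duration_s, frame_bits, ber):
    """Chance of finishing the soak test with zero errored frames."""
    frames = frames_per_sec * duration_s
    return math.exp(frames * frame_bits * math.log1p(-ber))


DAY = 24 * 3600
FRAME_BITS = 500 * 8  # assumed frame size from the thread

for ber in (1e-10, 1e-12, 1e-14):
    errs = expected_errored_frames(3000, DAY, FRAME_BITS, ber)
    clean = probability_clean_run(3000, DAY, FRAME_BITS, ber)
    print(f"BER {ber:.0e}: ~{errs:.2f} errored frames/day, "
          f"P(clean day) = {clean:.3f}")
```

At the assumed worst-case BER the model predicts on the order of a hundred errored frames per day, so a requirement of zero drops in 24 hours implicitly asks for a link several orders of magnitude better than that figure, and a single overnight run can only bound the error rate loosely.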

I would be very suspicious of a cabling error rate that goes away when you
swap out the switch - but the key fact for you is probably that you don't get
them with the Cisco rather than why they happen with the Dell.
>
> Nitin.
>
--
Regards

stephen_hope@xyzworld.com - replace xyz with ntl



Posted by nkarkhan on December 11, 2006, 4:18 pm



stephen wrote:
> > Thank you Stephen,
> > I should have mentioned a few more things.
> > We use GigE because the time on the wire for a 500 byte packet is low
> > (4 usecs?) vs. 100 Mb (40 usec).
>
> Agreed it is faster - but does it matter for what you are doing with the
> packet timing?

Unfortunately it does.
Stuff needs to execute in a few hundred micro-seconds. We have been
spending weeks trying to squeeze micro-seconds from our algorithms.
Giving up 40 usecs is a lot.

