CRC Errors on GigE.

Hello, I am hoping somebody with more GigE signalling knowledge can help me out here.

I have a hard real-time system that is built from a distributed network of processors. I chose Ethernet as my backbone, connected via an Ethernet switch. I cannot afford any packet drops (there is no time for retransmits or anything). Now, I know that layer 3 protocols are designed to withstand packet drops and have retransmission mechanisms for a reason, but in practice, with good Ethernet switches, network cards and cables, systems go on forever without dropping packets. In a custom system, with < 50 foot cables, good connectors and good shielding, it should be possible to expect long periods of time (24 hrs?) without any drops.

Well, on my production systems, I see a drop now and then. Since the architecture worked fine for a year in the lab, my first idea was that we had bad cables, connectors, switch or NIC card. Replacing the switch seemed to make a difference (it was a Dell; I have since recommended Cisco). No errors were seen when the Dell switch was replaced with a Cisco. Hooking up the Dell (same ports) and testing with a Spirent didn't show any errors; same for the cables and the NIC card. So it must be a combination of the Dell with the cabling and the NIC card. What kind of equipment do I need to test the signalling quality of the Dell, cable and NIC card? I need to establish some tests that could be done in manufacturing to avoid such problems.

Looking forward to your ideas, Nitin.

Reply to
nkarkhan

If you want more reliability (and fewer constraints on cable length), try looking at fibre-based NICs.

However, fibre-based NICs and switches are going to be more expensive.

As an aside, I would still want a "hard real time system" to do lots of error checks in case something does break, even if it doesn't know how to recover.

Most switches are contended in some way, so you want to look for a switch that doesn't have internal bottlenecks given the attached devices, port speeds and quantities.

No, because you are assuming that if individual links are error-free, then there will not be any drops.

The missing issue here is that the traffic patterns may produce contention - there are two constraints.

If 10 of your devices send packets at one other device, then as long as the average rate on the destination port is below 1 Gbps, everything is fine - on average.

However, if the traffic is bursty, then the switch may still need to buffer some packets. If it runs out of packet buffers during a burst, then packets will get dropped.

Finally - you mentioned the app cannot handle delayed traffic and resends.

You need to think about buffering, since that means delaying packets. There is a balance here between potential burst size, load and potential delay through the switch. A good switch for your requirement would let you tune those tradeoffs if needed (a Cat 650x with Sup 720-3B would be my starting point if you need lots of ports).

The only practical way around this is to have lots of "headroom" - i.e. if you have 1 Gbps ports, then the app is probably OK if it isn't going to generate more than 200-300 Mbps PEAK traffic.
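
A rough illustration of the burst/buffer point (all numbers here are assumed for the sketch, not taken from any particular switch): two traffic patterns with the *same* average load, one of which overruns a finite egress buffer and drops frames.

```python
# Toy model of one 1 Gbps egress port with a finite packet buffer.
# Numbers (buffer depth, burst sizes) are assumptions for illustration only.

FRAME_BITS = 500 * 8                    # 500-byte frames
LINE_RATE = 1e9                         # 1 Gbps egress port
BUFFER_FRAMES = 64                      # assumed per-port buffer depth
SERVICE_TIME = FRAME_BITS / LINE_RATE   # seconds to drain one frame

def simulate(burst_size, burst_interval, duration=1.0):
    """Deliver `burst_size` frames back-to-back every `burst_interval` seconds."""
    queue = 0.0
    dropped = 0
    t = 0.0
    while t < duration:
        # drain the buffer at line rate for the time since the last burst
        queue = max(0.0, queue - burst_interval / SERVICE_TIME)
        # the burst arrives essentially at once from several ingress ports
        for _ in range(burst_size):
            if queue < BUFFER_FRAMES:
                queue += 1
            else:
                dropped += 1
        t += burst_interval
    return dropped

# Both cases average 100 frames/ms (~400 Mbps, well under 1 Gbps),
# but only the burstier pattern loses frames.
print(simulate(burst_size=10,   burst_interval=1e-4))   # gentle bursts: 0 drops
print(simulate(burst_size=1000, burst_interval=1e-2))   # big bursts: many drops
```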

In the lab at work we use an Agilent N2X (I think), but your Spirent should be able to do it (depending on which unit you have).

The big issue with this type of tester is that designing tests, using the gear, and then making sure the results mean what you think they mean is still a bit of a black art - assume you are going to need a week or two to get the testing procedures tied down.

None of this stuff is cheap, though - the flip side is that if you can afford the tester, then paying for a good switch isn't going to be a problem.

I suspect that only testing a couple of interfaces is the issue, since just about any switch should be able to run a couple of ports at wire speed - try scaling up the tests to more ports.

Reply to
stephen

Thank you Stephen; I should have mentioned a few more things. We use GigE because the time on the wire for a 500-byte packet is low (4 usecs?) versus 100 Mb (40 usecs). Fibre is going to be a tough sell; we have tens of thousands invested in custom cables that have Cat 6 bundled in them. If we did have contention and the switch dropped packets, I would expect it to show up as some counter on the Dell, and the Dell statistics are pristine. The drops are correlated with CRC errors, and the CRC errors are seen on the NIC side. We only send a packet every millisecond in a software half-duplex fashion from 3 NICs (which will be scaled up), so I think we have more than enough bandwidth. If the switch sees contention at any time, it will have at worst 2 packets going to 1 output port to deal with, and any switch worth its salt should support a queue size of 2. (My switch requirements are very simple: 8 ports (but we use 24), 2 VLANs, GigE capability, very low latency (I am assuming all switches these days are cut-through and not store-and-forward) and, most of all, no errors.)
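
For reference, the arithmetic behind those numbers (assuming the 500-byte frames and one frame per millisecond per NIC described above):

```python
# Serialization time and offered load for the traffic profile described above.
FRAME_BYTES = 500
frame_bits = FRAME_BYTES * 8

for rate, name in [(1e9, "GigE"), (100e6, "100 Mb")]:
    print(f"{name}: time on the wire = {frame_bits / rate * 1e6:.1f} us")
# GigE:   4.0 us
# 100 Mb: 40.0 us

nics = 3
frames_per_second = 1000 * nics          # one frame per millisecond per NIC
offered_load = frames_per_second * frame_bits
print(f"offered load ~ {offered_load / 1e6:.1f} Mbps")   # ~12 Mbps, tiny vs 1 Gbps
```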

Hence my earlier conclusion that the errors are caused by a combination of the Dell, cables, interconnects and the NIC. Replacing the switch with the Cisco makes the problem go away. I didn't see any Spirent tests to check signal quality... or maybe there are?

Nitin.

Reply to
nkarkhan

Agreed, it is faster - but does it matter for what you are doing with the packet timing?

FWIW, 100 Mbps would be more tolerant of cabling issues...

OK - however, next time it would be a good idea to go for flexible cabling :)

Nope - that isn't a drop in the switch. It sounds more like the switch forwards the packet, but it then arrives at the NIC with an error - this could be noise, or a switch problem of some sort.

Yes - any switch that can deal with different-speed ports more or less has to be store-and-forward.

AFAIR - don't know :)

With all of these analysers you do get inbound and outbound packet counts (otherwise they couldn't do the basic job of checking drop rates for a test device) - so those give you an idea of how many packets don't arrive, which is really all that matters.

However, you should get error rates for the ports on the analyser, although you probably have to tweak the test config to see them. Try running a long-term soak test - continuous overnight or over a weekend - and see what you get.

I would be very suspicious of a cabling error rate that goes away when you swap out the switch - but the key fact for you is probably that you don't get the errors with the Cisco, rather than why they happen with the Dell.

Reply to
stephen

nkarkhan wrote in part:

Then you should also have some money invested in proper Cat5e/6 certifications of those cables!

Then use the Crisco! Switches vary in design. All have limitations. I would expect that a higher-end device would have more crossbar busses to permit more simultaneous transfers and avoid queuing. Mangled queuing might show as CRC errors.

A simple test is to load two stations with `ttcp` in full duplex, then two more, then two more, etc., and see what happens to throughput.
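
For what it's worth, here is a minimal sketch in the same spirit (not `ttcp` itself; the port number, payload size and addresses are placeholders): a UDP blaster plus a receiver that counts missing sequence numbers. A single Python sender will not come close to filling a gigabit pipe, so run several pairs, as suggested above. Note the missing-sequence count includes drops anywhere in the path, not just wire errors.

```python
# Hypothetical ttcp-style load pair: run "send <receiver_ip>" on one station
# and "recv" on the other, then add more pairs and watch the loss count.
import socket, struct, sys, time

PORT = 5001
PAYLOAD = b"\x00" * 1458          # keeps the datagram under a 1500-byte MTU

def sender(dest_ip, seconds=10):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent = 0
    deadline = time.time() + seconds
    while time.time() < deadline:
        sock.sendto(struct.pack("!Q", sent) + PAYLOAD, (dest_ip, PORT))
        sent += 1
    print(f"sent {sent} datagrams")

def receiver(idle_timeout=12):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PORT))
    sock.settimeout(idle_timeout)          # stop after this much silence
    got, last_seq, gaps = 0, -1, 0
    try:
        while True:
            data, _ = sock.recvfrom(2048)
            seq = struct.unpack("!Q", data[:8])[0]
            gaps += max(0, seq - last_seq - 1)   # count missing sequence numbers
            last_seq, got = seq, got + 1
    except socket.timeout:
        pass
    print(f"received {got} datagrams, {gaps} missing")

if __name__ == "__main__":
    if sys.argv[1] == "send":
        sender(sys.argv[2])
    else:
        receiver()
```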

Reply to
Robert Redelmeier

Nit-picking time :) Layer 3 protocols would be "Network" layer protocols such as IPv4 or IPv6. They do _not_ have retransmission mechanisms. There may be some Layer 3 protocols with retransmission, but generally it is the Transport layer, or Layer 4, which has retransmission - and even that isn't absolute: TCP retransmits, UDP does not.

Doesn't "Ethernet" have a Bit Error Rate specification? Does that spec for BER permit 24 hours without any errors at your bitrate? I wouldn't be at all surprised if the BERs are "conservative" (ie higher than one may often see) but if you have such hard requirements...

rick jones

Reply to
Rick Jones

Although if the OP's actual traffic is more request/response, using a request/response benchmark might be better.

rick jones

Reply to
Rick Jones

Agreed. That's what `ping -f` is for :)

`ttcp` is good at loading up channels so you can see what request/response becomes under heavy load. But it is still not as good as a dozen or two fast-running req/resp streams.
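
A hedged sketch of the request/response side: a small UDP echo probe you could run while the load generators are busy, to watch what happens to latency and loss. The port number and timeout are assumptions for illustration.

```python
# Hypothetical req/resp probe: run echo_server() on one station and
# rtt_probe("<server_ip>") on another while the channel is under load.
import socket, statistics, time

def rtt_probe(server_ip, port=5002, count=1000):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(0.1)                   # anything slower counts as lost here
    rtts, lost = [], 0
    for _ in range(count):
        t0 = time.perf_counter()
        sock.sendto(b"ping", (server_ip, port))
        try:
            sock.recvfrom(64)
            rtts.append((time.perf_counter() - t0) * 1e6)   # microseconds
        except socket.timeout:
            lost += 1
    if rtts:
        print(f"lost {lost}/{count}, median RTT {statistics.median(rtts):.0f} us, "
              f"max {max(rtts):.0f} us")
    else:
        print(f"lost all {count} probes")

def echo_server(port=5002):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, addr = sock.recvfrom(64)
        sock.sendto(data, addr)
```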

-- Robert

Reply to
Robert Redelmeier

Are you sure the NIC is operating at gigabit speed? If the NIC is configured for 100 Meg full-duplex and the switch for 100 Meg half-duplex, you will get late collisions, which will show up as CRC errors on the NIC side. Maybe the NIC is fixed at 100 Meg full-duplex and not auto-negotiating. The Cisco switch might be able to work this out and configure itself to match the NIC.

Reply to
Marris

Sure does ;-)

I found an interesting paper, "Is there such a thing as zero bit error rate?" (formatting link).

I also found an article (formatting link) which indicates a product with 10 Gbit/s throughput over Cat 5e, with a theoretical BER of 10**(-12) rather than the standard 10**(-10).

10 Gbit/s is 10**10 bit/s, so at a BER of 10**(-12) we would expect a bit error every 100 seconds on average. There are 864 intervals of 100 seconds per day, so to go 24 hours without one error would be 864 times the mean. If we drop back to 1 Gbit/s but still assume that our "custom cables" with good connectors etc. are able to achieve 10**(-12), then that'd still be 86.4 times the mean. We could work out the probability of that if we knew the distribution of errors, but we can more simply just take 86400 seconds per day times 10**9 bits/s and invert the ratio, to get a BER of about 10**(-14) at GigE, 10**(-15) at 10 Gig. That's two or three orders of magnitude better than is achieved by advanced devices with tight specs, which is sufficient to strain credibility.
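
The same arithmetic, spelled out as a quick script (assuming independent bit errors, so the mean time between errors is simply 1/(BER × bit rate)):

```python
# Mean time between bit errors, and the BER implied by "one error per day".

def mean_seconds_between_errors(ber, bitrate):
    return 1.0 / (ber * bitrate)

print(mean_seconds_between_errors(1e-12, 10e9))   # 100 s at 10 Gig, BER 1e-12
print(mean_seconds_between_errors(1e-12, 1e9))    # 1000 s at GigE, BER 1e-12

seconds_per_day = 86400
for bitrate in (1e9, 10e9):
    ber = 1.0 / (seconds_per_day * bitrate)
    print(f"one error/day at {bitrate:g} bit/s implies BER ~ {ber:.1e}")
    # ~1.2e-14 at GigE, ~1.2e-15 at 10 Gig
```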

The first paper I referenced above showed that Cat 5e is sufficient to achieve 10**(-10) (the nominal standard) at GigE if one assumes worst-case spec-conforming parts, but if one extrapolates one more step one can see it would not be sufficient for 10 Gig. The second paper indicates 10 Gig uses PAM-10; someone who works more closely in the field could substitute that into the equation below Figure 1 in the first paper in order to determine the SNR in dB.

Reply to
Walter Roberson

Unfortunately it does. Stuff needs to execute in a few hundred microseconds. We have been spending weeks trying to squeeze microseconds out of our algorithms. Giving up 40 usecs is a lot.

Reply to
nkarkhan

The NIC is configured for 1000 Meg and full duplex, and the switch status shows the same for the port. Also, because of the way the software works, we cannot have collisions.

One end station sends a packet, and the other side responds in under 800 usecs. If any packet is lost, the machine keels over.
Reply to
nkarkhan

:) You are right in your IP-centric view :). LLC-2 has retransmission, doesn't it? And it's a Layer 2 protocol. I should have said higher-level protocols instead of Layer 3.

This is another problem. Bit error rates are specified for Cat 5, Cat 5e and Cat 6 cables. Going by those, we should be seeing a whole lot more errors, but in practice they are seldom seen in networks where you have good cabling, connectors etc. Not being an EE and not knowing anything about signalling, I have no idea how the bit error rates are determined.

Reply to
nkarkhan

Walter, I did go through a lot of such data before designing the system. Cat 5e BER was what, 10**(-10)? Cat 6 was 10**(-12), I believe. In our case, since we send 500 bytes of data in each direction every msec, it translates to 1000 * 10 (making the maths easier with 10 bits in a byte) * 1000 bits every second, or 10**7 bits a second. So we should see an error every 10**10 / 10**7 = 1000 seconds, i.e. roughly every 17 minutes. We do not see an error even every 30 mins. Since my test code has been running for over a year, I should have seen something. My desktop, which hasn't been rebooted in over a month, doesn't show any NIC errors. Calls to friends in the Cisco Catalyst group and at Extreme Networks, asking about their experience with CRC errors over time, don't reveal any anecdotal data where CRC errors were seen. So yes, it seems like some EE can prove the bit error rate, but I don't see it in practice.
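
For the traffic profile quoted above, the same back-of-the-envelope calculation gives:

```python
# Expected interval between errors for ~1e7 bits/s on the wire
# (500 bytes each way per millisecond, rounded to 10 bits per byte as above).
bits_per_second = 1000 * 10 * 1000        # ~1e7 bits/s

for ber in (1e-10, 1e-12):
    seconds = 1.0 / (ber * bits_per_second)
    print(f"BER {ber:g}: one error every ~{seconds / 60:.0f} minutes")
# BER 1e-10: one error every ~17 minutes
# BER 1e-12: one error every ~1667 minutes (about 28 hours)
```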

I am very happy with all the questions raised and am hoping some will eventually lead to the problem, but I am back to my question. My cables are good (Cat 6 rated), my patch cords are good, replacing the Dell makes the problem go away, and I cannot replace the NIC. What equipment do I need, and what testing do I need to do, to ensure that my signal quality is pristine enough for the NIC not to give me errors? Should I look at the S/N ratio in my setup? And if the S/N ratio is a lot better than what's specified at the end of the link, I could calculate the probability and see if I can live with it? (Seems like a reasonable idea; let me run it by the EEs we have here.)

My current Cisco is good enough, but how do I know that the next batch of Cisco switches I buy won't start giving me the same errors? The earlier Dells were pretty good too.


Reply to
nkarkhan

That the machine would keel over (which I take to mean "die") if any packet is lost seems rather brittle... It is one thing to have "the machine" go through some sort of involved recovery procedure (short of reinitialization), but to have it keel over...

rick jones

Reply to
Rick Jones

"Keel over" doesnt mean die, but it means that customer is pissed off (pretty close to "die" i guess). The 1 packet missed and its a disaster design comes from limiting the machines top speed, if i reduce the machines top speed, i can tolerate

2 -3 -4 packets being dropped. But I do want to go as fast as safely possible.

Plus, I was expecting bit blasting noise to come from sources like Motors starting, electric spikes, sunspots etc. All of which last for

10+ milliseconds. So the idea was that if I miss one packet, I would most probably lose the next 10-20 msecs of packets too. I cannot reduce the speed to allow for 10 missed packets anyways, so why not just design for no losses and add sheilding to keep the noise out. b.t.w none of these noise sources are present in the current crc error scenario.
Reply to
nkarkhan

I think there is a fundamental issue here, in that "real networks" are not perfect - but you are assuming yours will be (for some approximation of perfect).

More importantly, it's a bad idea to push any such system to the edge (which is what it sounds like you are doing).

Once your pristine new system is installed, it will be handed over to the tender mercies of your customer, the users/operators and their maintenance techs.

They will change bits of the system, plug them into a backbone, distribute the components across a WAN, break the limits on cable lengths, nail through the cables, tie-wrap your wonderful leads to a power cable, try to use SNMP to collect stats, plug a phone in instead of the Ethernet...

So a spanning-tree event that stops traffic forwarding for tens of seconds is probably not going to be welcome then?

Where I have seen networks used in control systems before, critical "bits" used two different networks in parallel, so that a one-off failure would only affect one of them (although big interference spikes, power effects etc. may well be common to both).

Reply to
stephen

I think that most of us are trying to say that despite your good fortune thus far, there is no "Ethernet" equipment out there you can safely ass-u-me will not give you errors at all. Particularly if this stuff is to go into what may not be all that pleasant an environment.

The EEs signed off on this? It almost sounds like something where you could tell us what the kit will "really" be doing, but then you would have to shoot us.

You can't. You can only assume that they will be no worse than the rated BER for Ethernet, and that if they _are_ worse, they are defective. Ethernet equipment was neither spec'ed nor, I suspect, implemented for the rather tight tolerances your application appears to have.

Ethernet is "best effort"; it makes few if any guarantees, and it looks like you need/want guarantees that the "Ethernet" specifications cannot give you.

If I were looking to maximize the reliability of my Ethernet network, I'd probably want switches with ECC memory in them, lest a single bit error while frames are being forwarded either trigger a parity event causing the switch to drop the frame, or turn into something detected only at the final destination. I'd probably want the same thing in all the "data paths" through the switch (busses etc). I'd probably want similar things in my NICs if that were possible. Of course, being "best effort" means that implementors of Ethernet kit can simply say "well, all we need is parity to just detect it and drop the frame" or "we can rely on the CRC to protect data integrity and have the frame dropped." Again, because "Ethernet" is not specced for what you seem to want to do.

If I could, I probably wouldn't _really_ run Ethernet at all but perhaps something that at the data-link level looked remarkably _like_ Ethernet - with its header and all - but at the physical layer had some, perhaps ghastly, quantity of FEC - Forward Error Correction - such that even if there were a single-bit error in a frame on the wire or fibre, the FEC could correct it. Something that went well above and beyond anything in the current physical encodings. The idea is to take a physical BER and have a much, Much, MUCH better "effective" BER at the frame level. I would guess that the basic principles wouldn't be all _that_ far off from what folks do when talking to interplanetary space probes...
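
Purely as an illustration of the idea (not Ethernet, not any real PHY): a toy 3x repetition code with bytewise majority voting shows how trading bandwidth for redundancy can mask isolated bit errors. A real design would use something far more efficient (Reed-Solomon, LDPC), but the principle is the same.

```python
# Toy FEC sketch: triple the frame, then recover each bit by majority vote.

def fec_encode(frame: bytes) -> bytes:
    return frame * 3                       # send three copies back to back

def fec_decode(blob: bytes) -> bytes:
    n = len(blob) // 3
    a, b, c = blob[:n], blob[n:2 * n], blob[2 * n:3 * n]
    out = bytearray()
    for x, y, z in zip(a, b, c):
        out.append((x & y) | (y & z) | (x & z))   # bitwise majority of 3 copies
    return bytes(out)

frame = b"\x00\x11\x22hard real time payload"
mangled = bytearray(fec_encode(frame))
mangled[5] ^= 0xFF                         # corrupt a byte in the first copy
assert fec_decode(bytes(mangled)) == frame  # the "receiver" still recovers it
```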

Barring the existence of that sort of thing, I might be convinced to try to simulate it by telling my NIC to give me everything it sees, regardless of the Ethernet CRC, and use my own FEC in the frame. I would still be at the mercy of a switch possibly detecting a parity or ECC issue in its own memory or data paths and toasting some of my frames, though...

rick jones

Reply to
Rick Jones

Thank you for your assistance.

I think your comments below have a finger slip or two:

- You say that Ethernet switches performing worse than the rated BER have failed and are defective. My only point in this conversation was to determine the equipment that I need to figure out the BER :)

- The CRC is part of the Ethernet Type 2 frame, and of every other commonly used Ethernet frame type that I know of (802.2, SNAP, Novell). You kind of have to make sure that the DA/SA/FCS layout is constant across all Ethernet frames if they have to live on the same segment and be processed by the same NICs/switches.

- The "best effort" probably refers to the possibility of packets being dropped because of excessive collisions, not to the implementer using a parity bit instead of a CRC.

Let me take this question over to the Ethernet switch forum; perhaps they have different ideas. Thank you again for your time.


Reply to
nkarkhan

It is meaningless to specify or measure BER without also specifying or injecting the type and amount of noise that is creating those errors. For example, the original Ethernet specification provides for a BER not to exceed 10^-9, which the channel provides by assuring a 5:1 (14 dB) S/N ratio in the presence of a 2 V/m plane-wave field strength from 10 kHz-30 MHz, and 5 V/m from 30 MHz-1 GHz. That is, it is *very specific*; if you embed the system in a different noise environment, then the S/N ratio may be different, and the resulting BER may exceed the rated 10^-9.

Thus, the fact that all of your equipment conforms to IEEE 802.3 is NOT an assurance that you will see a BER of 10^-9 (or -10, -12, or any other specified value for the particular physical channel you are using). Particularly with the more recent twisted-pair media, Ethernet is designed for operation in benign office-automation environments; most commercial equipment is not targeted for harsh industrial use. Such use will likely result in more errors than in the benign environment, and possibly more than specified in the IEEE standard.

Furthermore, while BER is *historically* the most common means of specifying error rates in communications channels, it is sometimes an inappropriate metric in Ethernet (or any packetized communications system). In Ethernet, the only means of error detection is the frame check sequence (FCS, implemented as a 32-bit CRC). A single bit error in the frame will result in a CRC error, and the frame will be discarded. However, multiple-bit errors in the same frame will have the same result. Thus, there is often little difference to the higher-layer protocols and applications between a low BER and a high BER channel, if the errors are clustered in time such that they corrupt only a single frame. Indeed, it is this bursty nature of noise on many channels that prompts the use of packetized systems and CRC error-detection schemes.

What matters in Ethernet is the *frame loss rate* (FLR), not the bit error rate; i.e., the probability that a given frame will be discarded due to one or more errors within the frame. FLR is what is actually measured by most Ethernet test systems; they send frames through the channel and count CRC errors. For a low, Gaussian-distributed BER, the FLR is essentially the BER multiplied by the frame length in bits, since the probability of multiple errors within a single frame is vanishingly small. With higher error rates, and with bursty noise sources, this assumption is no longer valid; the BER may be high, but the FLR may be much lower. Again, this points out the need to precisely specify the noise environment for which the BER/FLR is being specified or measured.
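
For independent bit errors, the relationship is FLR = 1 - (1 - BER)^N for an N-bit frame, which is roughly N × BER when the BER is small. A quick sketch:

```python
# Frame loss rate implied by a given BER, assuming independent bit errors.
def frame_loss_rate(ber, frame_bytes):
    n_bits = frame_bytes * 8
    return 1.0 - (1.0 - ber) ** n_bits

for ber in (1e-8, 1e-10, 1e-12):
    print(f"BER {ber:g}: FLR ~ {frame_loss_rate(ber, 500):.1e} per 500-byte frame")
# Bursty (clustered) errors break this approximation: many bit errors can land
# in one frame, so the measured FLR can be much lower than N * BER suggests.
```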

You should expect occasional errors from noise in the Ethernet physical channel. If your system cannot tolerate such errors, then it is either improperly designed, or it cannot use Ethernet as its communications mechanism. If you want to know how often you will encounter such errors in your particular noise environment, you should run data through the system *in your particular environment* for a statistically-significant period of time (i.e., long enough to see many errors, so that the error probability and distribution can be predicted to a high degree of certainty).
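
One way to think about "statistically significant" here: if errors arrive at an unknown average rate, a Poisson model says you need roughly t = -ln(1 - p)/r of soak time just to see at least one error with probability p. A sketch with assumed error rates:

```python
# How long a soak test must run to have a 95% chance of seeing even one error,
# for a few assumed mean times between errors.
import math

def required_hours(mean_hours_between_errors, confidence=0.95):
    rate = 1.0 / mean_hours_between_errors          # errors per hour
    return -math.log(1.0 - confidence) / rate

for mean in (0.5, 24, 168):    # one error per half hour / day / week on average
    print(f"mean {mean} h between errors -> soak ~{required_hours(mean):.1f} h "
          f"for 95% confidence of seeing one")
# A single quiet overnight run says very little about a once-a-day error rate.
```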

No. The "best effort" nature of Ethernet applies to everything: physical channel errors, frame loss due to excessive collisions, unavailability of memory buffers in end stations and switches, inadequate CPU performance for sustained heavy loads in end stations and switches, memory errors, etc.--i.e., *everything* is treated as "best effort." It makes life much easier for equipment and system designers, as long as "best effort" is good enough.

Now, in the vast majority of application environments, "best effort" is good enough; Ethernet would never have achieved such widespread use if the system didn't work well enough to support the bulk of applications. "Best effort" may not be adequate for *your particular* application, however; hence my earlier statement that your application is either poorly designed, or cannot use Ethernet as its communication mechanism.

This seems to be a statement of the form, "Mommy said 'No,' so I'll ask Daddy."

-- Rich Seifert Networks and Communications Consulting 21885 Bear Creek Way (408) 395-5700 Los Gatos, CA 95033 (408) 228-0803 FAX

Send replies to: usenet at richseifert dot com

Reply to
Rich Seifert
