Network Failure - No Idea How to Troubleshoot

Hi there

Hope some kind soul out there can help or point me in the right direction.

For the last month or so our users have experienced a network failure about once per week. Rebooting the main 48-port unmanaged switch (Netgear) resolves the problem.

I would like to inspect what's going on at the switch to try and get to the bottom of the failure. I've downloaded and installed Ethereal; however, I have no idea what I should be looking for in the capture files. Can anyone help?

At present we have our W2K servers running on a copper gigabit switch, which is then connected to the 48-port switch. Various other switches are downstream of this one.

I plan on hanging a hub between a downstream switch and its uplink to the 48-port switch, and then using Ethereal to analyse what's happening.

However, I've no idea what I should be looking for! Can anyone help please??

Thanks

BC

Reply to
BC

Thanks for the response guys. To the best of my knowledge nothing has changed on the network. No new hubs/switches, no new cabling, not even new workstations. The hangs seem to occur about once per week, during working hours, but not at periods of peak activity. A reboot of the 48-port switch temporarily solves the problem. I'm going to try swapping out the switch this evening and see how things go...

Thanks again.

BC

Reply to
BC

This is a very tough problem to troubleshoot. First, what has changed? Any new hardware, software or usage pattern?

There are some simple hardware things to check: have you tried swapping out the switch? Is its power feed [UPS] good? Are any of the ports dead? Or likely to die, because they're on long/outdoor runs?

Unfortunately, this will only sniff traffic on that branch of the network, and it may not catch malformed packets. This is why people buy managed switches.

When do the hangs occur? During heavy usage, idle times?

-- Robert

Reply to
Robert Redelmeier

First of all, recall any new hardware, especially from other (non-Netgear) vendors. Sometimes new NICs can kill a switch.

Also look for anything nonstandard. Our programmers once added some IP and TCP options which killed the switch; such options are usually not documented by the firmware vendor.

Check the MAC table capacity. If it is overloaded, the switch starts to work like a hub, and the resulting collisions can kill the whole collision domain.

A Bad Thing (TM). I only have an old 10 Mb hub myself, so it would undoubtedly be a bottleneck. Are you lucky enough to have at least a 100 Mb hub?
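If you want a rough idea of how many MACs the switch is being asked to learn, you could tally distinct source addresses on the monitoring PC. Only a sketch, and it assumes a Linux box with root and Python 3; the interface name "eth0" is a placeholder:

# mac_tally.py - count distinct source MACs seen on the monitoring port.
# A sketch only: assumes a Linux monitoring box, root, and Python 3.
# "eth0" is a placeholder for the NIC plugged into the hub/tap.
import socket
import time

ETH_P_ALL = 0x0003   # ask the kernel for every protocol

s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
s.bind(("eth0", 0))

seen = set()
last = time.time()
while True:
    frame = s.recv(65535)
    seen.add(frame[6:12])            # bytes 6..11 of the Ethernet header = source MAC
    if time.time() - last >= 10:
        print(time.strftime("%H:%M:%S"),
              "distinct source MACs so far:", len(seen))
        last = time.time()

Compare the running count against whatever address-table size the switch's datasheet claims, if one is published.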

HTH, agkhram

Reply to
Alexey G. Khramkov

I assume you mean no traffic is getting through? With ethereal, you can monitor on one computer, while pinging it from others. Does it get through? When you ping, can you see lights flashing on the NICs and switch? If nothing's getting through the switch and only rebooting clears the problem, I'd say the switch is NFG.

As always with troubleshooting, take things one step at a time and verify the simple stuff first.

Reply to
James Knott

Can you have the switch send syslog messages? Are there any spanning-tree issues, loops forming, etc.? I'm currently troubleshooting a Cisco environment (6509 and 6513 core switches). For no apparent reason (no changes being made) the network has ground to a halt twice. It's very hard to determine what's going on if there's no logging available. External logging is OK but can only happen while the network/switch is operational. Local debugging/logging is better, but if you have to shut down a switch in a hurry (to restore network services) you lose whatever logging was there.

BernieM

Reply to
bkbigpond

Hmmm... replaced the suspect 48-port switch with a spare 24-port switch (it turns out only 23 ports were in use on the 48-port). Everything was fine for 25 days, then bang, network failure again yesterday. A reset of the switch restores connectivity. The only thing I haven't changed at this level is the server gigabit switch to which the 48-porter uplinks. I intend to test this with a spare at the weekend. If that fails I'm at a total loss. I'm pretty sure there are no physical loops in the network. The fact that the switches are unmanaged makes it difficult to troubleshoot. Any further troubleshooting advice would be appreciated!

Reply to
BC

Fans seem to be spinning round nicely / clear of crap. Environmental conditions are static. Thanks for the response tho!

Reply to
BC

Well, since I'm a hardware guy, is there any chance that the cooling fan(s) are {clogged, stopped, blocked} or that the closet the switch is in is warmer than before? Just a random thought...

Reply to
William P.N. Smith

As you say, difficult. One reset in 25 days isn't that horrible, but may not be acceptable in a commercial environment.

There are two general causes of switches needing resets: hardware and software. The hardware side would be things like static electricity, lightning, or poor grounding/inter-building links. These can also cause permanent failures.

The software side of things is more likely to be caused by unpredicted behaviour under high loads, buffer overflows, or evil packets. Make sure jumbo frames are turned off.

-- Robert

Reply to
Robert Redelmeier

Thanks for the reply Robert.

The switches are unmanaged, so there's no way of connecting to the switch in or out of band to view or manage anything; I can't see whether the ARP/MAC tables have overflowed, etc. I'm suspecting either screwy internal ARP/MAC tables or a broadcast storm (or a DoS attack?!).

I'm currently hanging Ethereal off a hub connected to the switch to see if there's an issue with broadcast traffic.
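Rather than just eyeballing the capture, something along these lines could put a number on the broadcast rate. A rough sketch only: it assumes a Linux monitoring box with root and Python 3, and "eth0" is a placeholder for the NIC plugged into the hub:

# bcast_rate.py - count broadcast frames per 5-second window.
# A sketch only: assumes a Linux monitoring box, root, and Python 3.
# "eth0" is a placeholder for the NIC plugged into the hub.
import socket
import time

ETH_P_ALL = 0x0003
BROADCAST = b"\xff" * 6              # destination ff:ff:ff:ff:ff:ff

s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
s.bind(("eth0", 0))
s.settimeout(1.0)

count = 0
window = time.time()
while True:
    try:
        frame = s.recv(65535)
        if frame[0:6] == BROADCAST:  # first 6 bytes = destination MAC
            count += 1
    except socket.timeout:
        pass
    if time.time() - window >= 5:
        print(time.strftime("%H:%M:%S"), count, "broadcast frames in the last 5s")
        count = 0
        window = time.time()

If the figure jumps by orders of magnitude just before a hang, that would point at a storm.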

As the problem is relatively infrequent, I think a weekly switch reboot procedure is probably the most pragmatic thing I can do. I'm hoping to replace the backbone with an L2 managed switch early next year, which might help me gain further insight into the problem.

Reply to
BC

Thanks for the response Paul. Interesting idea about data "spikes". I'm not aware of any high-network-utilisation apps coinciding with the network outages; however, it is a distinct possibility. We have a document management app that IS used infrequently and, if used incorrectly, can fire enormous amounts of data across the wire. I think this is worth investigating.

I don't think there's a problem with the computer closet electrics. I've recently had these tested in connection with a new backup generator and nothing has shown up. The actual hardware is protected by a UPS with surge protectors, although again power spikes could be an issue. I'll ask our sparkies to monitor over a longer period.

Thanks for your valued input!

Reply to
BC

There's not a chance that they are running a job locally that could introduce a 'spike', is there? Thinking of, say, payroll running up a 'burster', or the welding shop firing up that old MIG welder that only gets used once a month or so for the 'special jobs'? Could you run a mains supply tester and look for brown-outs or spikes that could be picked up by the equipment?

It may be that they are running a job that's chucking a massive load of data onto the network (just clutching at straws; I really wouldn't expect that).

Reply to
Paul Vacquier

I doubt something as common as ARP table overflow would cause this. However, you could easily get a broadcast storm if older computers running NetBEUI protocols get connected and start trying to access network resources.

Broadcast traffic should be received on all ports.

It is pragmatic, but I'm not sure it will help much. Something happens on your network to cause the hang. Up until it happens, everything is likely fine and the reboots do nothing. This isn't a case of slow deterioration.

That will help. Please remember that cheap unmanaged switches are just that. They're not meant for much cascading. I would also try to stick to the same manufacturer.

-- Robert

Reply to
Robert Redelmeier

You are currently trying to fix the problem by jumping straight to solutions. Instead, first learn facts; solutions come later. At present you don't even know whether this is a data problem or a hardware problem, and until you know that, you don't even know where to begin with a solution. Those adjacent UPSes or power-strip protectors do nothing useful and may even contribute to the problem. Don't - not for one minute - assume a surge protector is the same thing as surge protection. They are two different components of a protection 'system'.

Facts come first: for example, exactly when the problem occurs and what is happening at the same time. That means you need a tester that will see the problem and record when the failure happens.

The simplest diagnostic tool is ping, which comes with every OS. Ping can be set up to run repeatedly, so one can observe when problems happen. Some programs can do repeated pings and record each failure with a time code.
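For example, a few lines of Python can wrap the OS ping command and timestamp every failure. A sketch only: the target address is a placeholder, and the flags shown are Linux-style; Windows wants "-n 1" instead:

# ping_log.py - ping a target once a second and timestamp any failure.
# A sketch only: Python 3; the target address is a placeholder and the
# flags below are Linux-style ("-c 1" = one ping, "-W 2" = two-second wait).
# On Windows use ["ping", "-n", "1", TARGET] instead.
import subprocess
import time

TARGET = "192.168.0.1"   # placeholder: something on the far side of the suspect switch
CMD = ["ping", "-c", "1", "-W", "2", TARGET]

with open("ping_log.txt", "a") as log:
    while True:
        ok = subprocess.call(CMD, stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL) == 0
        if not ok:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(stamp + "  no reply from " + TARGET + "\n")
            log.flush()
        time.sleep(1)

Leave it running on a workstation; when the network next hangs, the log gives the exact time of failure to correlate with anything else going on.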

Another test involves stressing the system. All (responsible) Ethernet manufacturers provide comprehensive diagnostics. Set up two (or more) NICs with diagnostics from the same manufacturer: one outputs continuous, worst-case data patterns that the other NIC(s) echo back. Does the network stay stable with this worst-case testing ongoing?

You have currently provided only one useful fact: the switch appears to be locking up and is restored by power cycling, and apparently a different switch suffers the same failure. OK. So either the problem is coming in on the network wires or it is an AC power problem. Numerous types of power problem exist. A UPS would only address two - brownouts and blackouts. A UPS does not address noise, surges, or harmonics.

This problem need not be created on the AC power wires either; it could be in the safety ground wire. But again, don't even try to fix anything yet. First, what else is on that circuit? Using a multimeter, what are the voltages between each pair of the three AC prongs on that wall receptacle? Consider later an expensive series-mode filter as a temporary measure - a test - to determine whether AC power is even related.

When the failure happens, what are all the indicators on the switch's front panel? What do the indicator lights on each computer's Ethernet NIC report? How do these lights change as each computer is disconnected and reconnected to the network while the problem is ongoing? Again, solve things both faster and the first time by recording all such details. Then make only one minimal change at a time to see how each change affects the problem. Solutions come later. Don't fall for those mythical UPS and surge-protector solutions. Collect facts so that the problem (and not its symptoms) is clearly identified. Solutions come later.

Reply to
w_tom

Check the manufacturer's web site for tech notes?

Reply to
Al Dykes

Thanks for the advice guys. It really is appreciated. The problem I have is that when the failure occurs I am expected to restore network connectivity immediately, so it's difficult to troubleshoot what is actually happening at the time of failure. I've taken on board the advice about NIC tests and am currently setting up stress-testing sender/responder tests to see if I can replicate the problem. I've also taken on board the advice about potential power issues and am asking our electrical experts to investigate. If only I could isolate the problem and make the failure repeatable...

Reply to
BC

Further to Tom's advice, you should try to make this problem repeat. If you can make it repeat, that points to software (including firmware); if it won't repeat, it is some electrical/hardware transient.

Try some stress testing in off-peak hours (evenings, weekends?). `ttcp` can easily saturate a network, and I would run it on at least four stations going through the suspect switch.
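If ttcp isn't to hand, a crude stand-in can be knocked together in Python. A sketch only, with an arbitrary port number and data pattern: run it as a 'sink' on one box and as a sender on several stations whose traffic crosses the suspect switch:

# blast.py - crude ttcp stand-in: one "sink" box, several senders.
# A sketch only (Python 3); the port number and data pattern are arbitrary.
import socket
import sys
import time

PORT = 5001                    # arbitrary test port
CHUNK = b"\x55\xaa" * 32768    # 64 KB of an alternating bit pattern

def sink():
    # Accept one connection at a time and report the throughput achieved.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(5)
    while True:
        conn, addr = srv.accept()
        total, start = 0, time.time()
        while True:
            data = conn.recv(65536)
            if not data:
                break
            total += len(data)
        secs = max(time.time() - start, 0.001)
        print("%s: %.1f MB in %.1fs (%.1f Mbit/s)"
              % (addr[0], total / 1e6, secs, total * 8 / secs / 1e6))
        conn.close()

def send(host, seconds=60):
    # Push the pattern as fast as TCP will carry it for a fixed period.
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.connect((host, PORT))
    end = time.time() + seconds
    while time.time() < end:
        c.sendall(CHUNK)
    c.close()

if __name__ == "__main__":
    if len(sys.argv) >= 2 and sys.argv[1] == "sink":
        sink()
    elif len(sys.argv) >= 3 and sys.argv[1] == "send":
        send(sys.argv[2])
    else:
        print("usage: blast.py sink | blast.py send <sink-ip>")

Run `python blast.py sink` on the receiver and `python blast.py send <sink-ip>` on the senders, then watch whether the switch falls over under sustained load.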

-- Robert

Reply to
Robert Redelmeier

And that is your #1 task. Unfortunately, others don't have sufficient technical experience to appreciate how things are solved, and so we have this other problem - teaching others about reality. Good luck with your testing. And don't forget to report back. It's a two-way street; this is how we all learn.

Reply to
w_tom
