Uplink drops intermittently...

Hi,

Configuration: 14 PCs on Side A; 5 PCs on Side B.
Side A: 1 x 3Com 24-port switch -> uplink to 1 x 8-port switch (have tried Intel & Asus) -> uplink to...
Side B: 1 x 8-port switch (have tried SMC & identical Asus).
(Switches are auto-sensing and the 3Com has an uplink port.)

My uplink between two buildings was working fine for 5 months until it started dropping intermittently - once a day, once every two days, or more often; there is no pattern. When the link goes down, Side B can still browse PCs on its own segment, and 3 PCs on Side A, connected to the same switch that has the uplink to Side B, can still browse Side A.

The original cable run was 160m of CAT5 STP. I have split the cable into two 80m segments with an Intel switch in the middle, without success. I have replaced the cable, without success. I currently have a 60m cable in place which runs directly between the buildings, eliminating the possibility of EMI from factory machines or heavy-duty power cables. With this 60m cable I have placed two identical Asus GigaX 1008 switches on either end, without success.

I have run EtherPeek to monitor packets over the network and seen no unusual traffic, i.e. no excessive broadcasting or flooding. I have run antivirus and anti-spyware scans on all the machines on Side B without any problems being detected.

I would like to know if someone has had a similar experience and what possible solutions I can try. I would like to know the cause before I recommend a fibre or wireless link between the two buildings, and hopefully I can go back to using the 160m run which worked for so long.

Thanks.

Reply to
scott

As long as one PC at A can reach another on B, it is not the uplink, as the description below confirms. It could be almost anything, from bad memory in the 3Com to broken hardware or software messing around with MAC addresses. As long as SOHO switches are used at the small office and not only at the home office, the only cure is to replace _everything_.

[ description of activities on innocent link removed ]
Reply to
Manfred Kwiatkowski

Hi,

Thanks for your response; however, replacing all the hardware is not a solution unless we know that all the hardware has failed. Bad memory on the 3Com switch wouldn't make sense, because Side A still has a connection, as do the 3 PCs on Side A that are connected both to the 3Com and to the switch that runs the uplink to Side B. There is no evidence of a faulty network card or of software creating broadcast storms, nor any indication of enough traffic to take down a network - and even if that were the case, why would it always drop the one specific uplink connection? No new software or hardware has been added to the network since, or before, this problem began.

Reply to
scott

The memory error could affect specific bits only, and since some addresses may have that bit set (or unset), some devices will work and others will not - maybe intermittently, and maybe only on some ports or port groups.
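
To illustrate with a toy example (purely hypothetical - not a claim about the 3Com's internals): if one bit of the address memory is stuck at 1, only MACs that already have that bit set are stored correctly, so some stations keep working while others fail:

# Hypothetical illustration: a single stuck-at-1 bit in a switch's
# address memory corrupts only MACs that have that bit clear.
STUCK_BIT = 18  # position of the faulty bit (arbitrary choice)

macs = [
    0x00A0C914C829,  # sample station addresses (made up)
    0x00A0C914C82A,
    0x0050DA123456,
    0x0050DA173456,
]

for mac in macs:
    stored = mac | (1 << STUCK_BIT)  # what the faulty memory returns
    status = "OK" if stored == mac else "corrupted -> %012X" % stored
    print("%012X  %s" % (mac, status))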

Reply to
Manfred Kwiatkowski

Sounds to me like you have pretty much cleared the physical and data-link layers - my only question being: is the link between the two sides fiber optic?

But you should probably be looking at the layers above. Check the IP addresses and masks, check the ARP caches on the different machines, check the MAC forwarding tables in the switches, and check the DHCP range - are there any hosts with addresses in the range that were not allocated by the DHCP server? Try pinging the broadcast address - do you get multiple responses from a single host? Also check that the switches are consistent about which one believes it is the root bridge. That should pretty much cover the network layer and the bridging part of the data-link layer.
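
If it helps, here is a rough Python sketch of the ARP-cache check (it assumes Python 3 and Windows-style "arp -a" output; the parsing is illustrative only, so adjust for your OS):

# Rough sketch: parse the local ARP cache and flag suspicious entries --
# several IPs claiming the same MAC, or any all-zero/broadcast MAC.
import re
import subprocess
from collections import defaultdict

out = subprocess.run(["arp", "-a"], capture_output=True, text=True).stdout

entry = re.compile(r"(\d+\.\d+\.\d+\.\d+)\s+([0-9a-fA-F-]{17})")
by_mac = defaultdict(list)

for ip, mac in entry.findall(out):
    mac = mac.lower().replace("-", ":")
    by_mac[mac].append(ip)

for mac, ips in by_mac.items():
    if mac in ("00:00:00:00:00:00", "ff:ff:ff:ff:ff:ff"):
        print("Bogus MAC %s claimed by %s" % (mac, ips))
    elif len(ips) > 1:
        print("MAC %s answers for multiple IPs: %s" % (mac, ips))

(A MAC legitimately answering for several IPs can just be a router, so treat hits as leads, not proof.)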

Next I would go to DNS and NetBIOS naming - are they consistent? Does each machine have a single DNS/NetBIOS name (on Windows, use nbtstat -a)? Are they all in the same domain or workgroup?
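
And a rough sketch for the NetBIOS check, assuming Windows and the nbtstat -A <ip> form (capital A queries by IP address); the subnet below is made up, so substitute your own:

# Sketch: sweep a subnet with "nbtstat -A <ip>" and flag NetBIOS names
# registered by more than one host. Windows-only.
import subprocess
from collections import defaultdict

names = defaultdict(list)

for host in range(1, 255):
    ip = "192.168.0.%d" % host
    out = subprocess.run(["nbtstat", "-A", ip],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        # Name-table rows look like: "MACHINE1   <00>  UNIQUE  Registered"
        parts = line.split()
        if len(parts) >= 3 and parts[1] == "<00>" and parts[2] == "UNIQUE":
            names[parts[0]].append(ip)

for name, ips in names.items():
    if len(ips) > 1:
        print("NetBIOS name %s registered by: %s" % (name, ips))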

One intermittent problem that I had was someone bringing in a laptop from home that was in the default MSHOME workgroup, which conflicted with the customer's domain - an endless intermittent problem. Fortunately for me, I noticed it quickly with nbtstat -a.

Also, try expiring all the DHCP leases.

Hope this helps at least eliminate some possibilities. Let us know...

Wrolf

Reply to
Wrolf

Hi,

I was fortunate enough to get a packet capture which covered the time the link dropped. The last packet from a PC on Side B shows its NIC running at 10Mbps instead of the 100Mbps it should be. It was an ARP request of only 64 bytes - I'm not sure if that's correct, but that's a very small packet. The Ethernet broadcast was to destination FF:FF:FF:FF:FF:FF, and its source MAC was listed as 00:00:00:00:00:00.

I've given the user another laptop to work on and pulled the laptop in question off the network. She also then mentioned that her laptop is always the first one on Side B to lose its connection. That was at around 11am today; it's now after 3pm my time and they haven't been down yet, so we'll see. The laptop I took off the network was also used by a previous user who carried it between his home and the office, so I can't rule out the possibility of damage done at an earlier stage.
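
In case it helps anyone else watch for the same symptom, here is a small Python sketch using Scapy (assuming Scapy is installed; it needs admin/root rights to sniff):

# Sketch: watch live traffic for ARP frames with an all-zero source MAC,
# the symptom seen in the capture above.
from scapy.all import sniff, ARP, Ether

def check(pkt):
    if ARP in pkt and pkt[Ether].src == "00:00:00:00:00:00":
        print("Zero source MAC in ARP from %s asking about %s"
              % (pkt[ARP].psrc, pkt[ARP].pdst))

sniff(filter="arp", prn=check, store=0)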

Thanks

Reply to
scott

Try nailing the NIC speed on your PC to 10Mbps/half duplex.

You might also try a new factory-made patch cable.

Reply to
Al Dykes

How were you able to determine the NIC's speed by looking at that packet?

/chris

Reply to
googlegroups

I hope your problem has since been solved, but I wanted to add to this thread in the hope that the information and thoughts may be useful to you or to others who experience something similar. Your posting was excellent, with a lot of information about the problem - thank you!

One thought I had is that adding the Intel switch in the middle of the cable run between "Side A" and "Side B" might have created a problem. Ethernet switches keep a table of which MAC addresses are attached to which of their local ports. At some point these MAC addresses are aged out - removed from the table if no traffic crosses the port they are assigned to. Not all manufacturers use the same aging time for the MAC address table, and this can lead to inconsistent performance problems.

The effect, as I've seen it, can be that it takes some time for traffic to cross the switch and for the MAC address table to re-populate itself; or the switch, having no current information about MAC addresses, floods all traffic to all of its ports - an effect I've heard called "unicast flooding" - which can clog each station with the traffic of many stations.
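
To make the aging behaviour concrete, here is a toy Python model of a switch MAC table (purely illustrative - real switches do this in hardware, and timers vary by vendor):

# Toy model of a switch MAC table with aging. When a destination has
# aged out, the switch must flood the frame to all ports ("unicast
# flooding").
import time

AGE_SECONDS = 300  # a common default; other vendors use other values

table = {}  # MAC -> (port, last_seen)

def learn(src_mac, port):
    table[src_mac] = (port, time.time())

def forward(dst_mac, in_port, num_ports):
    entry = table.get(dst_mac)
    if entry and time.time() - entry[1] < AGE_SECONDS:
        return [entry[0]]  # known and fresh: send out one port
    # Unknown or aged out: flood to every port except the ingress port
    return [p for p in range(num_ports) if p != in_port]

learn("00:a0:c9:14:c8:29", port=3)
print(forward("00:a0:c9:14:c8:29", in_port=1, num_ports=8))  # [3]
print(forward("00:50:da:12:34:56", in_port=1, num_ports=8))  # flooded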

What gave me this idea was your clue of seeing a PC issue an ARP frame. Since ARP timeouts on a host are usually measured in hours, either the PC had never talked to that host before, had just booted up, or some NIC issue flushed its ARP table.

Hope all is well.

/dmfh

Reply to
DMFH

Hi,

Thank you all for your help and insight. In the end, once the overheating SMC switch was replaced, it seems the network hardware was not the cause of this problem, but rather a faulty power supply on the CCTV server. The CCTV server was found to be unresponsive - nothing worked other than the power light turning on. The power supply was subsequently replaced, but the problem continued. The server was then unplugged from the network, and still the problem persisted. We shut the server down but still had no luck, until we eventually unplugged the server from the power outlet in the server cabinet. All has been well for going on two weeks now. I think the moral of the story is that when you think you've left no stone unturned, think again!

Thanks again.

Reply to
scott
