Connectivity Problem with WS-C3750G-24TS and Broadcom BCM5708C SLB

Hello,

I have a problem with Broadcom Smart Load Balancing and the Cisco WS-C3750G-24TS.

We have a stack consisting of two Cisco WS-C3750G-24TS switches. A Dell PowerEdge server with two Broadcom BCM5708C cards was connected to the stack, one card per switch. The two cards were teamed with Broadcom Smart Load Balancing, using the Broadcom Advanced Control Suite.

About every 20 minutes or so, the Dell server was able to ping, e.g., machines A, B and C, but not D, E and F. After one or two minutes, every address was pingable again. Machines D, E and F were reachable from every other place in the network. There is no log entry on the Cisco side or in the Windows event log correlating with the problem.

We upgraded the NIC drivers and changed cables, without any result. When we remove one of the Broadcom NIC connections, the connection remains stable.

The strange thing is that we have a lot of similar configurations (3750 stacks connected to Dell servers with teamed Broadcom NICs) running without any problems.

The Switch Port configuration is as follows:

interface GigabitEthernet1/0/4
 switchport access vlan 10
 switchport mode access
 macro description cisco-desktop
 spanning-tree portfast
 spanning-tree bpduguard enable

IOS Version is 12.2(25)SEE2

The particular switch that is giving us the problem has been running for 1 year and 10 weeks. Though I am not 100 percent sure, I believe the above configuration had also been running on that switch without problems for months (I don't always get informed about what our server admins are doing :-( )

Any clues?

Thanks in advance,

Torsten

Reply to
Torsten Steuerer

Hi Torsten,

I have been involved in a similar setup (in our case a stack of 2 x 3750 48-port switches and MS teaming on the servers) and have seen something very similar, i.e. periods where things just "go to sleep"... switch wide! It all depended on the MODE of teaming being used on the server. Now I am NOT a server person, but it was discovered that a configuration change had been applied on the server end, and I was able to come up with a work-around that did what we wanted for our setup, using the following.

If I understand it correctly, MS teaming seems to provide about 4 possible configuration modes for teaming operation (all using just 1 IP address on the server):

1. Load balancing using single MAC sharing.
2. Path redundancy using single MAC sharing.
3. Load balancing using 2 MACs.
4. Path redundancy using 2 MACs.

When configured for Mode 1, we had an almost identical situation to yours, except in our case the switch just dumped its entire MAC address table and refreshed it (very slowly), which seemed to take quite a bit of time, so it used to stop ALL traffic on the switch stack (i.e. all 96 ports) until something was done to sort out the issue.

My theory was that Mode 1 was presenting an IDENTICAL MAC on 2 different Switch ports, causing an issue for the 3750's MAC table.
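
If you ever want to see whether that is really happening, the 3750 will tell you where it is currently learning a given MAC; something like the following should do it (the address here is just a placeholder for the server team's MAC):

show mac address-table address 0019.b9aa.bbcc

If the theory holds, the port shown in that output should keep flipping between the two server-facing interfaces while the problem is occurring.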

We found that Mode 2 worked fine, which gave us what we really needed: a backup or alternate path. Only 1 real path was ever active at any one time; the other path NEVER saw the MAC of the server on it until the first path had died.

The downside of this is that it also means you are relying on the server side not doing something stupid that re-enables path sharing (highly possible in an MS environment).

Modes 3 & 4 should be fine, as only 1 MAC per port was ever seen by the 3750.

So if you use teaming for redundancy purposes, and your version of teaming allows this and works the same way as the MS one, then it should be possible to get it to work.

Ours was a failure about every 20 minutes or so as well, except it would knock down the entire stack and take 5-10 minutes to recover.

Nope, not a squeak anywhere on what was going on. Our only clue was that Network access to the 3750 stack completely died until the MAC table had been re-built, but by then everything looked fine again.

Exactly... it's my guess you are getting a single MAC appearing on 2 different switch ports, and this is causing the switch to choke.

How is the teaming configured on these? We also had a set of servers using MS teaming running fine for about a year, but it was only when new servers were added in a different MODE of teaming that the new problem emerged.

Pretty much identical, I think, except we were at SEE3.

Let me guess, they were "tweaking the way things were configured" to get it to run better?.....;-)

Good luck.........................pk.

Reply to
Peter

Have you tried creating a port-channel and putting both of the ports into the port-channel? LACP will automatically allow ports to join the port-channel, but you need to create the port-channel interface first.

See this document.

formatting link
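
Something along these lines should do it on the 3750 side; the port numbers and channel-group number below are just examples, and the Broadcom team would need to be switched to its link aggregation (802.3ad) mode to match. Also, depending on the IOS version, a cross-stack EtherChannel may have to be configured with "mode on" rather than LACP "active":

! logical port-channel interface first (numbering is a placeholder)
interface Port-channel1
 switchport access vlan 10
 switchport mode access
!
! then the two physical server-facing ports, one per stack member
interface range GigabitEthernet1/0/4 , GigabitEthernet2/0/4
 switchport access vlan 10
 switchport mode access
 channel-group 1 mode active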

Reply to
Thrill5

I asked our server admin to point me to a configuration similar to the one that is causing problems. Unfortunately, he is on holiday at the moment, so I checked a few servers that I have access to. On the four servers I checked, I found !!! three !!! different configurations:

- Adapter Fault Tolerance

- Adaptive Load Balancing

- Switch Fault Tolerance

Though all those servers have Intel cards, which obviously don't cause the same problems, the server configs are in no way consistent.

Or on the n-th server, they tried the (n-1)th config (see above) %-}

Greetz to NZ,

Torsten

Reply to
Torsten Steuerer

I thought about that option, but it means additional config and management burden for me, which I want to avoid, especially since I can never be sure whether somebody has repatched something.

Torsten

Reply to
Torsten Steuerer

I agree that the LACP scheme is the way to go.

Sadly it seems that Server NIC configuration is not considered a networking job in a lot of places.

ServerAdmin - "It pings so the network must be OK."

Reply to
Bod43

Finally I found the reason: one machine on the network was poisoning ARP caches. It happened like this:

Take three machines: DBServer, AppServer and BadGuy.

From time to time, BadGuy sent an ARP request for DBServer. In that ARP request, the sender address consisted of the MAC address of BadGuy and the IP address of AppServer! So DBServer overwrote its ARP cache entry for AppServer, which then pointed to the MAC address of BadGuy, and DBServer was unable to reach AppServer until the "real" AppServer sent an ARP request for DBServer again.

This behaviour was most likely caused by Broadcom NICs in combination with VMware. Unfortunately, so far I could only find this blog

formatting link
entries 1 and 7 (Broadcom admitted the ARP cache poisoning problem) pointing out the issue.
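
For the record: if the IOS image supports Dynamic ARP Inspection, the 3750 can be told to drop ARP packets whose sender IP/MAC pairing doesn't match what you expect, which would have caught exactly this kind of bogus ARP at the access port. A rough sketch, with placeholder addresses and port numbers (not something I have rolled out here yet):

! known-good IP-to-MAC bindings for the statically addressed servers (placeholders)
arp access-list SERVER-BINDINGS
 permit ip host 10.0.10.21 mac host 0019.b9aa.bbcc
 permit ip host 10.0.10.22 mac host 0019.b9dd.eeff
!
! apply the ACL and enable inspection on the server VLAN
ip arp inspection filter SERVER-BINDINGS vlan 10
ip arp inspection vlan 10
!
! uplinks or other trusted ports are excluded from inspection
interface GigabitEthernet1/0/24
 ip arp inspection trust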

How could I ever blame my good ol' Ciscos ;-)

Reply to
Torsten Steuerer
