tolerating loops in a network but not using STP

- R
- Rahul
  
posted
14 years ago

Thu, Feb 25, 2010 11:46 PM

I'm designing a low-latency Ethernet network for ~300 servers (a compute cluster). End-to-end latencies are low enough that switching latency becomes critical. Each additional switch hop adds, say, 10% more. Most switches have ~40 ports. The best topology, in theory, is a mesh where each switch has a direct connection to every other switch. (I need about 8 switches total.) This lets any two ports talk with at most two switch hops one-way.

Unfortunately, broadcast traffic and loops can't normally coexist. I could do Spanning Tree but that just breaks loops. That defeats the network design of a highly-connected mesh.

To tolerate loops I could use routers and separate subnets. But routing seems more expensive and higher latency than switching, again defeating the original purpose.

Are there any other creative options? Are there protocols / switches that will selectively apply STP to only broadcast traffic?

[Bandwidth is not as big a concern. Latency is what I am after. Reliability and redundancy are not so important. Just need hardware that does fast packet switching. No fancy switch magic. No QoS, VLANs etc. All attached equipment is of the same speed. Network topology is relatively static for long periods.]

- G
- glen herrmannsfeldt
  
posted
14 years ago

Thu, Feb 25, 2010 11:59 PM

There are layer 3 switches. The exact difference between a layer 3 switch and a router isn't so obvious, but lower latency might be one.

Well, even if you do that, you will still get loops of unicast data which seems likely to slow down other traffic. Most switches now are store and forward, but there used to be discussion of ones that would start sending before the packet was completely received. Well, that only works if both ports are the same speed, and does have the disadvantage that defective packets (collision fragments, etc.) are forwarded.

It seems that VLANs might also do it. I think that would require VLAN-aware hosts, such that traffic could be sent on the appropriate VLAN.

-- glen

- R
- Rahul
  
posted
14 years ago

Fri, Feb 26, 2010 12:37 AM

glen herrmannsfeldt wrote in news:hm72t5$i3m$ snipped-for-privacy@naig.caltech.edu:

Loops of unicast data should only happen in the startup period when a switch is not sure what port to send a unicast packet down, right? After that the switch cannot end up looping a packet, I think.

What if I kept STP on for the startup period and then turned it off after stabilization of the unicast traffic? Of course, assuming some creative way of dealing with the broadcast traffic is found.

Right. In fact, the ones I am exploring are indeed cut-through. Maybe the technology has come full circle. At least when one does not do fancy stuff, deep packet inspection is not necessary to know where to forward a packet.

But these switches still have a small latency. And the more hops the packet makes, the worse this problem gets.

That constraint is OK with me. I have identical hardware. No mixing.

True. No checksums. But for lower latency one can live with this. Especially if the interconnect is not very noisy.

That's something new! Thanks! How would I do it with VLANs? I'm not seeing the solution you suggest. But that might be my way out. So I'd like to know more.

All hosts are Linux clients so, in theory, I could do whatever tweaks are needed to their route tables, ARP caches etc.

- R
- Rick Jones
  
posted
14 years ago

Fri, Feb 26, 2010 1:53 AM

At the risk of typing words into Glen's keyboard, each VLAN would be a separate IP subnet, and you would probably do switch-specific things to make each of your N meshed switches the root of a different VLAN's spanning tree. You would then have a priori knowledge in your applications that node Fred was "closer" to node Ethel via VLAN 12 than VLAN 14 and to reach Fred you would use the IP assigned to Fred's VLAN 14 "interface." (Assuming of course your comms were still all IP based)
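A rough sketch of the per-VLAN root idea, assuming a full mesh with equal link costs, so that each VLAN's spanning tree collapses to a star centred on that VLAN's root switch. The VLAN numbering (10 + switch index) and the function names are made up for illustration, and the sketch only counts switches on the path; it configures nothing on real hardware.

```python
def switches_on_path(root, src_switch, dst_switch):
    """Switches a frame crosses when the spanning tree is a star at `root`."""
    if src_switch == dst_switch:
        return 1                      # both hosts hang off the same switch
    if root in (src_switch, dst_switch):
        return 2                      # the direct link to the root is unblocked
    return 3                          # src -> root -> dst detour


def pick_vlan(src_switch, dst_switch, vlan_root):
    """Return the VLAN whose tree gives the fewest switches on the path.

    vlan_root maps a VLAN id to the switch acting as that VLAN's root.
    """
    return min(
        vlan_root,
        key=lambda vlan: switches_on_path(vlan_root[vlan], src_switch, dst_switch),
    )


# Hypothetical numbering: 8 switches, VLAN 10+i rooted at switch i.
vlan_root = {10 + i: i for i in range(8)}
best = pick_vlan(2, 5, vlan_root)
print(best, switches_on_path(vlan_root[best], 2, 5))
```

The host would then address its peer by the IP that peer holds on the chosen VLAN, which is the "a priori knowledge" described above.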

Of course fully meshing N switches will start to consume switch ports: N-1 of them on each switch if I've done my math correctly.
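To put numbers on the port cost, here is a small sketch. It assumes exactly one link between each pair of switches and takes the "~40 ports" and "~300 servers" figures from the original post at face value; the function names are illustrative only.

```python
def host_ports_in_full_mesh(num_switches, ports_per_switch):
    """Ports left for servers when every switch links directly to every other."""
    uplinks_per_switch = num_switches - 1          # one link to each peer
    free_per_switch = max(ports_per_switch - uplinks_per_switch, 0)
    return num_switches * free_per_switch


def min_switches_for(hosts, ports_per_switch):
    """Smallest full mesh that still leaves `hosts` free ports, or None."""
    for n in range(1, ports_per_switch + 1):
        if host_ports_in_full_mesh(n, ports_per_switch) >= hosts:
            return n
    return None


print(host_ports_in_full_mesh(8, 40))   # host ports with 8 switches
print(min_switches_for(300, 40))        # switches needed for 300 hosts
```

Under these assumptions, 8 switches of 40 ports leave 8 * 33 = 264 host ports, so a strict full mesh for 300 servers would need 10 such switches (10 * 31 = 310).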

rick jones

- G
- glen herrmannsfeldt
  
posted
14 years ago

Fri, Feb 26, 2010 2:00 AM

Rahul wrote: (snip, I wrote)

Consider the case where you have N hosts, each with a direct ethernet link to each other host. That is N*(N-1)/2 links. For small N, you might even be able to do that.

Instead of that, you could arrange N*(N-1)/2 VLANs, each host configured to route to the appropriate one for each destination. I think that works for N a little bigger than above, but maybe still not big enough.

Next, connect M (with M All hosts are Linux clients so, in theory, I could do whatever tweaks are

- R
- Rick Jones
  
posted
14 years ago

Fri, Feb 26, 2010 2:14 AM

The latest incarnation of that would be when asked if one is a Doctor/Lawyer/MiracleWorker one answers "No, but I stayed at a Holiday Inn Express last night." :)

rick jones

- G
- glen herrmannsfeldt
  
posted
14 years ago

Fri, Feb 26, 2010 2:44 AM

Rick Jones wrote: (previously snipped reply by me)

Yes, similar to the configuration with links between each pair of hosts, where you need N*(N-1)/2 links, each with its own IP subnet. I used N for the number of hosts, M for the number of switches, hopefully a little less than N. So M switches, with M*(M+1)/2 VLANs, each an IP subnet. That gets you two switches between each host pair. I am not sure what the arrangement would be for three or four between each host pair.

Yes, but somewhat easier than putting enough ethernet NICs into each station. Well, with four port NICs you get partway there.

There might be some interesting combinations of four port NICs, each connected to four VLAN switches, but not all switches connected to all other switches. Maybe each switch connected to 1/4 of the other switches...

-- glen

- A
- Albert Manfredi
  
posted
14 years ago

Fri, Feb 26, 2010 9:49 PM

Why does spanning tree defeat any purpose of the mesh?

If you have a fully connected mesh, RSTP will create very short branches in the spanning tree. Ideally, only two switches max in the path. That's one good thing. And if a break occurs in a link, RSTP will soon find an alternate path.

Routing doesn't tolerate loops either. Routers may use, for example, OSPF to create paths from source to sink with no loops. The reconfiguration times of RSTP and OSPF should, if anything, favor RSTP. This was not true for the old STP, but it should be the case now.

Bert

- G
- glen herrmannsfeldt
  
posted
14 years ago

Fri, Feb 26, 2010 10:43 PM

Albert Manfredi wrote: (snip, someone wrote)

He wants a bunch of switches connected together such that each packet takes the shortest path through. If you have M switches, each connected to M-1 other switches, you can get from anywhere to anywhere through only two. I don't think spanning tree will do that.

For fully connected, two switches max, including the start and end switches, requires it to take the direct link.
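A quick count of how a single spanning tree treats a full mesh. This assumes equal link costs, in which case every link between two non-root switches ends up blocked and the active tree is a star at the root; the function name and the choice of switch 0 as root are illustrative.

```python
from itertools import combinations


def pair_counts_single_tree(num_switches, root=0):
    """Count switch pairs that keep their direct link vs. detour via the root."""
    direct = via_root = 0
    for a, b in combinations(range(num_switches), 2):
        if root in (a, b):
            direct += 1        # link to the root stays forwarding
        else:
            via_root += 1      # a-b link is blocked: path is a -> root -> b
    return direct, via_root


print(pair_counts_single_tree(8))
```

For 8 switches that is 7 pairs on the direct link and 21 pairs crossing three switches, which matches the objection that one tree cannot give every pair the two-switch path.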

-- glen

- S
- Stephen
  
posted
14 years ago

Fri, Feb 26, 2010 11:06 PM

Try some modular switches to cut the number of hops.

i mainly work on cisco these days, so not as familiar with other kit.

a big Cisco Catalyst 6509 will handle 48 10/100/1000 ports per blade and 7 or 8 blades, so you can get all your ports into 2 boxes.

formatting link

There are lots of blade types for different trade offs in bandwidth vs money, so you need to talk to someone who knows how to put the config together.

if you need faster ports then the Cisco Nexus units give high density 10G and FCoE.

mind you - not cheap......

formatting link

others make chassis switches which expand to high port counts - Extreme, Foundry etc.

You can cheat with protocols that make 2 devices pretend to be the same unit for 802.3ad link aggregation. This lets you spread the 2 or more links in a group across a pair of switches. Cisco call this VSS, but Nortel Passport switches had this a long time ago (but no idea what it is called).

the classic way to cut the number of hops is to use a "snowflake" type topology - ie a star of stars.

The flip side is this concentrates traffic on the switch to switch links, so you end up with bottlenecks - which is why a big switch in the centre can make sense. In a chassis the blade to blade backplane substitutes for the stackable switch interconnect.

if you need resilience then replicate the central switch 1st - your biggest single point of failure is then an edge switch.

That gives you at most 3 hops across the network.

Most high end L3 switches have around the same forwarding delay for L2 and L3 - after all the packets follow the same flow thru the hardware....

Stackable switches? Think of it as a switch backplane on short special cables.

1 thing to remember is that faster links and switches with fast ports will tend to have lower latency.

So all other things being equal, 10G interconnects between switches should reduce latency compared to, say, multiple 1G links in link groups.
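One part of that effect is easy to estimate: a store-and-forward switch has to receive the whole frame before it can start sending it, so each hop costs at least one serialization time. The sketch below covers only that component (no lookup, queuing, or cable delay), and the 1500-byte frame size is just an example value.

```python
def serialization_delay_us(frame_bytes, link_gbps):
    """Time to clock one frame onto the wire, in microseconds."""
    bits = frame_bytes * 8
    return bits / (link_gbps * 1000.0)   # 1 Gb/s moves 1000 bits per microsecond


def store_and_forward_floor_us(frame_bytes, link_gbps, switch_hops):
    """Lower bound added by switches that buffer the full frame at each hop."""
    return switch_hops * serialization_delay_us(frame_bytes, link_gbps)


for gbps in (1, 10):
    print(gbps, "Gb/s:", serialization_delay_us(1500, gbps), "us per hop")
```

A 1500-byte frame takes about 12 us to serialize at 1G and about 1.2 us at 10G, so the per-hop floor drops by the same factor of ten as the link speed rises.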

look at some test reports to get an idea of what has been proven, but google will find plenty more....

formatting link

- S
- Stephen
  
posted
14 years ago

Fri, Feb 26, 2010 11:17 PM

i would want to see some test results before i went this way (as some "cut thru" ethernet switches were actually slower than store and forward).

cut thru died out for good reasons, and just going faster doesn't invalidate that.

Well cut thru only works if most of the time the target network is idle.....

Have you thought about just going 10x faster and store and forward?

The other bit of latency build up is congestion and buffering delay, and the quick fix for that has traditionally been run the network at low load levels.

- R
- Rahul
  
posted
14 years ago

Mon, Mar 1, 2010 12:45 AM

Stephen wrote in news: snipped-for-privacy@4ax.com:

Thanks for all the tips guys! I am using 10G switching and cards. So not much by way of hardware that I can change. But even with 10G switching there is a finite amount of switching delay. With traditional 1G switching the major chunk of the latency was in the host adapters. But with my newer adapters the switching delays are starting to get major. Hence the reduction of hops goal.

I'm going to explore stacking etc.; let's see if one of them solves this issue.

Ultimately, though, STP seems sort of a solution solving the wrong problem. STP seems to treat loops as essentially evil and meant to be broken. That seems counter-intuitive to me.

Most "natural" networks have loops [water piping, electric grids, roads etc]. Maybe ethernet networks are fundamentally different in some sense? Or is it only a protocol-specific idiosyncrasy?

Loops could be a way to provide redundancy, congestion alleviation or simply shorter paths (like my situation). Out in the broader internet loops seem to be tolerated and even encouraged. Why doesn't this rationale trickle down to switched-only networks?

Wouldn't there have been better ways to deal with loops? e.g.

(a) Use the STP-identified loops-to-be-broken to identify switch ports that broadcasts should not be pushed out of

(b) Use a TTL on broadcast packets

....or some other smarter way.

But to de-link every link that could cause a loop and make every network loopless seems a very harsh measure. [Then again I am a networks newbie so maybe I don't see the obvious elegance of the STP approach! :) ]
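To see why loops and broadcasts clash, here is a toy simulation of plain flooding in a full mesh with no spanning tree: each switch resends a broadcast on every inter-switch link except the one it arrived on. The `ttl` parameter models the hop-limit idea in (b) and is purely hypothetical, since standard Ethernet frames carry no TTL field. Host-facing ports and timing are ignored.

```python
def flood_rounds(num_switches, rounds, ttl=None):
    """Return the number of broadcast copies on inter-switch links per round."""
    # Each frame: (switch holding it, switch it came from, remaining hops)
    frames = [(0, None, ttl)]            # a host on switch 0 sends one broadcast
    counts = []
    for _ in range(rounds):
        next_frames = []
        for switch, came_from, remaining in frames:
            if remaining is not None and remaining == 0:
                continue                 # hypothetical TTL expired: drop the copy
            new_ttl = None if remaining is None else remaining - 1
            for neighbor in range(num_switches):
                if neighbor != switch and neighbor != came_from:
                    next_frames.append((neighbor, switch, new_ttl))
        counts.append(len(next_frames))
        frames = next_frames
    return counts


print(flood_rounds(4, 4))          # no loop protection
print(flood_rounds(4, 4, ttl=2))   # with a hypothetical 2-hop limit
```

With four meshed switches the copies double every round when nothing stops them, while a hop limit caps the storm but still delivers duplicates to each switch before the copies die.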

- C
- Char Jackson
  
posted
14 years ago

Mon, Mar 1, 2010 11:07 PM

I'm no expert but I have the opposite perspective. Loops seem to be evil to me. Multiple parallel paths are fine, but a path that feeds back into itself (a loop) doesn't seem like a good idea at all.

Do you have an example of a water piping loop or an electrical loop? I can't think of any real world examples, nor can I think of any advantages of these kinds of loops. Again, multiple parallel paths are one thing, but loops are quite another. I don't consider the road system to be an example of a 'natural network'.

It sure sounds like you're confusing loops with multiple parallel paths. Multiple paths can be managed by routing protocols such as OSPF, but loops are essentially black holes and thus are not tolerated or encouraged. Once a packet enters a loop, how would it ever break out?

- R
- Rick Jones
  
posted
14 years ago

Tue, Mar 2, 2010 12:17 AM

I used to live in a condo complex where the hot water was on a loop. Meant that time to hot water out the tap was quite small. At the cost of having what amounted to a one pipe radiator running through the building...although I'm sure it was well insulated.

rick jones

- R
- Rahul
  
posted
14 years ago

Tue, Mar 2, 2010 12:42 AM

Char Jackson wrote in news: snipped-for-privacy@4ax.com:

You are probably right. Aren't multiple parallel paths to switches loops?

e.g.

ABC

AEC

AC

Now is that a "loop" or a multiple parallel path?

- R
- Rahul
  
posted
14 years ago

Tue, Mar 2, 2010 12:54 AM

Char Jackson wrote in news: snipped-for-privacy@4ax.com:

I'm probably worse. :)

But with bi-directional traffic the distinction is lost, isn't it? Unless you have a one-way "valve" of sorts. Maybe that's what the switching protocols could have?

Here's one from the electrical world:

formatting link

That seems to have loops in it. Although one could argue those are "multiple parallel paths" too. Not sure.

Why not? It seems as "natural" as a switching network.

TTL? Or maybe by not allowing a packet to enter a loop. The fact that a physical "loop" exists does not mean it ought to be traversed....

- G
- glen herrmannsfeldt
  
posted
14 years ago

Tue, Mar 2, 2010 12:58 AM

The ones I have seen in condos were not well insulated. Our house has one, installed by the previous owner, which is insulated. It also has a temperature sensor at the end that turns the pump off when it is warm enough at the far end. The loop then comes back to the input side of the water heater.

-- glen

- C
- Char Jackson
  
posted
14 years ago

Tue, Mar 2, 2010 3:24 AM

Examples noted. :)

- C
- Char Jackson
  
posted
14 years ago

Tue, Mar 2, 2010 3:29 AM

No loops, just 3 possible paths from A to C. Routing protocols would determine which of the 3 paths is best.

Am I wrong?
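In graph terms the A/B/C/E example can be checked both ways. The sketch below reads the three paths ABC, AEC and AC as five bidirectional links (that reading is an assumption), finds the fewest-hop path the way a routing protocol with equal link costs would, and also tests whether the link set contains a cycle.

```python
from collections import deque

# Links implied by the paths ABC, AEC and AC (assumed bidirectional).
LINKS = [("A", "B"), ("B", "C"), ("A", "E"), ("E", "C"), ("A", "C")]


def adjacency(links):
    """Build a node -> set-of-neighbours map from undirected links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def shortest_path(links, src, dst):
    """Breadth-first search: fewest-hop path from src to dst, or None."""
    adj = adjacency(links)
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(adj[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def has_cycle(links):
    """True for a connected graph with no duplicate links that is not a tree."""
    return len(links) > len(adjacency(links)) - 1


print(shortest_path(LINKS, "A", "C"), has_cycle(LINKS))
```

So both descriptions hold at once: routing picks A-C as the best of three paths, and the same links form cycles such as A-B-C-A, which is what a flooding Ethernet switch (unlike a router) trips over.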

- C
- Char Jackson
  
posted
14 years ago

Tue, Mar 2, 2010 4:04 AM

Not sure what you mean by all of that, but I think the answer is no and no.

Electrical power certainly flows through the grid, but I'm not aware of a scenario where it enters a loop. I (loosely) define a loop as a specific point that sees the same thing more than once, where the thing is a data packet, a specific current flow, etc. I'm not explaining that very well, unfortunately.

You could certainly argue that the road system is a sort of network, but the things which travel on it are autonomous. If you get into a loop while traveling by car, it's your own fault. Data packets, on the other hand, have no such autonomy. They are addressed prior to being sent and are unable to change their own destination address while in flight.

TTL only kills a packet, it doesn't let it exit a loop. And if the solution is to avoid letting a packet enter a loop in the first place, then what value does the loop have other than being a black hole? That's why I think loops are evil and should be found and repaired.