complex design issue...

- J
- Jason
  
posted
19 years ago

Wed, Mar 23, 2005 3:59 AM

----enet---- : denotes Ethernet

----e1---- : denotes E-1 leased facility

Two main sites: R1a----enet----R1b----e1----R34a----enet----R34b

Chain of routers: Group one: R2----e1----R3----e1----R4----e1----R5----e1----R6----e1----R7----e1----R8

Group two: R8----e1----R9----e1----R10----e1----R11----e1----R12----e1----R13----e1----R14

Group three: R14----e1----R15----e1----R16----e1----R17----e1----R18----e1----R19----e1----R20----e1----R21

Group four: R21----e1----R22----e1----R23----e1----R24----e1----R25----e1----R26----e1----R27----e1----R28----e1----R29

Group five: R29----e1----R30----e1----R31----e1----R32----e1----R33

Each of the start and end routers in each group also have connections to two of the main sites, like this:

R2----e1----R1a

R8----e1----R1a and ----e1----R34a

R14----e1----R1b and ----e1----R34b

R21----e1----R1a and ----e1----R34a

R29----e1----R1b and ----e1----R34b

R33----e1----R34b

Crude ASCII-art diagram: all connections E-1 except as noted.

[Diagram not recoverable. It showed R1a----enet----R1b-------------R34a----enet----R34b across the top, E-1 links fanning down from those four routers to R2, R8, R14, R21, R29 and R33, and the router chains hanging between those six, as listed above.]

I can't change the E1 connections (customer provided).

Possibilities:

OSPF area 0 is the R1a-R1b-R34a-R34b chain of routers.

OSPF area 1 is R1a-R2-R3-R4-R5-R6-R7-R8-R1a

OSPF area 2 is R1a-R8-R9-R10-R11-R12-R13-R14-R1b-R1a

OSPF area 3 is R1b-R14-R15-R16-R17-R18-R19-R20-R21-R34a-R1b

OSPF area 4 is R34a-R21-R22-R23-R24-R25-R26-R27-R28-R29-R34b-R34a

OSPF area 5 is R34b-R29-R30-R31-R32-R33-R34b

How to handle the R1a-R21, R1b-R29, R34a-R8, and R34b-R14 connections? Area 0?

I have strong doubts that this is going to work with any routing protocol. For example, how do I make traffic originating at R5 default to the R4 link? A floating static, I think.

How will a failure of R2 propagate? R3 should reroute traffic to R4, R4 to R5, etc. until R8, which should send it to R1a or R34a, depending on ultimate destination (other networks on the other side of R1 and R34).

The idea is that a failure of any one router will affect only that site, no others. Or if Site 1 goes down, then all traffic can go to Site 34 via one of the other connections. Let's say Site 1 is dead, and the connection from R8 to R34a is also down, then traffic from R2 thru R13 should get forwarded to R14 (via the chain of routers) which will then forward the traffic to R34a. So, in general, the concept is a highly resilient network. I just have doubts that the daisy-chain of routers is going to perform anywhere near what the customer wants.
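The resilience claim can be checked on paper before any routing protocol is involved. The following is a minimal Python sketch, assuming the link list given earlier in this post is complete and treating the network as a plain graph (no routing protocol, no link costs). The function names `build_links`, `hops` and `single_failures_that_partition` are made up for this illustration.

```python
from collections import deque

def build_links():
    """Return the links listed in the post as (router, router) pairs."""
    core = [("R1a", "R1b"), ("R1b", "R34a"), ("R34a", "R34b")]
    chain = [(f"R{i}", f"R{i + 1}") for i in range(2, 33)]  # R2-R3 ... R32-R33
    uplinks = [("R2", "R1a"), ("R8", "R1a"), ("R8", "R34a"),
               ("R14", "R1b"), ("R14", "R34b"), ("R21", "R1a"),
               ("R21", "R34a"), ("R29", "R1b"), ("R29", "R34b"),
               ("R33", "R34b")]
    return core + chain + uplinks

def hops(links, src, dst, dead_routers=(), dead_links=()):
    """Breadth-first hop count from src to dst, or None if unreachable."""
    dead_routers = set(dead_routers)
    dead = {frozenset(link) for link in dead_links}
    if src in dead_routers or dst in dead_routers:
        return None
    adj = {}
    for a, b in links:
        if a in dead_routers or b in dead_routers or frozenset((a, b)) in dead:
            continue  # skip anything touching a failed router or failed link
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None

def single_failures_that_partition(links):
    """List every router whose failure cuts off some other surviving router."""
    routers = sorted({r for link in links for r in link})
    bad = []
    for failed in routers:
        alive = [r for r in routers if r != failed]
        root = alive[0]
        if any(hops(links, root, r, dead_routers=[failed]) is None for r in alive):
            bad.append(failed)
    return bad

if __name__ == "__main__":
    links = build_links()
    print("single failures that isolate other sites:",
          single_failures_that_partition(links))
    print("R2 -> R34a hops with Site 1 and R8-R34a down:",
          hops(links, "R2", "R34a", dead_routers=["R1a", "R1b"],
               dead_links=[("R8", "R34a")]))
```

On this link list no single router failure isolates any other router, and in the "Site 1 dead plus R8-R34a down" case R2 still has a physical path to R34a, 14 hops long, running down the chain to R14 and then across R34b. Whether a routing protocol finds that path within the time limit is a separate question.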

Customer also has requirement for recovery (i.e. re-convergence) from any router/link failure at max _eighteen_ seconds. This network will support a life/safety application. (hmmm, doesn't Cisco have a "don't sue us if you use this in a life/safety network and someone dies" clause in the license agreement? )

Comments, suggestions welcome. Can't provide much more technical info as an NDA applies. Opportunities to re-connect the routers in a differing fashion are severely limited.

Don't even ask about the IP address structure the customer proposed (and has written proprietary device code to) that simply will not work.

Jason.

remove the obvious to get my email........

- T
- thrill5
  
posted
19 years ago

Thu, Mar 24, 2005 2:02 AM

Whoever designed this network forgot the golden rule, the KISS principle (Keep It Simple, Stupid). I have never been a fan of OSPF because it doesn't scale without a lot of manual configuration. EIGRP scales extremely well and would work fine in this network. The biggest problem I see is that there are too many paths and too much redundancy. After a certain point, adding more redundancy does not increase reliability; it decreases it, because the routing becomes very complex (as in this scenario) and makes troubleshooting difficult. If going the EIGRP path, use summary routes on as many links as you can, as this will keep the convergence time low.


- J
- Jason
  
posted
19 years ago

Thu, Mar 24, 2005 2:49 AM

Scott, thanks.

Well, more bad news for this network. Ultimately, it will not be Cisco equipment. So EIGRP is out. Cisco will be used for the lab and pilot. Some 'ruggedized' router will be used for production. Made by a company that so far I've been unable to find a successful router product from.

The topology below does a pretty good job of providing circuit redundancy, while minimizing the number of E1s required. Also, this network is geographically linear. Think of, oh, say exits along an interstate (that's not what this is, but the linear nature is pretty close). Site 1 and Site 34 are at the two ends. Travelling along the interstate, you need to communicate with local devices to find out if the way ahead is clear. This is done via a parallel RF network that interfaces to this network at Site 1 and Site 34. Based on your location, you talk to a local device.

A very large problem is actually a result of all the redundancy: when site one goes down, and the traffic from Site 2 propagates down the chain of routers finally getting to Site 8, where, ooops, the connection between area 1 and area 0 is broken, how to pass the traffic out of area 1 into area 2, and get it forwarded to site 34?

Ow, my head begins to hurt.

I think I'm going to have to figure out how to tell the customer this design (dreamt up by non-IP-network types) simply will not work, and give me a week to come up with a decent solution. But they are not the ultimate customer, he/she/they/it are half a world away, and they control the E1 provisioning.

More ideas?


- S
- stephen
  
posted
19 years ago

Thu, Mar 24, 2005 12:30 PM

well - maybe.

i didnt see the original post, but 30+ routers is fine in a single OSPF area.


Scaling with OSPF usually gets to be an issue once there are multiple areas (because OSPF is only a link state protocol within an area, not between them).

so the 1st Q should be - do i need areas at all?

it doesnt seem to make much sense to deploy a protocol optimised for complex networks and then confine the resilient protocol to subsets of the network that arent really resilient.

the main reason in this network may be the designer wants to summarise routes at area boundaries - again you need to check if the scale is such that you care to add the complexity and the risk that the summarisations will break the resilience.

golden rule is set up the routing protocol properly, put in statics as close to where the static "starts" in the cloud, and let it propagate. if you need lots of statics for similar routes scattered across the network then the design is wrong.

1 area fixes this.

the reliability is going to depend on link reliability (which if they are radio is a problem, or if they are off an SDH / fibre link system should be fine)- a decent modelling package can let you figure out what is likely

you will have to alter the default timeouts - the dead timer + convergence is going to have to be sub 18 sec combined, and the default OSPF dead timer is 40 sec (4 x the 10 sec hello) on point-to-point links..... Cisco also only re-evaluates the link state database every few seconds.
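To put rough numbers on that, here is a back-of-envelope budget in Python. It is only a sketch: it assumes the failure is detected purely by the dead timer (a loss-of-carrier on the E1 would be noticed sooner), and the 5 sec SPF delay, the 100 ms per-hop flooding figure and the 16-hop worst-case distance are illustrative assumptions, not measured or vendor-confirmed values.

```python
LIMIT_MS = 18_000  # the customer's "eighteen seconds" requirement

def reconvergence_budget_ms(dead_interval_ms, spf_delay_ms,
                            flood_per_hop_ms, max_hops):
    """Worst-case estimate: detect the failure, flood the change, run SPF."""
    detection = dead_interval_ms              # neighbour declared dead
    flooding = flood_per_hop_ms * max_hops    # LSA crawls along the chain
    return detection + flooding + spf_delay_ms

# Default dead timer of 40 s with an assumed 5 s SPF delay.
defaults = reconvergence_budget_ms(40_000, 5_000, 100, 16)

# Hypothetical tuned values: 1 s hello / 4 s dead, 1 s SPF delay.
tuned = reconvergence_budget_ms(4_000, 1_000, 100, 16)

if __name__ == "__main__":
    for name, total in (("defaults", defaults), ("tuned", tuned)):
        verdict = "within" if total <= LIMIT_MS else "over"
        print(f"{name}: {total} ms, {verdict} the {LIMIT_MS} ms limit")
```

With these inputs the defaults land at 46.6 sec and the tuned set at 6.6 sec. The dead timer dominates either way, so that is the first knob to look at, subject to the flapping-link concern on slow E1s.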

maybe you need a reliable underlying network? SDH is the standard carrier tool for this, and if you have the capacity and money and right kind of underlying pipes, you can get link recovery in 50 mSec.

high speed convergence in ISP backbones often uses IS-IS, where the hello times can be sub second - not sure if that is practical with these link speeds though - you dont want minor hits or flapping links to cripple the network.

- V
- Vincent C Jones
  
posted
19 years ago

Thu, Mar 24, 2005 2:23 PM

Interesting challenge. You'll probably get lots of recommendations to "just use EIGRP" even though this is a classic worst-case topology for EIGRP. I'd stick with a link-state protocol (OSPF or IS-IS).

Unless your diagram is only a small sample of the final implementation, you can resolve your concerns with OSPF simply by expanding backbone area zero to include the second layer of interconnects (so it contains R1a, R1b, R34a, R34b, R2, R8, R14, R21, R29, and R33). The assignment of other areas then becomes obvious.
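A quick way to sanity-check the expanded backbone is to ask whether area 0 stays in one piece when any single backbone router dies. The Python sketch below does that under two assumptions: the area 0 link list is the three core links plus the ten E-1 uplinks described in the original post, and a router with no live area 0 link left is treated as having dropped out of the backbone rather than as a partition. The names `components` and `backbone_splits` are invented for this example.

```python
CORE = [("R1a", "R1b"), ("R1b", "R34a"), ("R34a", "R34b")]
UPLINKS = [("R2", "R1a"), ("R8", "R1a"), ("R8", "R34a"),
           ("R14", "R1b"), ("R14", "R34b"), ("R21", "R1a"),
           ("R21", "R34a"), ("R29", "R1b"), ("R29", "R34b"),
           ("R33", "R34b")]
AREA0_LINKS = CORE + UPLINKS  # every link with both ends in the backbone

def components(links):
    """Group routers into connected components using only the given links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adj[node] - group)
        seen |= group
        groups.append(group)
    return groups

def backbone_splits(area0_links):
    """Map each failed router to the pieces area 0 breaks into, if more than one."""
    routers = sorted({r for link in area0_links for r in link})
    split = {}
    for failed in routers:
        live = [link for link in area0_links if failed not in link]
        groups = components(live)
        if len(groups) > 1:
            split[failed] = groups
    return split

if __name__ == "__main__":
    print("backbone routers:", sorted({r for l in AREA0_LINKS for r in l}))
    print("single failures that split area 0:", backbone_splits(AREA0_LINKS))
```

For this link list the result is an empty dict, meaning no single router failure splits the ten-router backbone. It says nothing about double failures or about what happens inside the non-backbone areas.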

You will need to tune the timers to get your recovery time within spec. This could be a challenge if the network expands by an order of magnitude, as you need to not only speed up the hello/dead times, but also the database update and propagate timers. If the previous sentence is not "preaching to the choir" and this really is a life/safety network, get some help from a competent consultant before you make any commitments.

Also, don't forget the monitoring and management side of the network. The topology cannot tolerate multiple failures, so any time a link or site fails, you're playing "beat the clock" to get it fixed before the next link or site fails. Note that no matter how good a job you do, you're looking at a finite probability of disconnects of working systems. A routing protocol is not a substitute for disaster recovery planning, nor does it take a disaster to be a disaster from the networking viewpoint.

Good luck and have fun. And keep your liability insurance up to date.

- H
- Hansang Bae
  
posted
19 years ago

Thu, Mar 24, 2005 6:35 PM

Surely you jest. This is *THE* scenario where I would stay away from EIGRP.

- J
- Jason
  
posted
19 years ago

Fri, Mar 25, 2005 1:52 AM

Well, it won't be EIGRP because the ultimate production network will use ruggedized routers, not from Cisco (a whole 'nuther ball of string, as they say). Going w/ OSPF so far. Both you and HSB have said that this is a "classic worst-case" for EIGRP (or similar words...) Why?

This is the complete network that is of concern. There are some other parts that feed this one, but they are trivial at this point (ISDN/PRI access from an RF network controller at site 1 and site 34, for example). I had planned that Area 0 be the four core routers (2 at site 1 and two at site 34) and all their interfaces, as well as the interface on the router at the other end of those circuits. Then as you say, the other areas become obvious. However, I have concerns that if/when a failure occurs in such a way as to make an area disconnected from area 0 (say, lose the circuits from R2 and R8 to area 0), then there is a path via area 2, but there is no longer a link from area 1 to area 0. So, virtual link, you say? ahh, but how to make that dynamic, so it comes into play _only_ under certain failure scenarios? I haven't figured that out yet.
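The detached-area concern can be spelled out with a small Python sketch. It assumes the area plan described in this thread (area 0 is the three core links plus the ten E-1 uplinks, and each chain segment is its own area with its two end routers as ABRs), and it only asks whether an area still has at least one ABR with a live area 0 link. It does not model virtual links or check that area 0 itself is still in one piece. `detached_areas` is a name made up for this illustration.

```python
AREA0_LINKS = [("R1a", "R1b"), ("R1b", "R34a"), ("R34a", "R34b"),
               ("R2", "R1a"), ("R8", "R1a"), ("R8", "R34a"),
               ("R14", "R1b"), ("R14", "R34b"), ("R21", "R1a"),
               ("R21", "R34a"), ("R29", "R1b"), ("R29", "R34b"),
               ("R33", "R34b")]

# Assumed plan: each chain segment is one area, bounded by two ABRs.
AREA_ABRS = {1: ("R2", "R8"), 2: ("R8", "R14"), 3: ("R14", "R21"),
             4: ("R21", "R29"), 5: ("R29", "R33")}

def detached_areas(dead_routers=(), dead_links=()):
    """Return the areas left with no ABR that still has a live area 0 link."""
    dead_routers = set(dead_routers)
    dead = {frozenset(link) for link in dead_links}
    live = [link for link in AREA0_LINKS
            if not dead_routers & set(link) and frozenset(link) not in dead]
    attached = {router for link in live for router in link}
    return [area for area, abrs in AREA_ABRS.items()
            if not attached & set(abrs)]

if __name__ == "__main__":
    # Lose every circuit from R2 and R8 into area 0.
    print(detached_areas(dead_links=[("R2", "R1a"), ("R8", "R1a"),
                                     ("R8", "R34a")]))
    # Site 1 dead and the R8-R34a circuit down as well.
    print(detached_areas(dead_routers=["R1a", "R1b"],
                         dead_links=[("R8", "R34a")]))
```

Both scenarios report area 1 as detached, even though R8 still has a physical path onward through R9 to R14. That gap between physical reachability and inter-area reachability is exactly where a virtual link, or the single-area option, would come in.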

So, my three options right now are: A) everything in Area 0, B) split up into multiple areas and deal with virtual links and redundant/fallback paths, or C) come up with a new topology.

Well, the last statement first: I _am_ the competent consultant. Unfortunately, I was brought into this design much more than 3/4 into the project. My customer has already sold the design to their customer.

Yes, had a long conversation today with my customer and his management. The one thing that could get the end customer to come up with more E-1s is the inability to get this network to converge/recover quickly enough in the event of an outage. Another earlier reply alluded to the quality of the E1s as they are being provided by the customer off their own SDH network. We discussed how a dynamic routing protocol works, how changes are not immediately transmitted to everyone, how things have to expire and propagate, etc. How one router tells another, who has to do things before telling another, etc. I also brought up that changing the hello timers and hold-down timers and garbage-collect timers, etc. can be done, but finding a fine balance between convergence time and overwhelming the network with OSPF overhead could be touchy. The life/safety part comes in because the ultimate device receiving information from this network is self-powered, and very heavy, on very fixed pathways, and very long, and very hard to stop in a short distance. (did that give you enough hints without violating the NDA? :) )

I brought this up again today as well. The documents I'm working from have no mention of management, security, performance monitoring, "nuth'n". I made some recommendations. Will have to see where they go. My customer has thought about a couple of these things, but I don't think they gave it much more than, "oh, the equipment vendor has software for that...." All I can do is make the recommendations that they examine the issues. Your other points about being a substitute are well understood (by me, at least, I'm trying to get that thru to my customer...)

Yeah, I hear that about the insurance. The good thing is that the ultimate implementation is in another hemisphere, and I doubt that that country much cares about some round-eye's legal liability - they tend to execute their own people rather easily. :=<

Thanks for the comments, I'm actually hoping I can get this contract over quickly, because it is such a nightmare. Lesson to self: look closer at the scenarios during the interview process.....

Jason.

- W
- Walter Roberson
  
posted
19 years ago

Fri, Mar 25, 2005 5:11 AM

In article , Jason wrote:

: So, virtual link, you say? ahh, but how to make that dynamic, so it
: comes into play _only_ under certain failure scenarios? I haven't figured that
: out yet.

Read Vincent's book ;-)

I'm bogged down a bit at the moment on the chapter "50 Ways to Lose Your LAN (and Survive)"