Recently, I had it put to me that LANs (and firewalls) should be 100% reliable (barring major equipment failure) -- that networks & security should be about as reliable as the electrical mains (i.e., something that can be taken for granted nearly all the time, with repairs taking only a few minutes).
I was informed that "millions of businesses every day" have that kind of LAN reliability.
Is that level of reliability the norm in real SMBs, with 500-ish hosts, multiple subnets, and a mandatory deny-by-default firewall policy?
Which is the truer picture in a growing organization with fluid network access requirements: that the network & security person has barely anything to do because they set up the equipment "right" the first time? Or that keeping up with the network & security changes and failures and planning is more than a full-time job that can involve many a late night (or marathon repair session)?
How much truth is there, in real organizations, to those old cartoons of a skeleton with cobwebs in front of a computer terminal, with the caption "The network's down again." ?
It seems to me that more than once I've been in a major bank and been told "The network's down", and no-one, staff or customer, seemed surprised. I also seem to recall hearing a number of casual conversations along the lines of "Oh yeah, the network went down again at work today"... and I don't recall hearing anyone reply "Our network never goes down"... not for anything short of a Service Provider.
Lastly: has anyone observed a network "freak out", with a series of normally reliable devices getting confused and staying confused all through hours of standard problem isolation procedures, with no discernible reason for the multiple failures -- and for the devices to eventually settle down, and start working properly with configurations that didn't work before?
And the number of "pros" who aren't is distressing. One outfit that contracted to do some work for one of my clients turned out not to consider a post-installation certification scan to be part of the process. I finally gave up arguing with them and scanned it myself. They also made a big deal about being affiliated with Lucent. When asked to deliver the paperwork for the Lucent warranty, it turned out that Lucent had never heard of them. The distressing thing about this bunch was that they were teaching network installation all over the state at the state technical colleges. Unfortunately my client was not litigious.
The answer to this question is specific to each company and can only be determined as the result of a Business Risk Analysis and guidance from the company's senior management. In some cases government regulations about downtime apply. The analysis is frequently stated in lost revenue and probabilities. It doesn't make sense to spend money on technical fixes to some risks. In some cases a "loss of business" insurance contract will be an appropriate way of addressing a risk.
The result of the risk analysis will tell the technical people what the critical issues are for the company operations and customers. It should also result in funding to meet the requirements.
There are many aspects to "non-stop/can't fail" operation, and the definition of non-stop is refined to the nature of your business. My experience is with a Very Big Bank with retail ops in the New York area, with about 400 branches and a few thousand ATMs. We had several rules:
1. A bank ATM transaction, once acknowledged, can't be lost.
2. If one ATM is down, another one nearby should be operational.
3. A worst-case disaster in the main data center should not lose any data, and should result in no more than 4 hours' outage for lines of business other than the branch banking system (which is covered by rules #1 and #2).
This is a drastic simplification of banking in the '80s. Rules 1 and 2 were addressed by using Tandem NonStop (tm) minicomputers to control clusters of ATMs. (PC weenies today don't know how rock solid Tandem and VAX/VMS computers are, and were as far back as 1980.)
Rule 2 was further addressed by having 5 regional data centers, each with two mainframes (one a backup), which controlled geographical areas of ATMs. If one of the data centers burned up, the public would be directed to drive a few miles to an operational area. This was deemed by management an acceptable risk/cost tradeoff.
Rule 3 was addressed by a duplicate of the main data center in another state; if the main DC burned down, the whole branch system and the ATMs would still operate unassisted for about a day. We had 4 hours (per banking regs) to get the backup data center up and running. We did that on a regular basis. Data loss was prevented by not giving the customer his acknowledgement until the databases at the main and backup data centers had acknowledged the update and were in sync. Nobody said high reliability was cheap.
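The "no acknowledgement until both sites are in sync" rule can be sketched in a few lines. This is a minimal illustration, not the bank's actual system; the `Datastore` class and function names here are hypothetical stand-ins:

```python
# Sketch of the "don't acknowledge until both sites have the update" rule.
# All names here are hypothetical illustrations, not a real banking system.

class Datastore:
    """A trivial stand-in for one data center's database."""
    def __init__(self):
        self.records = {}
        self.available = True

    def write(self, txn_id, data):
        if not self.available:
            raise IOError("data center unreachable")
        self.records[txn_id] = data


def commit_transaction(txn_id, data, primary, backup):
    """Acknowledge the customer only after BOTH sites have the update.

    If either write fails, an exception propagates and no acknowledgement
    is sent -- so an acknowledged transaction can never be lost to a
    single-site disaster.
    """
    primary.write(txn_id, data)   # raises on failure -> no ack sent
    backup.write(txn_id, data)    # raises on failure -> no ack sent
    return "ACK"                  # safe: the data now exists at two sites


main_dc, backup_dc = Datastore(), Datastore()
print(commit_transaction("txn-1", {"amount": 100}, main_dc, backup_dc))  # ACK
```

The cost is visible even in the sketch: every customer transaction now waits on a round trip to a second site, which is part of why nobody said high reliability was cheap.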
If the network in a branch was dead it was equivalent to the Utility Power company having a bad day on that street. A "branch closed" sign would be hung on the door with directions to the nearest operating branch. Scope of failure is a big part of risk analysis and technical failures are just one reason of many for a point outage and things have to be kept in perspective.
9/11 and the recent Northeast Blackout have made contingency planning experts update their guidelines for critical business operations. I have a copy of it somewhere. It was summarized in Sysadmin Magazine, Nov 2004 issue.
It (a) wasn't a business-wide outage, (b) they had manual procedures for essential tasks, and (c) people could go to the next branch.
What's a "network"?
This paragraph is too vague to address. For starters, it depends on how big and complex your network is, and on what tools you have to measure and analyse it. If you have no management tools then there could certainly be scenarios as bad as you describe. IME a rogue DHCP server on a laptop can bring down a network and be very hard to find without tools. I've seen an intermittent trojan on one PC spewing data to the public Internet bring down a company in the damndest way, because it saturated the uplink bandwidth and appeared to be a flaky ISP link until we understood what was going on. That site had NO managed hubs (against my recommendation). These would have allowed me to identify and fix the problem in minutes instead of a _very_ long weekend.
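With even minimal tooling, the rogue-DHCP-server case is easy to flag: collect the server addresses seen in DHCP offers (from a packet capture or switch logs; that capture side is elided here) and compare them against the list of servers that should be answering. A hypothetical sketch:

```python
def find_rogue_dhcp_servers(observed_servers, authorized_servers):
    """Return DHCP server addresses that are not on the authorized list.

    observed_servers: iterable of server IPs seen in DHCPOFFER packets
    (e.g. gathered by a capture on UDP ports 67/68 -- not shown here).
    authorized_servers: the IPs of the servers that SHOULD be answering.
    """
    return sorted(set(observed_servers) - set(authorized_servers))


# Example: one laptop quietly running an unauthorized DHCP service.
seen = ["10.0.0.1", "10.0.0.1", "10.0.3.77"]
print(find_rogue_dhcp_servers(seen, ["10.0.0.1"]))  # ['10.0.3.77']
```

The comparison itself is trivial; the point of the anecdote is that without managed gear you never get the observation data in the first place.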
Ethernet infrastructure (cable and patch panels) is designed to be very reliable, and once a CAT5 drop is shown to be working it's the last thing to suspect when head scratching is going on. Modern CAT5 wiring also fits the risk control principle in that there is no network-wide failure mode. One cable _might_ go bad, but there is no way, short of the cat pissing on a punchdown block, for multiple drops to go bad at once. If your CAT5 infrastructure is unreliable it's because it was done by an incompetent installer, and you should budget to get a pro in to make a recommendation.
Google for "business contingency planning" and "risk analysis" and you'll get lots of hits.
Agreed. Or, for a business branch site, it means that the leased line to the corporate network is down. A competent company will have contingency plans for this. It may be non-technical, like manual procedures.
The physical network infrastructure _SHOULD_ be extremely reliable. However, what it should be and what it is are two different things. The most common failure is a broken patch cable at the user end, but regardless of cause, if manual troubleshooting is required, MTTR can be abysmally long. FWIW, I've had more trouble with power than with networking. Here in the North East US, power cannot be taken for granted...
Aha, here is where you are being misled... When a user says "the network is down" 99.9% of the time, the network is still up and it is the application or server that they are using which has died. Think of how many "network" failures are cured by rebooting the PC, then tell me how that action can impact the cabling in the wall, hubs, routers, etc.
WAN links have significant failure rates, but that is why redundancy and backup links are used. In the case you cite, it is far more likely to be a software problem at the application/database level than a network infrastructure problem.
Yes, but there is almost always an explanation if you dig deep enough into the problem. On the other hand, determining root cause can be time and resource consuming, and most businesses are more interested in ending the current problem than they are with preventing it from happening again.
Warning: I have been called a "Network Management Bigot" for requesting all sorts of monitoring. However, my experience has been that if you look closely enough at how the network is ACTUALLY running, you will often spot problems before they are manifested as service outages. Examples range from marginal links which are reporting only brief intermittent hiccups on their way to total failure, to routing tables which indicate that the routes in use are not the routes you designed with the high probability that when something fails, the network will roll over and die rather than select an alternate route.
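The "spot problems before they become outages" approach described above can be as simple as trending interface error counters between polls and flagging links whose errors are creeping up. A hypothetical sketch (the interface names and the SNMP polling that would feed it are illustrative, not from any particular system):

```python
def flag_marginal_links(prev_errors, curr_errors, threshold=10):
    """Flag interfaces whose error counters grew by more than `threshold`
    since the last poll -- brief intermittent hiccups of this kind are
    often the early sign of a link on its way to total failure.

    prev_errors / curr_errors: dicts mapping interface name -> cumulative
    error count (as you might poll via SNMP ifInErrors).
    """
    flagged = []
    for ifname, curr in curr_errors.items():
        delta = curr - prev_errors.get(ifname, 0)
        if delta > threshold:
            flagged.append((ifname, delta))
    return flagged


# Hypothetical counters from two daily polls:
yesterday = {"Gi0/1": 12, "Gi0/2": 0}
today = {"Gi0/1": 340, "Gi0/2": 3}
print(flag_marginal_links(yesterday, today))  # [('Gi0/1', 328)]
```

Gi0/1 is still passing traffic, so no user has complained yet; the trend is the warning.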
Been there, done that, been burnt :-) But it's been years since I've had to worry about a network problem that couldn't wait until morning to get fixed.
Whoa. The right approach is defense in depth. Horror story about a "fully open network". Traffic was grinding to a halt on a particular network operated by a university. Seems that there were several connections to different Internet service providers. The network, including microwave links etc, spanned most of the distance between New York and Boston, and the ISP connections were T1 or faster (this was back when a 10 mb/sec network was still hot stuff), so guess what most of the traffic on the "fully open" network was.
99% uptime is piss poor. That's one minute of outage every hour and a half or so. Any server on which that happens is broken.
Further, having a server out may not affect system reliability at all. With that many servers I would hope that you have some redundancy implemented.
Maybe the norm where you are. Perhaps you need to look at why your system is so unreliable. And if you're focussed on "Windows" and think that eliminating Windows would solve the problem then you're not really looking at the problem.
If you're running XP and your desktop machines in a place of business are "out of action" "fairly frequently" you need to find out why and fix it.
There are lots of parts in a "network" (your word).
There are Data Center clusters that have been providing literally uninterrupted service for years, and there may be a Tandem system that has been running for a decade with no downtime. These systems can fix hardware and software on the fly. The limiting factors to uptime can be company mergers and relocations and fuel for the generators.
Today the technology is highly distributed web servers based on BEA WebLogic Server and IBM WebSphere, running on many servers at multiple locations.
The current phrase is "carrier grade" (Telephone industry terminology) for computer systems that deliver "5 nines" uptime (99.999%) and the ability to swap hardware and do software upgrades without service disruption. That's still 5 minutes/year.
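The arithmetic behind the "nines" is simple enough to check directly; a short sketch of the conversion from availability percentage to allowed downtime:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct):
    """Allowed downtime per year at a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100.0)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime -> {downtime_minutes_per_year(pct):.1f} min/yr")
```

At "5 nines" (99.999%) this works out to about 5.3 minutes per year, matching the figure above; a mere 99% allows over 87 hours per year.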
Policy said that we did a genuine drill every 6 months unless events in the Real World caused us to use the backup. In the late '80s in Manhattan there were enough little disasters that we switched data centers on a regular basis and rarely had to do full fire drills. After every event there was a post-mortem analysis to see what didn't work and what we could have done better. In an operation this complex some (hopefully) little thing doesn't go as expected. We switched to the backup site whenever it made sense. It was a straightforward operation. We had huge ring binders with contingency plans for different scenarios.
I'm out of this now but I understand that this newfangled thing called the Internet and the experience of 9/11 shows that the hot/standby pair strategy is weak and both sites need to be working in production capacity in parallel to be able to say to your Chairman that you're as ready as you can be for the next disaster.
My working scenario when I had to explain disaster scenario planning was that the Vogons would lift our main operations building (or our backup site) off the planet, with data and staff, instantly with no notice, and we needed to continue to meet business obligations when that happened. Once you've planned for this, every other scenario is covered; if you try to enumerate all the possible little disasters and plan for them individually, you're going to miss something and get bit by reality someday.
Business Contingency Planning is a recognized job description.
Some good points here, and banks may be different from, say, booking and taking an airline flight, in that (a) people's expectations of customer service are already grim and very low, and (b) online banking and ATMs have meant that there are fewer "gotta get to the bank by 3PM" events.
If you book a flight, show up and find they don't have you in the computer or have been overbooked you're going to be _much_ madder than the bank scenario. Stuck in traffic is similar.
Windows 95 taught people to be tolerant of computer problems at work.
For Windows on desktops in business it isn't so much the MTBF as the MTTR that makes a happy shop. With all the user data and profile on the server, we just drop in a fresh pre-imaged box when a user has a problem, hardware or software. The sick box goes on the bench for a hardware repair or reimage and reuse.
IMO MS servers running mainstream Windows applications can be very good if your expectations of scale and complexity are reasonable.
You've received some good answers, and I'll add my 3 cents (CDN).
As many have said -- two different questions here. Networks are generally very reliable. The occasional fried port or hung router.
Security is a _whole_ 'nother thing. Ideally, the network should be fully open, and the devices (computers/servers) secure. No network security required. With Microsoft products so insecure, the network is called to help provide security by closing down. This is a PITA, and doesn't stop trojans which the network is falsely blamed for.
They do. Mostly small businesses with a few printers and a file server. The problem is that MS scales horribly.
No. If single server uptime is 99.0% from random causes, and you have 10 servers, only 90% of the time do you have all 10 servers.
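The point can be checked directly: assuming independent failures (a simplification, but it makes the point), the probability that all N servers are up at once is the product of the individual availabilities:

```python
def all_up_probability(availability, n_servers):
    """Probability that ALL n servers are up simultaneously,
    assuming each is up with the given availability and failures
    are independent (a simplifying assumption)."""
    return availability ** n_servers

# Ten servers, each with 99.0% availability:
print(f"{all_up_probability(0.99, 10):.3f}")  # 0.904
```

So even with each box hitting 99%, the full fleet of ten is only all-up about 90% of the time, which is why redundancy (so that no single server being down matters) beats chasing per-box uptime.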
This is the norm. People are loaded until they break.
The network hardware at my home & work almost never goes down. I can almost always access Unix & other Linux-like hosts. However, fairly frequently MS Windows desktops are out of action.
There's a reason for that--it's faster than troubleshooting in most cases. The major criticism of that approach is that the user loses data and settings. On a corporate LAN neither of those should be the case.
I can generally get a Windows problem fixed or determine that it's a bug that requires source access to fix and in that case come up with a workaround, but I find that that's really _practical_ only for my home system, where chasing the bug is recreation, and not for any situation in which my time has dollar value. Cheaper to just reinstall or restore the image.
By the way, you'll find Novell Zenworks a very useful tool for Windows troubleshooting.
Rarely does an ENTIRE network go down, right? It happens, but that is pretty darn rare. PARTS of it usually go down from time to time. When that "part" happens to be your gig Ethernet or ATM core for a particular building, such as your corporate headquarters, it can be rather dramatic, though.
The same could be said for the electrical mains. In fact, power here is problematic enough that we have diesel-powered backup generators at many of our buildings. And we are in the middle of a city.
In all fairness, the electrical systems do not have the complexity, and the constant need for changes, that most data networks have. Even voice networks are quite simple by comparison.
For localized network outages, there are of course typically off-site backup hosts, etc., that are utilized in such events. But the honchos upstairs in the building that is down don't see that everything is fine for customers. Web servers are up for the outside world to see, customer transactions are being processed, but the CEO of the company can't get to his favorite blog or check his in-house email. To him, "the network is down".
I'd counter with "75% of all statistics are made up on the spot" Seriously, where did this person get that statistic? My experience certainly does not bear that out.
The smaller the org, the smaller the network, the more reliable it is IME. The less complex, the more reliable IME. I'm sure you'd consider that common sense.
That said, what you describe above fits my network pretty well. I manage it pretty aggressively/proactively -- read the syslog server logs every day looking for issues, etc. I DO have outages from time to time. Most often these are down WAN links, however that is the fault of the local telco. Things like that happen when you run DS1 circuits over copper pairs that were put in place over 100 years ago (really -- no kidding -- many of the pairs in this city ARE that old!). So I'd have to say the core of my network is actually more reliable than the local PSTN and the power mains here.
However, given what the users know and experience, "reliability" leaves room for interpretation. For the average end user, having an email message dropped because it came from a blacklisted server might be an "unreliable network" in their mind. Execs telecommuting from home, using a cable modem on a congested, oversubscribed node that drops packets and thus kills their Citrix MetaFrame sessions, have in my experience blamed our network. Try explaining to the user that "yes, I understand you have no problems going to any websites from your home internet connection. However, the problem IS on your end, not back here at the office."
LOL. Even if set up "right the first time", it won't remain so... see below.
Full time job. The problem is the changes. New sites and office open up, old ones close. Topologies change. Access to new apps over the internet (designed by folks who consider ease of integration into your environment lastly or not at all), etc, etc. Reliability is much easier when the goal is not a moving target.
When I worked for a large bank, that I won't mention by name, rather than CHASE down our folks during big network changes, we rented hotel rooms in MANHATTAN for weeks at a time, and would send our folks over to the hotel for a few hours of sleep once in a while. I once witnessed my boss staying at headquarters for over 72 hours without once leaving the building. Sometimes it crosses over from being a mere "full time job" to being a "way of life". I started to know I had a problem when I started dreaming at night about PIX over IP tunnels.
Depends. I never quite got that one - is the skeleton supposed to be the user or the admin?
Which could mean anything. Most likely it means the leased line from that branch back to the main office is down. Hardly the same in my mind as the network being down.
There has long been a "blame the computer" component to our culture - it's a common scapegoat. The network has been added into that. Folks WANT to be able to have something to blame, real or not. 500 years ago it was "the devil". Today it's "the network"
Sure. I think anyone who has helped manage a network of any size has seen that at least once. Never fails that things settle down right when the hour you can start intrusive testing pops up, too.
Keeping configs as simple as possible tends to minimize this IME. Never seen it happen on, say, a network that had no vlans, all routing was done via static routes, and no multicast stuff was used, etc.
If you have a moment, I'd appreciate an expansion on that "regular basis". I'm not quite sure whether you are saying that:
1) it was not uncommon to need to fall back to the backup data center in response to some trouble issue; or
2) you regularly tested the fallback procedures ("fire drills"); or
3) because of issues like scheduled maintenance, backups, and the like, that it was not uncommon to activate the duplicate center as a routine business continuity mechanism; or
4) on the relatively few occasions when it was necessary to fallback, that you were repeatably able to do so comfortably within the four-hour window ?
Or to put things another way, are you saying that even with all the reliability planning that the backup data centre had to be kicked up in response to a problem, or are you saying that failovers were no big thing on the occasions they were needed?
I think you are pointing out here that the observed failures fit within the parameters of a well-planned business risk model.
My mention of banks was only partially contextual. I would have predicted that for banks (and other major businesses) most customers would expect and tolerate near-zero failure. But that's not what I actually observe in practice: instead I observe that people sort of sigh a bit, but don't start raving about "Why can't you people keep your computers up?!?" If the lineups move noticeably more slowly than the customers are accustomed to, some of them get frustrated at the extra time -- but I don't hear them getting frustrated at the "incompetence" of the bank's systems.
Thus, what I seem to observe is that most people appear to be "socialized" to think systems/network problems are a fact of life, an inconvenience but something to be expected, like the way a traffic accident can slow down a highway. I have heard the occasional complaint ("I tried to pay my bills but I couldn't because the bank computers were down") -- but I hear more people complain (and more bitterly) about the buses being late or about traffic jams -- or about the power having failed and their having to go around and reset all their VCR clocks.
And if people have become socialized to systems/network problems then that suggests that network/server problems are "normal" in many businesses -- as opposed to the mental model that networks/systems are rarely a problem most places and any operation which falls short of that has probably been designed or managed incorrectly.
:If you're running XP and your desktop machines in a place of business are :"out of action" "fairly frequently" you need to find out why and fix it.
I've been isolated for some years [this city is blooming nicely in biotechnology, but the nearest "high tech city" is ~900 miles away].
Perhaps I don't get around as much as I should... but as best I recall, I don't think I've ever met anyone who was actually skilled in configuring and debugging and repairing MS Windows. I've met a number of good unix/linux hackers, who could repair just about any software problem -- but with MS Windows, having a good clue about the Registry has been about the upper limit, after which the standard problem resolution stream seems to be "Reinstall the application. Reinstall Windows. Re-Ghost from a known-good system."
I'm certainly not trying to provoke a Unix vs Windows war here: I'm asking more: Has my sample been biased? Is there a good representation in IT of people who can -fix- MS Windows problems beyond "Search the Knowledgebase and check out the registry, and if you don't find the answer, then re-install?" And I certainly don't mean to cast stones at MS Windows specialists with this question: I'm asking seriously whether MS Windows gurus are uncommon or if I've just not noticed them.