Re: CenturyLink's 911 Outage + One Bad Network Card? [telecom]

On 1/7/2019 4:23 PM, HAncock4 wrote: ...

> When the Bell System introduced automation 100 years ago,
> reliability was a key issue. Dial equipment included a full
> set of testing facilities and alarms.
>
> When ESS was introduced 50 years ago, again reliability was
> an issue. The CPU was duplicated and the backup CPU was always
> ready in case the first failed. Further, a good deal of the
> software that controlled the ESS contained testing and diagnostic
> instructions so that circuit failures could be quickly identified,
> isolated, and repaired.

Every piece of major network equipment has redundancy built in, and networks themselves have redundancy in their backbone routes. Even a company as hapless as CTL knows that. Singing praises of Old Ma Bell and her primitive 1ESS does nothing to advance the art.

> Failures happen. They always will. The question for CenturyLink
> isn't that something failed, but rather why did a failure propagate
> through its network and why did it take so long to be identified
> and resolved.

Any system with redundancy has to have some mechanism that invokes it, a switchover mechanism of some sort. Even the 1ESS had to know when to switch. In a network with redundant routes, there needs to be some mechanism to determine what route to use, based on knowledge of which links are working and which aren't.
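To make that concrete, here is a minimal sketch in Python of route selection driven by link state. The `Route` class, the link names, and the `link_up` table are all made up for illustration; real equipment does this in its control plane with far more state.

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str
    links: tuple  # names of the links this route traverses (hypothetical)


def select_route(routes, link_up):
    """Return the first route, in priority order, whose links are all reported up.

    `link_up` maps link name -> bool. A link missing from the table is
    treated as down. Returns None when no route is usable.
    """
    for route in routes:
        if all(link_up.get(link, False) for link in route.links):
            return route
    return None


# Hypothetical topology: the primary path has lost one link.
routes = [
    Route("primary", ("A-B", "B-C")),
    Route("secondary", ("A-D", "D-C")),
]
link_up = {"A-B": True, "B-C": False, "A-D": True, "D-C": True}

chosen = select_route(routes, link_up)
print(chosen.name if chosen else "no usable route")
```

Note that the choice is only as good as the `link_up` table. If whatever maintains that table is itself disrupted, having a second route does not help.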

What failed at CTL was the mechanism for implementing that redundancy. A control card in an optical multiplexor in Denver seems to have sent out "packets of death", malformed packets on a management channel. Apparently a bug in the system's code meant that these were not discarded upon receipt but propagated instead, so they spread across the network. And they went out on the "secondary" (redundant) paths too.
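The defensive behavior that was apparently missing is "validate before you relay". A rough Python sketch follows, assuming a made-up management frame layout (type, hop limit, payload length, payload, CRC-32). This is not the vendor's actual format or code, and every name in it is hypothetical.

```python
import struct
import zlib

HEADER = struct.Struct("!BBH")  # hypothetical layout: type, hop limit, payload length
CRC = struct.Struct("!I")       # CRC-32 trailer over header + payload
KNOWN_TYPES = {1, 2}            # hypothetical set of valid message types


def build_frame(msg_type, hops, payload):
    """Assemble a frame in the made-up format above."""
    body = HEADER.pack(msg_type, hops, len(payload)) + payload
    return body + CRC.pack(zlib.crc32(body))


def is_well_formed(frame):
    """Check length, checksum, message type, hop limit, and declared payload size."""
    if len(frame) < HEADER.size + CRC.size:
        return False
    body = frame[:-CRC.size]
    (crc,) = CRC.unpack(frame[-CRC.size:])
    if zlib.crc32(body) != crc:
        return False
    msg_type, hops, length = HEADER.unpack(body[:HEADER.size])
    return (
        msg_type in KNOWN_TYPES
        and hops > 0
        and length == len(body) - HEADER.size
    )


def forward_targets(frame, arrived_on, neighbors):
    """Return the neighbors to relay to; an empty list means drop the frame here."""
    if not is_well_formed(frame):
        return []
    return [n for n in neighbors if n != arrived_on]
```

With this check in place, a node that receives a corrupted frame returns an empty list and the frame goes no further. Without it, every node relays the frame to all of its neighbors, redundant paths included.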

So we have a hardware failure (bad card) and a software failure (not discarding bad packets), and together they caused the mechanism for implementing redundancy (the control plane) to malfunction. I've seen similar things happen elsewhere. (I investigate E911 failures for state regulators.)

Fred Goldstein
