On 1/7/2019 4:23 PM, HAncock4 wrote: ...
Every piece of major network equipment has redundancy built in, and networks themselves have redundancy in their backbone routes. Even a company as hapless as CTL knows that. Singing praises of Old Ma Bell and her primitive 1ESS does nothing to advance the art.
Any system with redundancy has to have some mechanism that invokes it, a switchover mechanism of some sort. Even the 1ESS had to know when to switch. In a network with redundant routes, something has to decide which route to use, based on knowledge of which links are working and which aren't.
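To make the switchover idea concrete, here is a minimal sketch in Python. It is not how any real vendor's control plane works; the route list, link names, and the select_route function are all made up for illustration. It only shows that the choice of route depends entirely on what the selector believes about link state.

```python
from typing import Dict, List, Optional

def select_route(routes: List[List[str]],
                 link_up: Dict[str, bool]) -> Optional[List[str]]:
    """Return the first route (in preference order) whose links are all up.

    `routes` and `link_up` are hypothetical structures for this sketch.
    A link with no known status is treated as down.
    """
    for route in routes:
        if all(link_up.get(link, False) for link in route):
            return route
    return None  # no usable route: redundancy is exhausted

# Hypothetical primary and secondary paths between the same two endpoints.
routes = [
    ["A-B", "B-C"],  # primary
    ["A-D", "D-C"],  # secondary
]
link_up = {"A-B": True, "B-C": False, "A-D": True, "D-C": True}

print(select_route(routes, link_up))
```

With link B-C marked down, the selector falls back to the secondary path. If the link-state information itself is wrong or unavailable, the selector has nothing sound to work from, however many spare routes exist.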
What failed at CTL was the mechanism for implementing that redundancy. A control card in an optical multiplexor in Denver seems to have sent out "packets of death": malformed packets on a management channel. Apparently a bug in the system's code caused nodes to propagate these packets rather than discard them on receipt, so they spread across the network. And they went out on the "secondary" (redundant) paths too.
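A toy simulation shows why the missing discard step matters. This is only a sketch under assumptions of my own: the four-node topology, the packet fields, and the is_well_formed check are hypothetical and have nothing to do with the actual equipment or its management protocol.

```python
from collections import deque
from typing import Dict, List, Set

def is_well_formed(packet: dict) -> bool:
    """Hypothetical sanity check: payload is bytes and length field matches."""
    payload = packet.get("payload")
    return isinstance(payload, bytes) and packet.get("length") == len(payload)

def flood(links: Dict[str, List[str]], origin: str,
          packet: dict, validate: bool) -> Set[str]:
    """Return the set of nodes that accepted and re-sent the packet.

    Every node forwards on all of its links, primary and secondary alike.
    With validate=True a receiver discards a malformed packet on receipt.
    """
    forwarded = {origin}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor in forwarded:
                continue
            if validate and not is_well_formed(packet):
                continue  # discarded on receipt, never re-sent
            forwarded.add(neighbor)
            queue.append(neighbor)
    return forwarded

# Hypothetical ring: A reaches D by way of B (primary) or C (secondary).
links = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
bad_packet = {"length": 99, "payload": b"\x00"}  # length field is wrong

print(sorted(flood(links, "A", bad_packet, validate=False)))
print(sorted(flood(links, "A", bad_packet, validate=True)))
```

Without validation the bad packet is re-sent by every node in the ring, over both paths. With validation it stops at the first receivers and only the originating node is involved.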
So we have a hardware failure (bad card) and a software failure (not discarding bad packets), and together they caused the mechanism for implementing redundancy (the control plane) to malfunction. I've seen similar things happen elsewhere. (I investigate E911 failures for state regulators.)