Possible Centurylink Outage Report [telecom]

This appears to be an outage report from Centurylink, but I can't verify its authenticity. I had to substitute ASCII for some multi-byte characters.

Bill Horne

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

  • Event Conclusion Summary

    Outage Start: December 27, 2018 08:40 GMT
    Outage Stop: December 29, 2018 10:12 GMT

    Root Cause: A CenturyLink network management card in Denver, CO was propagating invalid frame packets across devices.

    Fix Action: To restore services the card in Denver was removed from the equipment, secondary communication channel tunnels between specific devices were removed across the network, and a polling filter was applied to adjust the way the packets were received in the equipment. As repair actions were underway, it became apparent that additional restoration steps were required for certain nodes, which included either line card resets or Field Operations dispatches for local equipment login. Once completed, all services restored.

    RFO Summary: On December 27, 2018 at 08:40 GMT, CenturyLink identified an initial service impact in New Orleans, LA. The NOC was engaged to investigate the cause, and Field Operations were dispatched for assistance onsite. Tier IV Equipment Vendor Support was engaged as it was determined that the issue was larger than a single site. During cooperative troubleshooting between the Equipment Vendor and CenturyLink, a decision was made to isolate a device in San Antonio, TX from the network as it seemed to be broadcasting traffic and consuming capacity. This action did alleviate impact; however, investigations remained ongoing. Focus shifted to additional sites where network teams were unable to remotely troubleshoot equipment. Field Operations were dispatched to sites in Kansas City, MO, Atlanta, GA, New Orleans, LA and Chicago, IL for onsite support. As visibility to equipment was regained, Tier IV Equipment Vendor Support evaluated the logs to further assist with isolation. Additionally, a polling filter was applied to the equipment in Kansas City, MO and New Orleans, LA to prevent any additional effects.

    All necessary troubleshooting teams, in cooperation with Tier IV Equipment Vendor Support, were working to restore remote visibility to the remaining sites. The issue had CenturyLink Executive level awareness for the duration. A plan was formed to remove secondary communication channels between select network devices until visibility could be restored, which was undertaken by the Tier IV Equipment Vendor Technical Support team in conjunction with CenturyLink Field Operations and NOC engineers. While that effort continued, investigations into the logs, including packet captures, were occurring in tandem, which ultimately identified a suspected card issue in Denver, CO. Field Operations were dispatched to remove the card. Once removed, it did not appear there had been significant improvement; however, the logs were further scrutinized by the Vendor's Advanced Support team and CenturyLink Network Operations to identify that the source packet did originate from this card. CenturyLink Tier III Technical Support shifted focus to the application of strategic polling filters along with the continued efforts to remove the secondary communication channels between select nodes. Services began incrementally restoring.

    An estimated restoral time of 09:00 GMT was provided; however, as repair efforts steadily progressed, additional steps were identified for certain nodes that impeded the restoration process. This included either line card resets or Field Operations dispatches for local equipment login. Various repair teams worked in tandem on these actions to ensure that services were restored in the most expeditious method available. By 2:30 GMT on December 29, it was confirmed that the impacted IP, Voice, and Ethernet Access services were once again operational. Point-to-point Transport Waves as well as Ethernet Private Lines were still experiencing issues as multiple Optical Carrier Groups (OCG) were still out of service.

    The Transport NOC continued to work with the Tier IV Equipment Vendor Support and CenturyLink Field Operations to replace additional line cards to resolve the OCG issues. Several cards had to be ordered from the nearest sparing depot. Once the remaining cards were replaced it was confirmed that all services except a very small set of circuits had restored, and the Transport NOC will continue to troubleshoot the remaining impacted services under a separate Network Event. Services were confirmed restored at 10:12 GMT. Please contact the Repair center to address any lingering service issues.

    Additional Information: Please note that as formal post incident investigations and analysis occur the details relayed here may evolve. Locating the management card in Denver, CO that was sending invalid frame packets across the network took significant analysis and packet captures to be identified as a source as it was not in an alarm status. The CenturyLink network continued to rebroadcast the invalid packets through the redundant (secondary) communication routes. CenturyLink will review troubleshooting steps to ensure that any areas of opportunity regarding potential for restoral acceleration are addressed. These invalid frame packets did not have a source, destination, or expiration and were cleared out of the network via the application of the polling filters and removal of the secondary communication paths between specific nodes. The management card has been sent to the equipment vendor where extensive forensic analysis will occur regarding the underlying cause and how the packets were introduced in this particular manner. The card has not been replaced and will not be until the vendor review is supplied. There is no increased network risk with leaving it unseated. At this time, there is no indication that there was maintenance work on the card, software, or adjacent equipment.

    The CenturyLink network is not at risk of reoccurrence due to the placement of the polling filters and the removal of the secondary communication routes between select nodes.

  • 2018-12-29 12:48:18 GMT - The Transport NOC continues to monitor the network to ensure impacted services have remained restored and stable. If additional issues are experienced, please contact the CenturyLink Repair Center. A final notification will be provided momentarily.

  • 2018-12-29 11:56:08 GMT - The Transport NOC advises Field Operations has replaced the impacted cards. The affected Optical Carrier Groups have stabilized, thus all service affecting alarms have cleared and impacted services have restored. The Transport NOC has identified and is aware of a smaller set of services that have not restored and will continue to investigate and resolve those services under an alternate Network Event. The Transport NOC and equipment vendor are continuing to monitor for network stability; if additional issues are experienced, please contact the CenturyLink Repair Center. A summary of the event will be provided momentarily.

  • 2018-12-29 10:48:39 GMT - The Transport NOC advises Field Operations has replaced the impacted cards and the replacement cards have booted up and are continuing to stabilize. The Transport NOC is monitoring to confirm impacted services have restored.

  • 2018-12-29 09:40:22 GMT - The Transport NOC advises Field Operations has received the line cards and, in cooperation with the equipment vendor, is commencing with replacements.

  • 2018-12-29 08:33:07 GMT - The Transport NOC has provided updated estimated time of arrivals for the replacement cards of 08:30 GMT and 09:00 GMT. Field Operations are on site and will replace the affected cards immediately upon receiving the replacement cards. The Transport NOC and Field Operations are continuing with troubleshooting efforts for the remaining impacted sites.

  • 2018-12-29 07:21:10 GMT - The Transport NOC reports continued repair progress as multiple Optical Channel Groups have restored. Replacement line cards have been ordered for impacted sites with an ETA of 07:45 GMT and 08:30 GMT. Troubleshooting efforts remain ongoing at the remaining impacted sites by Field Operations and an equipment vendor.

  • 2018-12-29 05:40:47 GMT - The Transport NOC has advised that additional Optical Carrier Groups have restored; however, collaborative troubleshooting continues at the necessary locations, as multiple out-of-service Optical Carrier Groups remain.

  • 2018-12-29 05:15:24 GMT - The Transport NOC has advised that additional Optical Carrier Groups have restored; however, collaborative troubleshooting continues at the necessary locations, as multiple out-of-service Optical Carrier Groups remain.

  • 2018-12-29 03:52:30 GMT - The Transport NOC advises that Field Operations personnel are at the final two sites and are currently troubleshooting with the assistance from the equipment vendor.

  • 2018-12-29 02:34:38 GMT - The Transport NOC has advised that multiple Optical Carrier Groups have been cleared either remotely or with the assistance of Field Operations once they dispatched to impacted sites. Additional Field Operations have been dispatched to clear the remaining Optical Carrier Groups that are still out of service and cannot be restored remotely.

  • 2018-12-29 01:25:17 GMT - The Transport NOC continues to work with the Equipment Vendor's Support Teams to investigate multiple Optical Carrier Groups that are still out of service impacting Point to Point Transport Waves as well as Ethernet Private Lines. Both CenturyLink and the Equipment Vendor's Field Operations teams have dispatched to the necessary sites to assist with isolation. Additional cards have been ordered and shipped to sites across the United States in an effort to restore the Optical Carrier Groups to complete full network restoral.

  • 2018-12-29 00:31:23 GMT - Field Operations in cooperation with the Engineering teams have repaired the span traversing the western United States through loop testing. Once the equipment was restored, additional capacity was in turn available to the span on the CenturyLink Network. IP, Voice, and Ethernet Access services are expected to have restored with the now available capacity. Point-to-Point Transport Waves as well as Ethernet Private Lines may still experience issues while the remainder of the final card issues are resolved. Lingering latency may be present, which is anticipated to subside as routing continues to normalize. If issues are still being experienced with your IP, Voice, and Ethernet Access services please contact the CenturyLink Repair Center.

  • 2018-12-28 23:02:29 GMT - As the Equipment Vendor and CenturyLink Engineering teams continue to work to clear the lingering card issues it has been confirmed that alarms continue to clear, and network capacity is being restored. Efforts will remain ongoing to continue to resolve any further issues identified.

  • 2018-12-28 21:42:05 GMT - The Transport NOC has confirmed that visibility has been restored to all nodes, allowing triage of the additional cards to be completed. Engineering continues to review the network to identify, review, and clear the remaining alarms and issues observed. Field Operations continue to remain on standby and dispatch to sites as necessary to assist with isolation and resolution.

  • 2018-12-28 20:31:40 GMT - Efforts to complete the line card resets remain ongoing, while additional support teams continue to triage chassis within a smaller set of nodes that did not have full visibility restored as well as additional line cards within the network. The highest level of Engineering support from both the Equipment Vendor as well as CenturyLink continue to diligently work to restore services.

  • 2018-12-28 19:27:05 GMT - CenturyLink Engineering in cooperation with the Equipment Vendor's Tier IV Support continue to systematically review the network alarms and triage line cards within the network to ensure remote resets or physical reseats on site can be completed.

  • 2018-12-28 18:23:33 GMT - The Transport NOC has confirmed that visibility has been restored to the majority of the network outside of a few remaining nodes that are in various states of recovery. Engineering has identified the line cards that will need to be reset and are working diligently to perform the necessary actions to bring all cards back online.

  • 2018-12-28 17:15:20 GMT - It has been confirmed that visibility has been restored to the majority of the nodes across the network. Field Operations have been dispatched to assist with recovering visibility to the few remaining nodes. Engineering is working to systematically review the network alarms on the other nodes and are then performing remote manual resets to individual cards that remain in alarm. Reinstate times for each card may vary significantly, as such an estimated completion time is not yet available. If cards do not automatically reinstate after remote resets complete, Field Operations are standing by to dispatch as needed. The Equipment Vendor's Tier IV team continues to assist with the resolution efforts.

  • 2018-12-28 13:35:00 GMT - Efforts by the Equipment Vendor and CenturyLink engineers to apply the filters and remove the secondary communication channels in the network continue. The previously provided ETR of 09:00 GMT remains.

  • 2018-12-28 13:27:30 GMT - The Equipment Vendor and CenturyLink engineers continue work to apply the filters and remove the secondary communication channels. Field Operations and Equipment Vendor dispatches to recover nodes locally remain underway. Services continue to restore in a steady manner as troubleshooting progresses following the recovery of nodes. CenturyLink NOC management remains in contact with the equipment vendor to obtain updates as restoration efforts continue.

  • 2018-12-28 11:04:24 GMT - CenturyLink continues to work with the Equipment Vendor to apply the filters and remove the secondary communication channels. Field Operations and Equipment Vendor dispatches to recover nodes locally remain underway. Client services continue to restore in a steady manner as troubleshooting progresses following the recovery of nodes.

  • 2018-12-28 10:05:18 GMT - CenturyLink NOC Management reports steady progression of node recovery and restoral of client services. In addition to the remote node recovery process, Field Operations continue to dispatch and assist the Equipment Vendor with local equipment login.

  • 2018-12-28 08:51:29 GMT - CenturyLink NOC Management has advised that repair efforts are steadily progressing, and services are incrementally restoring. The Equipment Vendor and CenturyLink engineers continue work to apply the filters and remove the secondary communication channels at this time. There have been additional restoration steps identified for certain nodes, which includes either line card resets or Field Operations dispatches for local equipment login, that have impeded the restoration process. Various repair teams are working in tandem on these actions to ensure that services are restored in the most expeditious method available. Restoration efforts are ongoing.

  • 2018-12-28 07:12:32 GMT - Efforts by the Equipment Vendor and CenturyLink engineers to apply the filters and remove the secondary communication channels in the network continue. Additional information on repair progress will be available from the Equipment Vendor by 07:30 GMT. Information will be relayed as soon as it is obtained.

  • 2018-12-28 06:00:01 GMT - Efforts by the Equipment Vendor and CenturyLink engineers to apply the filters and remove the secondary communication channels in the network continue. The previously provided ETR of 09:00 GMT remains.

  • 2018-12-28 04:58:44 GMT - CenturyLink engineers in conjunction with the Equipment Vendor's Tier IV Technical Support team have identified the elements causing the impact to customer services. Through the filters being applied and the removal of the secondary communication channels, it is anticipated services will be fully restored within four hours. We apologize for any inconvenience this caused our customers. Additional details regarding the underlying cause will be relayed as available.

  • 2018-12-28 04:09:31 GMT - The Equipment Vendor's Tier IV Technical Support team in conjunction with CenturyLink Tier III Technical Support continues to remotely work to remove the secondary communication channel tunnels across the network until full visibility can be restored, as well as applying the necessary polling filter to each of the reachable nodes.

  • 2018-12-28 02:53:38 GMT - The Transport NOC has confirmed that cooperative efforts remain ongoing to remove the secondary communication channel tunnel across the network until full visibility can be restored, as well as applying the necessary filter to each of the reachable nodes. It has been confirmed that both of these actions are being performed remotely, but an estimated time to complete the activities is not available at this time.

  • 2018-12-28 01:58:56 GMT - Once the card was removed in Denver, CO it was confirmed that there was no significant improvement. Additional packet captures and logs will be pulled from the device with the card removed to further isolate the root cause. The Equipment vendor continues to work with CenturyLink Field Operations at multiple sites to remove the secondary communication channel tunnel across the network until full visibility can be restored. The equipment vendor has identified a number of additional nodes that visibility has been restored to, and their engineers are currently working to apply the necessary filter to each of the reachable nodes.

  • 2018-12-28 00:59:04 GMT - Following the review of the logs and packet captures, the Equipment Vendor's Tier IV Support team has identified a suspected card issue in Denver, CO. Field Operations has arrived on site and is working in cooperation with the Equipment Vendor to remove the card.

  • 2018-12-27 23:57:16 GMT - The Equipment Vendor is currently reviewing the logs and packet captures from devices that have been completed, while logs and packet captures continue to be pulled from additional devices. The necessary teams continue to remove a secondary communication channel tunnel across the network until visibility can be restored. All technical teams continue to diligently work to review the information obtained in an effort to isolate the root cause.

  • 2018-12-27 22:52:43 GMT - Multiple teams continue work to pull additional logs and packet captures on devices that have had visibility restored, which will be scrutinized during root cause analysis. The Tier IV Equipment Vendor Technical Support team in conjunction with Field Operations are working to remove a secondary communication channel tunnel across the network until visibility can be restored. The Equipment Vendor Support team has dispatched their Field Operations team to the site in Chicago, IL and has been obtaining data directly from the equipment.

  • 2018-12-27 21:35:55 GMT - It has been advised that visibility has been restored to both the Chicago, IL and Atlanta, GA sites. Engineering and Tier IV Equipment Vendor Technical Support are currently working to obtain additional logs from devices across multiple sites including Chicago and Atlanta to further isolate the root cause.

  • 2018-12-27 21:01:26 GMT - On December 27, 2018 at 02:40 GMT, CenturyLink identified a service impact in New Orleans, LA. The NOC was engaged and investigating in order to isolate the cause. Field Operations were engaged and dispatched for additional investigations. Tier IV Equipment Vendor Support was later engaged. During cooperative troubleshooting a device in San Antonio, TX was isolated from the network as it was seeming to broadcast traffic consuming capacity, which seemed to alleviate some impact. Investigations remained ongoing. Following the isolation of the San Antonio, TX device troubleshooting efforts focused on additional sites that teams were remotely unable to troubleshoot. Field Operations were dispatched to sites in Kansas City, MO, Atlanta, GA, New Orleans, LA and Chicago, IL. Tier IV Equipment Vendor Support continued to investigate the equipment logs to further assist with isolation. Once visibility was restored to the site in Kansas City, MO and New Orleans, LA a filter was applied to the equipment to further alleviate the impact observed. All of the necessary troubleshooting teams in cooperation with Tier IV Equipment Vendor Support are working to restore remote visibility to the remaining sites at this time. Tier IV Equipment Vendor Technical Support continues to review equipment logs from the sites where visibility was previously restored. We understand how important these services are to our clients and the issue has been escalated to the highest levels within CenturyLink Service Assurance Leadership.
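
The summary above says the invalid frames had no source, destination, or expiration, and that the network kept rebroadcasting them over the redundant (secondary) routes until filters were applied and those routes were removed. The sketch below is only a toy model of that general behavior, not a description of the vendor's equipment or of CenturyLink's topology: the node names, the `flood` function, and the "forward on every link except the arrival link" rule are all assumptions made for illustration.

```python
def flood(links, origin, max_steps=50, filtered=frozenset()):
    """Toy model: count the copies of one frame in flight at each step.

    Assumptions (hypothetical, for illustration only):
    - every node re-sends a frame on all of its links except the arrival link
    - the frame carries no expiration counter and nodes keep no "seen" list
    - nodes named in `filtered` drop the frame instead of forwarding it
    """
    # Build an undirected adjacency map from the list of links.
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each in-flight copy is (node it is arriving at, node it came from).
    in_flight = [(n, origin) for n in neighbors.get(origin, ())]
    history = []
    for _ in range(max_steps):
        # Filtered nodes discard the frame on arrival.
        in_flight = [(node, prev) for node, prev in in_flight
                     if node not in filtered]
        history.append(len(in_flight))
        if not in_flight:
            break  # the frame has died out
        # Every surviving copy is forwarded on all other links.
        in_flight = [(nxt, node)
                     for node, prev in in_flight
                     for nxt in neighbors[node] if nxt != prev]
    return history


# Hypothetical four-node network: a primary chain plus one secondary link
# that closes the chain into a ring.
chain = [("A", "B"), ("B", "C"), ("C", "D")]
ring = chain + [("D", "A")]

print(flood(chain, "A"))                  # no redundant path
print(flood(ring, "A", max_steps=10))     # redundant path present
print(flood(ring, "A", filtered={"C"}))   # redundant path, filter at C
```

In this model the frame dies out on the chain once it reaches the far end, circulates indefinitely on the ring because nothing ever expires it, and stops again as soon as one node on the ring filters it. That is consistent with the two remedies the notice describes (removing secondary channels and applying filters), but the real failure mode may well have differed in detail.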


***** Moderator's Note *****

This notice doesn't mention 911. That's puzzling: there were outages of 911 service in many areas, although they are reported as being limited to cellular users.

The report implies that a fault occurred in several high-capacity MUXes, which IIRC wouldn't usually be used to carry 911 traffic. My experience was all in wireline, so I'll ask those of you who work in the mobile world if Centurylink is allowed to have mobile switches carry traffic across LATA boundaries.

Bill Horne Moderator

