Possible Centurylink Outage Report [telecom]


This appears to be an outage report from Centurylink, but I can't
verify its authenticity. I had to substitute ASCII for some
multi-byte characters.

Bill Horne

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


* Event Conclusion Summary

  Outage Start: December 27, 2018 08:40 GMT
  Outage Stop: December 29, 2018 10:12 GMT

  Root Cause: A CenturyLink network management card in Denver, CO was
  propagating invalid frame packets across devices.

  Fix Action: To restore services the card in
  Denver was removed from the equipment, secondary communication
  channel tunnels between specific devices were removed across the
  network, and a polling filter was applied to adjust the way the
  packets were received in the equipment. As repair actions were
  underway, it became apparent that additional restoration steps were
  required for certain nodes, which included either line card resets
  or Field Operations dispatches for local equipment login. Once
  completed, all services restored.

  RFO Summary: On December 27, 2018
  at 08:40 GMT, CenturyLink identified an initial service impact in
  New Orleans, LA. The NOC was engaged to investigate the cause, and
  Field Operations were dispatched for assistance onsite. Tier IV
  Equipment Vendor Support was engaged as it was determined that the
  issue was larger than a single site. During cooperative
  troubleshooting between the Equipment Vendor and CenturyLink, a
  decision was made to isolate a device in San Antonio, TX from the
  network as it seemed to be broadcasting traffic and consuming
  capacity. This action did alleviate impact; however, investigations
  remained ongoing. Focus shifted to additional sites where network
  teams were unable to remotely troubleshoot equipment. Field
  Operations were dispatched to sites in Kansas City, MO, Atlanta, GA,
  New Orleans, LA and Chicago, IL for onsite support. As visibility to
  equipment was regained, Tier IV Equipment Vendor Support evaluated
  the logs to further assist with isolation. Additionally, a polling
  filter was applied to the equipment in Kansas City, MO and New
  Orleans, LA to prevent any additional effects. All necessary
  troubleshooting teams, in cooperation with Tier IV Equipment Vendor
  Support, were working to restore remote visibility to the remaining
  sites. The issue had CenturyLink Executive level awareness for the
  duration. A plan was formed to remove secondary communication
  channels between select network devices until visibility could be
  restored, which was undertaken by the Tier IV Equipment Vendor
  Technical Support team in conjunction with CenturyLink Field
  Operations and NOC engineers. While that effort continued,
  investigations into the logs, including packet captures, was
  occurring in tandem, which ultimately identified a suspected card
  issue in Denver, CO. Field Operations were dispatched to remove the
  card. Once removed, it did not appear there had been significant
  improvement; however, the logs were further scrutinized by the
  Vendor's Advanced Support team and CenturyLink Network Operations to
  identify that the source packet did originate from this
  card. CenturyLink Tier III Technical Support shifted focus to the
  application of strategic polling filters along with the continued
  efforts to remove the secondary communication channels between
  select nodes. Services began incrementally restoring. An estimated
  restoral time of 09:00 GMT was provided; however, as repair efforts
  steadily progressed, additional steps were identified for certain
  nodes that impeded the restoration process. This included either
  line card resets or Field Operations dispatches for local equipment
  login. Various repair teams worked in tandem on these actions to
  ensure that services were restored in the most expeditious method
  available. By 2:30 GMT on December 29, it was confirmed that the
  impacted IP, Voice, and Ethernet Access services were once again
  operational. Point-to-point Transport Waves as well as Ethernet
  Private Lines were still experiencing issues as multiple Optical
  Carrier Groups (OCG) were still out of service. The Transport NOC
  continued to work with the Tier IV Equipment Vendor Support and
  CenturyLink Field Operations to replace additional line cards to
  resolve the OCG issues. Several cards had to be ordered from the
  nearest sparing depot. Once the remaining cards were replaced it was
  confirmed that all services except a very small set of circuits had
  restored, and the Transport NOC will continue to troubleshoot the
  remaining impacted services under a separate Network Event. Services
  were confirmed restored at 10:12 GMT. Please contact the Repair
  center to address any lingering service issues.

  Additional Information: Please note that as formal post incident
  investigations and analysis occur the details relayed here may
  evolve. Locating the
  management card in Denver, CO that was sending invalid frame packets
  across the network took significant analysis and packet captures to
  be identified as a source as it was not in an alarm status. The
  CenturyLink network continued to rebroadcast the invalid packets
  through the redundant (secondary) communication routes. CenturyLink
  will review troubleshooting steps to ensure that any areas of
  opportunity regarding potential for restoral acceleration are
  addressed. These invalid frame packets did not have a source,
  destination, or expiration and were cleared out of the network via
  the application of the polling filters and removal of the secondary
  communication paths between specific nodes. The management card has
  been sent to the equipment vendor where extensive forensic analysis
  will occur regarding the underlying cause, how the packets were
  introduced in this particular manner. The card has not been replaced
  and will not be until the vendor review is supplied. There is no
  increased network risk with leaving it unseated. At this time, there
  is no indication that there was maintenance work on the card,
  software, or adjacent equipment. The CenturyLink network is not at
  risk of reoccurrence due to the placement of the polling filters and
  the removal of the secondary communication routes between select
  nodes.


* 2018-12-29 12:48:18 GMT - The Transport NOC continues to monitor the
  network to ensure impacted services have remained restored and
  stable. If additional issues are experienced, please contact the
  CenturyLink Repair Center. A final notification will be provided
  momentarily.


* 2018-12-29 11:56:08 GMT - The Transport NOC advises Field Operations
  has replaced the impacted cards. The affected Optical Carrier
  Groups have stabilized, thus all service affecting alarms have
  cleared and impacted services have restored. The Transport NOC has
  identified and is aware of a smaller set of services that have not
  restored and will continue to investigate and resolve those services
  under an alternate Network Event. The Transport NOC and
  equipment vendor are continuing to monitor for network
  stability; if additional issues are experienced, please contact
  the CenturyLink Repair Center. A summary of the event will be
  provided momentarily.


* 2018-12-29 10:48:39 GMT - The Transport NOC advises Field Operations
  has replaced the impacted cards and the replacement cards have
  booted up and are continuing to stabilize. The Transport NOC is
  monitoring to confirm impacted services have restored.


* 2018-12-29 09:40:22 GMT - The Transport NOC advises Field Operations
  has received the line cards and, in cooperation with the equipment
  vendor, is commencing with replacements.


* 2018-12-29 08:33:07 GMT - The Transport NOC has provided updated
  estimated time of arrivals for the replacement cards of 08:30 GMT
  and 09:00 GMT. Field Operations are on site and will replace the
  affected cards immediately upon receiving the replacement cards. The
  Transport NOC and Field Operations are continuing with
  troubleshooting efforts for the remaining impacted sites.


* 2018-12-29 07:21:10 GMT - The Transport NOC reports continued repair
  progress as multiple Optical Channel Groups have restored.
  Replacement line cards have been ordered for impacted sites with an ETA
  of 07:45 GMT and 08:30 GMT. Troubleshooting efforts remain ongoing
  at the remaining impacted sites by Field Operations and an equipment
  vendor.


* 2018-12-29 05:40:47 GMT - The Transport NOC has advised that
  additional Optical Carrier Groups have restored; however,
  collaborative troubleshooting continues at the necessary locations,
  as multiple out of service Optical Carrier Groups remain.


* 2018-12-29 05:15:24 GMT - The Transport NOC has advised that
  additional Optical Carrier Groups have restored; however,
  collaborative troubleshooting continues at the necessary locations,
  as multiple out of service Optical Carrier Groups remain.


* 2018-12-29 03:52:30 GMT - The Transport NOC advises that Field
  Operations personnel are at the final two sites and are currently
  troubleshooting with the assistance from the equipment vendor.


* 2018-12-29 02:34:38 GMT - The Transport NOC has advised that
  multiple Optical Carrier Groups have been cleared either remotely or
  with the assistance of Field Operations once they dispatched to
  impacted sites. Additional Field Operations have been
  dispatched to clear the remaining Optical Carrier Groups that are
  still out of service and cannot be restored remotely.


* 2018-12-29 01:25:17 GMT - The Transport NOC continues to work with
  the Equipment Vendor's Support Teams to investigate multiple Optical
  Carrier Groups that are still out of service impacting Point to
  Point Transport Waves as well as Ethernet Private Lines. Both
  CenturyLink and the Equipment Vendor's Field Operations teams
  have dispatched to the necessary sites to assist with
  isolation. Additional cards have been ordered and shipped to sites
  across the United States in an effort to restore the Optical Carrier
  Groups to complete full network restoral.


* 2018-12-29 00:31:23 GMT - Field Operations in cooperation with the
  Engineering teams have repaired the span traversing the western
  United States through loop testing. Once the equipment was restored,
  additional capacity was in turn available to the span on the
  CenturyLink Network. IP, Voice, and Ethernet Access services are
  expected to have restored with the now available
  capacity. Point-to-Point Transport Waves as well as Ethernet Private
  Lines may still experience issues while the remainder of the final
  card issues are resolved. Lingering latency may be present, which is
  anticipated to subside as routing continues to normalize. If issues
  are still being experienced with your IP, Voice, and Ethernet Access
  services please contact the CenturyLink Repair Center.


* 2018-12-28 23:02:29 GMT - As the Equipment Vendor and CenturyLink
  Engineering teams continue to work to clear the lingering card
  issues it has been confirmed that alarms continue to clear, and network
  capacity is being restored. Efforts will remain ongoing to continue
  to resolve any further issues identified.


* 2018-12-28 21:42:05 GMT - The Transport NOC has confirmed that
  visibility has been restored to all nodes, allowing triage of the
  additional cards to be completed. Engineering continues to review
  the network to identify, review, and clear the remaining alarms and
  issues observed. Field Operations continue to remain on standby and
  dispatch to sites as necessary to assist with isolation and
  resolution.


* 2018-12-28 20:31:40 GMT - Efforts to complete the line card resets
  remain ongoing, while additional support teams continue to triage
  chassis within a smaller set of nodes that did not have full
  visibility restored as well as additional line cards within the
  network. The highest level of Engineering support from both the
  Equipment Vendor as well as CenturyLink continue to diligently work
  to restore services.


* 2018-12-28 19:27:05 GMT - CenturyLink Engineering in cooperation
  with the Equipment Vendor's Tier IV Support continue to
  systematically review the network alarms and triage line cards within the
  network to ensure remote resets or physically reseats on site can be
  completed.


* 2018-12-28 18:23:33 GMT - The Transport NOC has confirmed that
  visibility has been restored to the majority of the network outside
  of a few remaining nodes that are in various states of
  recovery. Engineering has identified the line cards that will need
  to be reset and are working diligently to perform the necessary
  actions to bring all cards back online.


* 2018-12-28 17:15:20 GMT - It has been confirmed that visibility has
  been restored to the majority of the nodes across the network.
  Field Operations have been dispatched to assist with recovering
  visibility to the few remaining nodes. Engineering is working to
  systematically review the network alarms on the other nodes and are
  then performing remote manual resets to individual cards that remain
  in alarm. Reinstate times for each card may vary significantly, as
  such an estimated completion time is not yet available. If cards do
  not automatically reinstate after remote resets complete, Field
  Operations are standing by to dispatch as needed. The Equipment
  Vendor's Tier IV team continues to assist with the resolution
  efforts.


* 2018-12-28 13:35:00 GMT - Efforts by the Equipment Vendor and
  CenturyLink engineers to apply the filters and remove the secondary
  communication channels in the network continue. The previously
  provided ETR of 09:00 GMT remains.


* 2018-12-28 13:27:30 GMT - The Equipment Vendor and CenturyLink
  engineers continue work to apply the filters and remove the
  secondary communication channels. Field Operations and Equipment
  Vendor dispatches to recover nodes locally remain underway. Services
  continue to restore in a steady manner as troubleshooting progresses
  following the recovery of nodes. CenturyLink NOC management remains
  in contact with the equipment vendor to obtain updates as
  restoration efforts continue.


* 2018-12-28 11:04:24 GMT - CenturyLink continues to work with the
  Equipment Vendor to apply the filters and remove the secondary
  communication channels. Field Operations and Equipment Vendor dispatches
  to recover nodes locally remain underway. Client services continue
  to restore in a steady manner as troubleshooting progresses
  following the recovery of nodes.


* 2018-12-28 10:05:18 GMT - CenturyLink NOC Management reports steady
  progression of node recovery and restoral of client services. In
  addition to the remote node recovery process, Field Operations
  continue to dispatch and assist the Equipment Vendor with local
  equipment login.


* 2018-12-28 08:51:29 GMT - CenturyLink NOC Management has advised
  that repair efforts are steadily progressing, and services are
  incrementally restoring. The Equipment Vendor and CenturyLink engineers
  continue work to apply the filters and remove the secondary
  communication channels at this time. There have been additional
  restoration steps identified for certain nodes, which includes
  either line card resets or Field Operations dispatches for local
  equipment login, that have impeded the restoration process. Various
  repair teams are working in tandem on these actions to ensure that
  services are restored in the most expeditious method
  available. Restoration efforts are ongoing.


* 2018-12-28 07:12:32 GMT - Efforts by the Equipment Vendor and
  CenturyLink engineers to apply the filters and remove the secondary
  communication channels in the network continue. Additional
  information on repair progress will be available from the Equipment
  Vendor by 07:30 GMT. Information will be relayed as soon as it is
  obtained.


* 2018-12-28 06:00:01 GMT - Efforts by the Equipment Vendor and
  CenturyLink engineers to apply the filters and remove the secondary
  communication channels in the network continue. The previously
  provided ETR of 09:00 GMT remains.


* 2018-12-28 04:58:44 GMT - CenturyLink engineers in conjunction with
  the Equipment Vendor's Tier IV Technical Support team have
  identified the elements causing the impact to customer
  services. Through the filters being applied and the removal of the
  secondary communication channels, it is anticipated services will be
  fully restored within four hours. We apologize for any
  inconvenience this caused our customers. Additional details
  regarding details of the underlying cause will be relayed as
  available.


* 2018-12-28 04:09:31 GMT - The Equipment Vendor's Tier IV
  Technical Support team in conjunction with CenturyLink Tier III
  Technical Support continues to remotely work to remove the
  secondary communication channel tunnels across the network until
  full visibility can be restored, as well as applying the necessary
  polling filter to each of the reachable nodes.


* 2018-12-28 02:53:38 GMT - The Transport NOC has confirmed that
  cooperative efforts remain ongoing to remove the secondary
  communication channel tunnel across the network until full
  visibility can be restored, as well as applying the necessary filter
  to each of the reachable nodes. It has been confirmed that both of
  these actions are being performed remotely, but an estimated time to
  complete the activities is not available at this time.


* 2018-12-28 01:58:56 GMT - Once the card was removed in Denver, CO it
  was confirmed that there was no significant improvement. Additional
  packet captures, and logs will be pulled from the device with the
  card removed to further isolate the root cause. The Equipment vendor
  continues to work with CenturyLink Field Operations at multiple
  sites to remove the secondary communication channel tunnel across
  the network until full visibility can be restored. The equipment
  vendor has identified a number of additional nodes that visibility
  has been restored to, and their engineers are currently working to
  apply the necessary filter to each of the reachable nodes.


* 2018-12-28 00:59:04 GMT - Following the review of the logs and
  packet captures, the Equipment Vendor's Tier IV Support team has
  identified a suspected card issue in Denver, CO. Field Operations
  has arrived on site and are working in cooperation with the
  Equipment Vendor to remove the card.


* 2018-12-27 23:57:16 GMT - The Equipment Vendor is currently
  reviewing the logs and packet captures from devices that have been
  completed, while logs and packet captures continue to be pulled
  from additional devices. The necessary teams continue to remove a
  secondary communication channel tunnel across the network until
  visibility can be restored. All technical teams continue to
  diligently work to review the information obtained in an effort to
  isolate the root cause.


* 2018-12-27 22:52:43 GMT - Multiple teams continue work to pull
  additional logs and packet captures on devices that have had
  visibility restored, which will be scrutinized during root cause
  analysis. The Tier IV Equipment Vendor Technical Support team in
  conjunction with Field Operations are working to remove a secondary
  communication channel tunnel across the network until visibility can
  be restored. The Equipment Vendor Support team has dispatched their
  Field Operations team to the site in Chicago, IL and has been
  obtaining data directly from the equipment.


* 2018-12-27 21:35:55 GMT - It has been advised that visibility has
  been restored to both the Chicago, IL and Atlanta, GA sites.
  Engineering and Tier IV Equipment Vendor Technical Support are currently
  working to obtain additional logs from devices across multiple sites
  including Chicago and Atlanta to further isolate the root cause.


* 2018-12-27 21:01:26 GMT - On December 27, 2018 at 02:40 GMT,
  CenturyLink identified a service impact in New Orleans, LA. The NOC
  was engaged and investigating in order to isolate the cause. Field
  Operations were engaged and dispatched for additional
  investigations.  Tier IV Equipment Vendor Support was later
  engaged. During cooperative troubleshooting a device in San Antonio,
  TX was isolated from the network as it was seeming to broadcast
  traffic consuming capacity, which seemed to alleviate some
  impact. Investigations remained ongoing. Following the isolation of
  the San Antonio, TX device troubleshooting efforts focused on
  additional sites that teams were remotely unable to
  troubleshoot. Field Operations were dispatched to sites in Kansas
  City, MO, Atlanta, GA, New Orleans, LA and Chicago, IL. Tier IV
  Equipment Vendor Support continued to investigate the equipment logs
  to further assist with isolation. Once visibility was restored to
  the site in Kansas City, MO and New Orleans, LA a filter was applied
  to the equipment to further alleviate the impact observed. All of
  the necessary troubleshooting teams in cooperation with Tier IV
  Equipment Vendor Support are working to restore remote visibility to
  the remaining sites at this time. Tier IV Equipment Vendor Technical
  Support continues to review equipment logs from the sites where
  visibility was previously restored. We understand how important
  these services are to our clients and the issue has been escalated
  to the highest levels within CenturyLink Service Assurance
  Leadership.
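
For scale, the start and stop times quoted in the Event Conclusion
Summary can be turned into a total duration. This is only a small
sketch: it uses the two timestamps exactly as the summary gives them,
treats GMT as UTC, and the helper name hours_and_minutes is made up
for the illustration.

```python
from datetime import datetime, timezone

# Times as quoted in the Event Conclusion Summary (GMT treated as UTC).
outage_start = datetime(2018, 12, 27, 8, 40, tzinfo=timezone.utc)
outage_stop = datetime(2018, 12, 29, 10, 12, tzinfo=timezone.utc)


def hours_and_minutes(delta):
    """Split a timedelta into whole hours and leftover minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


hours, minutes = hours_and_minutes(outage_stop - outage_start)
print(f"Outage lasted {hours} h {minutes} min")  # Outage lasted 49 h 32 min
```

Note that the earliest status update gives 02:40 GMT, not 08:40 GMT,
for the initial impact; taking that time as the start would make the
total 55 h 32 min.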

https://fuckingcenturylink.com/

***** Moderator's Note *****

This notice doesn't mention 911. That's puzzling: there were outages
of 911 service in many areas, although they are reported as being
limited to cellular users.

The report implies that a fault occurred in several high-capacity
MUXes, which IIRC wouldn't usually be used to carry 911 traffic. My
experience was all in wireline, so I'll ask those of you who work in
the mobile world if Centurylink is allowed to have mobile switches
carry traffic across LATA boundaries.
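
For readers trying to picture the failure mode, here is a toy model of
what the report describes: frames with no expiration being re-sent
over redundant (secondary) routes. It is only a sketch. The report
names neither the equipment, the protocol, nor the topology, so the
four-node ring, the node names, and the flood() function below are all
hypothetical, and the "filtered" set is merely a stand-in for the
polling filters.

```python
from collections import deque

# Hypothetical topologies: a ring has a redundant path, a line does not.
RING = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
LINE = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}


def flood(links, origin, max_steps, ttl=None, filtered=frozenset()):
    """Count frame copies forwarded when every node re-sends what it receives.

    links:     dict mapping node -> list of neighbours
    max_steps: safety cap; reaching it means the frame never died out
    ttl:       hop limit, or None for a frame with no expiration
    filtered:  nodes that drop the frame on receipt (filter stand-in)
    """
    queue = deque([(origin, None, ttl)])  # (node, arrived_from, hops_left)
    forwarded = 0
    while queue and forwarded < max_steps:
        node, came_from, hops = queue.popleft()
        if node in filtered:
            continue  # dropped by the filter at this node
        if hops is not None and hops == 0:
            continue  # hop limit exhausted
        for neighbour in links[node]:
            if neighbour == came_from:
                continue  # do not send straight back to the sender
            forwarded += 1
            next_hops = None if hops is None else hops - 1
            queue.append((neighbour, node, next_hops))
    return forwarded


if __name__ == "__main__":
    print(flood(RING, "A", max_steps=1000))                            # 1000
    print(flood(LINE, "A", max_steps=1000))                            # 3
    print(flood(RING, "A", max_steps=1000, filtered={"B", "C", "D"}))  # 2
    print(flood(RING, "A", max_steps=1000, ttl=3))                     # 6
```

In this model the frame dies out only when the redundant link is
removed, the receiving nodes filter it, or it carries a hop limit.
The first two correspond loosely to the fix actions listed in the
report.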

Bill Horne
Moderator


--  
Bill Horne
(Remove QRM from my email address to write to me directly)
