At 3:52pm today, one of our Cisco ASR routers experienced a crash within its routing engine.
This caused the router to stop routing immediately, and any destinations reached via the router experienced an outage.
Unfortunately, the crash did not sever connectivity cleanly. Instead, it caused “flapping”, where routes are repeatedly introduced and removed, causing instability. Once we identified the flapping, we severed all network connectivity to the affected router.
After a few minutes, BGP failover took over and traffic was re-routed via alternative paths, as it is designed to do. This is how a normal crash would be handled.
The router crashed in such a way that we had to physically power cycle it to regain control. We then brought its routing back online in a slow and controlled fashion to prevent any further disruption to the network.
After some research, it appears that we hit CSCus82903, a known Cisco bug in our version of the routing software.
The bug was triggered when we attempted to bring our new IP connectivity provider, GTT, online this afternoon. This is normally a routine procedure with no impact on customer traffic.
Our routers are currently stable and operating normally. However, we need to perform emergency maintenance to upgrade the routers' software to a patched version provided by Cisco.
We expect to complete this maintenance without causing any additional outages, although network routing should be considered “at risk” during the actual software upgrade.
In the meantime, our GTT connection has been kept offline to prevent the issue from recurring. We will re-establish the connection once the software upgrades are complete.