At 6:55am this morning, we saw Hypervisor DT-324 crash with a pink screen of death (PSOD).
The hypervisor has been rebooted following a hardware check, and any virtual machines located on it are now being booted.
As this is the second PSOD on this hypervisor in the last year, we will now take steps to replace it.
At 3:52pm today, one of our Cisco ASR routers experienced a crash within its routing engine.
This caused the router to stop routing instantly, and any destinations reached via the router experienced an outage.
Unfortunately, this did not sever connectivity cleanly: it started causing "flapping", where routes are repeatedly announced and withdrawn, causing instability. Once this flapping was identified, we severed all network connectivity to the affected router.
After a few minutes, BGP failover kicked in and traffic re-routed via alternative paths, as designed. This is how a normal crash would be handled.
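For context on why flapping is so damaging: BGP routers defend against unstable prefixes with route-flap dampening (RFC 2439), which penalises each flap and temporarily suppresses the route. A minimal Python sketch of that accounting, using Cisco-style default thresholds (the class and constant names are illustrative, not our tooling):

```python
import math

# Illustrative sketch of BGP route-flap dampening accounting (RFC 2439).
# Constants mirror common Cisco IOS defaults; all names are hypothetical.
FLAP_PENALTY = 1000      # penalty added per withdraw/re-announce cycle
SUPPRESS_LIMIT = 2000    # suppress the route above this penalty
REUSE_LIMIT = 750        # re-advertise the route below this penalty
HALF_LIFE = 15 * 60      # penalty halves every 15 minutes (in seconds)

class DampenedRoute:
    """Tracks the dampening penalty and suppression state for one prefix."""

    def __init__(self):
        self.penalty = 0.0
        self.suppressed = False

    def flap(self):
        """Record one flap and suppress the route if it crosses the limit."""
        self.penalty += FLAP_PENALTY
        if self.penalty >= SUPPRESS_LIMIT:
            self.suppressed = True

    def tick(self, elapsed_seconds: float):
        """Decay the penalty exponentially; release the route below reuse."""
        self.penalty *= math.pow(0.5, elapsed_seconds / HALF_LIFE)
        if self.suppressed and self.penalty < REUSE_LIMIT:
            self.suppressed = False

route = DampenedRoute()
route.flap()
route.flap()              # two rapid flaps: penalty 2000, route suppressed
print(route.suppressed)   # True
route.tick(30 * 60)       # two half-lives later: penalty 500, route reused
print(route.suppressed)   # False
```

A suppressed route stays withdrawn even while it is reachable, which is why severing the unstable router quickly matters: persistent flapping can leave prefixes dampened across the wider internet for some time after the fault itself ends.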
The router crashed in such a way that it had to be physically power-cycled before we could regain control. We then brought its routing back online in a slow, controlled fashion to prevent any further disruption to the network.
After some research, it appears that we hit CSCus82903, a known Cisco bug in the version of routing software we run.
This was triggered when attempting to bring online our new IP connectivity provider, GTT, this afternoon, normally a routine procedure with no impact on customer traffic.
Our routers are currently stable and operating normally; however, we need to perform some emergency maintenance to upgrade the routers to a patched software version provided by Cisco.
This should be possible without causing any additional outages, although network routing should be considered "at risk" during the actual software upgrade.
In the meantime, our GTT connection has been kept offline to prevent the issue from reappearing. We will re-establish the connection once the software upgrades are complete.
At 11:55pm tonight, our connection to Level3 appeared to disconnect. The symptoms are identical to previous occasions where the Level3 router rebooted, so we assume this to be the case again tonight.
The connection is slowly coming back online, which further supports this suspicion.
As Level3 appear to be unconcerned by a router that reboots itself, we have already made plans to migrate to an alternative transit supplier. This is due to take place in early February 2017.
Once this replacement is online, we will likely retire our Level3 connectivity entirely.
Some routes may have experienced a brief period of packet loss or increased latency during this transit fault whilst routers around the internet changed paths to our alternate transit links. The majority of destinations, however, would have been entirely unaffected.
As this is caused by routers outside our control, we have no way to prevent this brief packet loss to some destinations.
Any paths via our peering links or other transit connectivity are unaffected.
Yesterday, at around 11:50am, we had some reports of unusual activity on our ring network.
We traced this to elevated CPU usage on our switches. Further investigation pointed to a high level of IGMP traffic being received.
The traffic was traced to our LINX Extreme peering port in London, and shutting the port down resolved the issue instantly.
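As a simplified illustration of the kind of check that flags this sort of storm: count packets per source over a short sliding window and alert when a source exceeds a rate threshold. All names and thresholds below are hypothetical, not our monitoring:

```python
from collections import deque

# Hypothetical sketch of a per-source rate check for spotting a traffic
# storm such as an IGMP flood; thresholds and names are illustrative only.
WINDOW_SECONDS = 10
THRESHOLD_PPS = 100   # packets/second above which a source looks abnormal

class StormDetector:
    def __init__(self):
        self.history = {}  # source -> deque of recent packet timestamps

    def packet(self, source: str, timestamp: float) -> bool:
        """Record one packet; return True if the source exceeds the threshold."""
        q = self.history.setdefault(source, deque())
        q.append(timestamp)
        # Discard timestamps that have fallen outside the sliding window.
        while q and q[0] < timestamp - WINDOW_SECONDS:
            q.popleft()
        return len(q) / WINDOW_SECONDS > THRESHOLD_PPS
```

In practice, switches expose this as storm-control or control-plane policing counters rather than per-packet code, but the principle, rate per source over a window, is the same.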
We contacted our Connexions provider, who provide our LINX ports. They investigated to ensure the traffic was not originating internally and then passed the query on to LINX.
LINX confirmed this morning that they had a member port injecting IGMP traffic to the peering LAN yesterday and that this has now been resolved.
We have re-enabled our LINX Extreme port and confirmed the issue no longer exists.
We will re-enable our peering connections via this LAN shortly.
Earlier this evening we identified one of our Cisco switch stacks misbehaving, causing a constant stream of stack reconvergences. These repeated reconvergence events have been causing layer 2 network instability for traffic flowing via the stack of nine switches.
We have just completed a physical inspection of all stack cables, including a full reseating of the cables as per Cisco's guidelines; however, the problem still persists.
It is possible that the stack issues are caused by a software fault within the Cisco IOS software.
We are currently applying an upgrade to the switch stack and will reboot the full stack afterwards to activate the changes.
Cisco advise performing this as a full cold reboot, removing power from the stack members, so this reload will take longer than usual.
Any customers connected to a different switch stack will only see a momentary outage during a layer 2 reconvergence.
Customers directly connected to switch stack DS-101 will see a total outage for up to 15 minutes.
At 4:35pm we received alerts that one of our hypervisors in Telecity Williams House, Manchester had ceased to process disk activity.
No hardware alerts were present, so a forced reboot of the hypervisor was required.
The server is now responding normally, and all virtual machines have powered back up.
From just before 8am UK local time this morning, we have seen reports of intermittent connectivity issues with both BT and Plusnet.
We believe this is due to a power outage in one of the Telehouse datacentres in London.
(This is in addition to a reported power outage in Telecity Harbour Exchange Square in London yesterday which also affected BT)
This outage does not directly affect Netnorth, however it seems to be causing congestion for BT and Plusnet which means you may have trouble reaching some destinations if you use one of these providers for your internet connection.
This may also affect VirginMedia, however we have a direct link to VM in Manchester which bypasses most congestion on their network.
As the fault lies external to our network, we are unable to take any remedial action from our side. The BT issue lies inside BT’s network at this time.
We have currently lost a resilient path between two of our Manchester datacentre locations. Our building interconnection provider is currently performing maintenance on their equipment; however, this outage was unexpected.
We have reported it to the vendor for investigation.
In the meantime, our TCW site is running at reduced resilience.
Our other sites continue to operate with full resilience.
At 17:28 (UK Local Time), we noticed a drop in our Level3 connectivity.
Level3 remains unavailable to us at this time. We have opened a ticket with them to investigate.
Our network has reconverged to use our alternate providers. This would have caused some instability for any routes via Level3 (but not via our other paths) during the convergence.
Since 11:55pm UK local time, we’ve had unstable IPv6 connectivity via Level3 in Manchester (approx. 60% packet loss).
This has been confirmed from several Level3 router locations (Manchester, London, Amsterdam) and appears to be quite widespread.
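For illustration, loss like this is typically confirmed by probing from multiple vantage points and comparing the loss rates. A minimal sketch of that summary (all probe counts below are made-up stand-ins, not our measurements):

```python
# Hypothetical sketch of summarising packet loss across several vantage
# points; the probe counts below are illustrative stand-ins for real pings.
def loss_percent(sent: int, received: int) -> float:
    """Packet loss as a percentage of probes sent."""
    return 100.0 * (sent - received) / sent

# (sent, received) per vantage point, made-up numbers around a 60% loss
probes = {
    "Manchester": (100, 41),
    "London":     (100, 38),
    "Amsterdam":  (100, 43),
}

for site, (sent, received) in probes.items():
    print(f"{site}: {loss_percent(sent, received):.0f}% loss")
```

Seeing broadly similar loss from several independent locations is what points at a fault inside the provider's network rather than on any single path into it.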
As a result, we have temporarily removed our IPv6 connectivity via Level3 pending an update from them regarding this issue.
Note: this does not affect IPv4 traffic, nor IPv6 destinations reachable via our alternate connectivity providers.