At 6:32pm this evening, we were alerted to a full outage of our network in both Bolton and Manchester.
After remote diagnostics failed, an engineer attended one of the sites to perform detailed analysis.
Early analysis showed a large volume of traffic on our core ring network, which looked like a Denial of Service attack. We tried to identify the target so that we could blackhole the destination, but could not find one (this is unusual).
After further diagnostics we discovered that we were seeing duplicate traffic on our core network – this should never happen as we have safeguards in place to prevent this.
We discovered that one of our fibre interconnects was behaving abnormally. Our fibre circuit from Virgin Media, which connects one of our Manchester sites to one of our Bolton sites, appears to have stopped passing some traffic. Unfortunately, the traffic that was no longer passing over the link was a key part of our loop protection safeguards.
As a result, our core network began to loop, which caused all traffic on the core network to duplicate exponentially until all core links were saturated.
We have isolated the Virgin Media circuit from our ring until it is cleared by Virgin Media. Our remaining interconnects are up and running as normal.
We will request an investigation from Virgin Media, as our circuit is explicitly required to support STP (the protocol which failed this evening).
Just to add a little further detail…
We use a Cisco Layer 2 switched ring which runs MST (Multiple Spanning Tree, a variant of the spanning-tree protocol).
MST/STP is a protocol which detects loops in a ring and severs part of the ring to prevent a loop. It does this in such a way that it can bring the severed section online within a few ms in the event of a break elsewhere in the ring.
The Virgin Media circuit this evening stopped passing these MST/STP packets, so our switches (quite rightly) assumed there was a break in the ring and started sending traffic along the section of the ring that is normally kept severed.
However, the ring was actually complete.
This caused all traffic to go around and around the ring forever, constantly building in volume until it saturated the links and devices.
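To illustrate why a complete ring with no severed section fills up, here is a deliberately simplified sketch. It is a toy model, not a description of our switches: the function name `frames_in_flight`, the four-switch ring and the one-broadcast-per-tick rate are all assumptions made for illustration. Each tick, switch 0 floods one broadcast in both directions around the ring, and a frame is only ever discarded when it reaches the severed link.

```python
def frames_in_flight(num_switches, blocked_link, ticks, new_per_tick=1):
    """Return the number of broadcast copies circulating after each tick.

    Switches 0..n-1 form a ring; link i joins switch i and switch (i + 1) % n.
    blocked_link is the index of the link that is kept severed, or None if
    nothing is severed (the failure case described above).
    """
    frames = []   # each entry: (switch the frame is at, direction +1 or -1)
    history = []
    for _ in range(ticks):
        # Switch 0 floods each new broadcast out of both of its ring ports.
        for _ in range(new_per_tick):
            frames.append((0, +1))
            frames.append((0, -1))

        moved = []
        for switch, direction in frames:
            # Work out which link the frame must cross to reach the next switch.
            link = switch if direction == 1 else (switch - 1) % num_switches
            if link == blocked_link:
                continue  # the severed link discards the frame
            moved.append(((switch + direction) % num_switches, direction))

        frames = moved
        history.append(len(frames))
    return history


if __name__ == "__main__":
    print("severed link in place:", frames_in_flight(4, blocked_link=2, ticks=10))
    print("ring fully connected: ", frames_in_flight(4, blocked_link=None, ticks=10))
```

With a severed link the count settles at a small constant, because every frame eventually reaches the severed link and is dropped. With no severed link nothing is ever dropped, so the count keeps rising for as long as new traffic arrives. In this toy model the rise is steady; real traffic volumes mean the links saturate very quickly.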
The MST/STP protocol is a safeguard, and it was effectively disabled/removed by Virgin Media.
We will ensure that the protocol is operating before reconnecting the fibre circuit.
We have now implemented a failsafe for the failsafe function of our ring network, which should prevent a third-party fibre provider from causing this issue again. In future, our switches will detect the loss of functionality and disable the link as a precaution.
Virgin Media are still investigating the circuit, and it remains removed and sandboxed from our ring.
Only when we are satisfied that it is operating under the correct conditions will we reincorporate the circuit into our ring.
(The additional failsafe will be active on the circuit once it is back in service.)
Virgin Media have confirmed that the circuit is now operating normally.
The outage was caused by a firmware fault on the NTU at the Manchester end of the circuit, which stopped some protocols from being passed across it. It also caused their management of the device to fail.
The solution was to power cycle the NTU, which took place earlier today.
We are now seeing STP on the sandbox at the Bolton end of the link and this has been stable for a few hours.
We will schedule reintegration of the circuit into our ring network shortly.
If this event happens again in future, the failsafe for the failsafe should automatically shut down the circuit within a couple of seconds.
(It is triggered by not receiving STP packets, so it requires a timeout to occur first.)
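The timeout behaviour described above can be sketched roughly as follows. This is an illustration of the logic only, not the configuration running on our switches: the class name `BpduWatchdog`, its methods and the 2-second timeout are assumptions chosen to match "a couple of seconds". Timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class BpduWatchdog:
    """Disable a link if no STP packet (BPDU) has been seen within a timeout."""

    def __init__(self, timeout_seconds=2.0, now=0.0):
        self.timeout_seconds = timeout_seconds  # assumed value, for illustration
        self.last_bpdu_at = now                 # time the last STP packet arrived
        self.link_enabled = True

    def bpdu_received(self, now):
        """Record that an STP packet arrived on the link at time `now`."""
        self.last_bpdu_at = now

    def check(self, now):
        """Disable the link if STP packets have stopped; return the link state.

        Once disabled, the link stays down until it is re-enabled by hand,
        even if STP packets start arriving again.
        """
        if self.link_enabled and now - self.last_bpdu_at > self.timeout_seconds:
            self.link_enabled = False
        return self.link_enabled

    def manually_reenable(self, now):
        """Bring the link back after an engineer has verified the circuit."""
        self.last_bpdu_at = now
        self.link_enabled = True


if __name__ == "__main__":
    watchdog = BpduWatchdog(timeout_seconds=2.0, now=0.0)
    watchdog.bpdu_received(now=1.0)
    print(watchdog.check(now=2.5))  # STP seen 1.5s ago: link stays up
    print(watchdog.check(now=3.5))  # STP silent for 2.5s: link is disabled
```

Because the trigger is the absence of STP packets, the link can only be shut down after the timeout has elapsed, which is why the reaction takes a couple of seconds and is not instant. Keeping the link down until it is manually re-enabled reflects our approach of verifying the circuit before reconnecting it.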