At approximately 2:56pm UK local time today, we saw packet loss across our entire metro network for around 4 minutes.
Initial diagnostics showed that one of the Cisco 3850-series switches in our core network ring had failed: it was stuck in a boot loop with failing self-tests.
Ordinarily, this should not have caused 4 minutes of outage.
Our network operates using multiple layer-2 rings around our various datacentres, plus a metro ring between Bolton and Manchester. The switch that failed today was a member of multiple rings and a core link in the metro ring. It was also the spanning-tree root switch, so its failure forced a full network reconvergence, which extended the outage time.
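For context, the root switch in a spanning-tree topology is chosen by an election: every switch compares bridge IDs (a configurable priority, with the MAC address as a tie-breaker) and the lowest wins, so losing the current root forces every remaining switch to agree on a new one. The minimal sketch below illustrates that election; the switch names, priorities and MAC addresses are made up purely for illustration.

```python
# Illustrative only: how spanning tree picks a root bridge. The bridge with
# the lowest (priority, MAC) pair wins. When the current root disappears,
# every remaining switch must re-run this election and rebuild the topology,
# which is the "full network reconvergence" described above.

bridges = {
    # name: (STP priority, MAC address) - hypothetical values
    "core-sw1": (4096,  "00:1a:aa:00:00:01"),   # today's failed root switch
    "core-sw2": (8192,  "00:1a:aa:00:00:02"),
    "ring-sw3": (32768, "00:1a:aa:00:00:03"),
    "ring-sw4": (32768, "00:1a:aa:00:00:04"),
}

def elect_root(candidates):
    """Return the bridge with the lowest bridge ID (priority, then MAC)."""
    return min(candidates, key=lambda name: candidates[name])

print("Root before failure:", elect_root(bridges))   # core-sw1
bridges.pop("core-sw1")                               # the root switch dies
print("Root after failure: ", elect_root(bridges))   # core-sw2
```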
We should note that this is the only Cisco 3850-series switch that we have had fail, and they are generally very reliable switch models.
Further analysis this evening also highlighted some additional issues. Some switches in our rings were operating at 100% CPU usage, which caused reconvergence to be slower than usual. This was due to a Cisco ARP inspection feature, which has now been disabled; CPU usage is now at a normal 6-7%.
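For those interested in the detail, the sketch below shows the kind of check that catches this sort of CPU creep: it polls a switch's five-minute CPU figure over SSH using the Netmiko library. The hostnames, credentials and 80% threshold are placeholders rather than our production values.

```python
# Minimal sketch: poll each ring switch over SSH and warn if CPU usage creeps
# back up. Requires the netmiko package; all values below are placeholders.
import re
from netmiko import ConnectHandler

RING_SWITCHES = ["switch-a.example.net", "switch-b.example.net"]  # placeholders

def five_minute_cpu(host):
    conn = ConnectHandler(device_type="cisco_ios", host=host,
                          username="readonly", password="secret")
    output = conn.send_command("show processes cpu | include CPU utilization")
    conn.disconnect()
    # Example line:
    # "CPU utilization for five seconds: 7%/1%; one minute: 6%; five minutes: 6%"
    match = re.search(r"five minutes: (\d+)%", output)
    return int(match.group(1)) if match else None

for host in RING_SWITCHES:
    cpu = five_minute_cpu(host)
    if cpu is not None and cpu > 80:
        print(f"WARNING: {host} five-minute CPU is {cpu}%")
    else:
        print(f"{host}: five-minute CPU {cpu}%")
```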
To compound matters, the high CPU usage also caused a 20Gbps LACP link to one of our core routers to drop (under CPU contention the switch could not reliably process LACP control traffic), which had the knock-on effect of further routing instability.
By correcting this CPU usage we should also have mitigated the core router issue in future: the LACP port group should remain stable even during a reconvergence.
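As a rough illustration of how one might verify this after a change (again using Netmiko; the hostname, credentials and port-channel name are placeholders, not our production configuration), the sketch below counts how many member ports of an aggregate are currently bundled via LACP.

```python
# Minimal sketch of a post-change check: confirm that an aggregate link still
# has both member ports bundled. All names below are placeholders.
import re
from netmiko import ConnectHandler

def bundled_members(host, port_channel="Po1"):
    conn = ConnectHandler(device_type="cisco_ios", host=host,
                          username="readonly", password="secret")
    output = conn.send_command("show etherchannel summary")
    conn.disconnect()
    for line in output.splitlines():
        if port_channel in line:
            # Member ports flagged "(P)" are successfully bundled via LACP.
            return len(re.findall(r"\(P\)", line))
    return 0

count = bundled_members("ring-switch.example.net")
print("Bundled members on Po1:", count)  # expect 2 for a 2x10Gbps LACP group
```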
Rather than replace the failed Cisco 3850-series switch, we migrated all of its ports over to our new ring network ahead of schedule; this migration is planned for every switch in our existing core network over the coming months.
As a side effect of the switch failure, we also lost the primary links from our BOL8 hypervisor platform, although secondary links to alternate switches remained online. This is a scenario that is planned for and should not cause any issues in operation, and the hypervisors acted as expected, failing over to their secondary links on another switch. However…
We use EqualLogic SANs for our hypervisor storage, and had followed the vendor's recommended network configuration by connecting ports eth0+eth1 to one switch, and eth2+mgmt to a second switch. Unfortunately, the switch housing eth0+eth1 was the one that failed today, and some hypervisors had both their primary and backup paths on ports eth0+eth1 (not eth2), which caused them to lose sight of some SAN storage exports. This only affected 2 exports, but it is still unacceptable: a primary path failure should not affect storage availability.
Path selection on EqualLogic SANs is decided by the SAN itself. Hypervisors talk to a group IP on the SAN, which redirects each connection to a particular port for resilience. Unfortunately, this did cause a couple of connections to end up on two ports connected to the same switch, which then failed, taking both primary and failover paths offline.
We have now deviated from the vendor's recommended layout and connected ports eth0, eth1, and eth2 of each EqualLogic SAN in BOL8 to three different switches (three ports, three switches). The management port is reserved purely for management purposes and carries no storage traffic, so it shares one of the three switches.
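To make the change concrete, the sketch below compares the old and new cabling and checks whether a pair of iSCSI paths lands on two different physical switches. The switch names are placeholders, and the path pairing shown is the one that caused today's issue (both paths on eth0+eth1).

```python
# Illustrative path-diversity check. Under the old layout, eth0 and eth1 were
# cabled to the same switch, so a connection whose primary and failover paths
# both used those ports had no switch diversity. Under the new layout every
# data port is on a different switch. Switch names are placeholders.

OLD_CABLING = {"eth0": "switch-1", "eth1": "switch-1",
               "eth2": "switch-2", "mgmt": "switch-2"}
NEW_CABLING = {"eth0": "switch-1", "eth1": "switch-2",
               "eth2": "switch-3", "mgmt": "switch-1"}  # mgmt: no storage traffic

def switch_diverse(cabling, primary, failover):
    """True if the two paths terminate on different physical switches."""
    return cabling[primary] != cabling[failover]

# The pairing that caused today's storage issue: both paths on eth0/eth1.
print("old cabling, eth0+eth1 diverse?", switch_diverse(OLD_CABLING, "eth0", "eth1"))  # False
print("new cabling, eth0+eth1 diverse?", switch_diverse(NEW_CABLING, "eth0", "eth1"))  # True
```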
We will also migrate the storage paths of our EqualLogic SANs in BOL23 to separate switches, to prevent a similar issue in that datacentre.
This will mitigate this particular issue in future, as a switch failure will only ever take down a single SAN port; all our hypervisors have dual paths to each SAN export and would therefore always have at least one remaining path available.
Although our hypervisors did handle the primary port failures as expected, we have also spread their primary ports across several switches, so a single switch failure no longer affects every hypervisor's primary path at once.