BOL23 – SAN controller failure – 20/Jul/2022

At approx 9:37pm tonight one of our SAN devices in BOL23 (EQL-SAS-4) reported that its primary controller was not responding.

The controller suffered a kernel panic, which caused it to reboot. The failover controller took over once it had confirmed that the primary controller was not responding.

This may have caused some virtual machines to experience disk timeouts. These should be handled gracefully (depending on the drivers within the OS, and whether VMtools is installed) but may result in errors being written to event logs.

The timeouts occur because EqualLogic SANs work in active/passive failover mode, which requires a few seconds to fail over fully.
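
For anyone who wants to check how long a Linux guest will wait on a stalled disk before reporting an error, the following is a minimal sketch. It assumes the guest exposes per-disk SCSI timeouts under /sys/block/<device>/device/timeout; the function names and the 60-second threshold are illustrative choices, not values we recommend or a tool we run.

```python
from pathlib import Path


def read_disk_timeouts(sys_block: Path = Path("/sys/block")) -> dict[str, int]:
    """Return {device_name: timeout_seconds} for disks that expose a SCSI timeout.

    Assumption: a Linux guest with sysfs mounted. Devices without a
    device/timeout file (for example loop devices) are skipped.
    """
    timeouts: dict[str, int] = {}
    if not sys_block.is_dir():
        return timeouts
    for dev in sorted(sys_block.iterdir()):
        timeout_file = dev / "device" / "timeout"
        try:
            timeouts[dev.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            continue  # no timeout file, or its content is not a number
    return timeouts


def below_threshold(timeouts: dict[str, int], minimum_seconds: int) -> list[str]:
    """Return the names of devices whose timeout is shorter than minimum_seconds."""
    return sorted(name for name, secs in timeouts.items() if secs < minimum_seconds)


if __name__ == "__main__":
    found = read_disk_timeouts()
    for name, secs in found.items():
        print(f"{name}: {secs}s")
    # 60 seconds is a hypothetical threshold used only for illustration.
    short = below_threshold(found, minimum_seconds=60)
    if short:
        print("Disks with a short timeout:", ", ".join(short))
```

A disk with a timeout shorter than the failover window is more likely to log I/O errors during an event like this one; the right value depends on your OS and drivers.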

The original primary controller has recovered but is now reporting an NVRAM cache battery failure, so the battery will need to be replaced.

We will schedule this replacement. Because this controller became the backup module as a result of the failover, the work should not be disruptive; however, the SAN will have no failover capability while the replacement is carried out.

One Response to BOL23 – SAN controller failure – 20/Jul/2022

  1. Dan A says:

    Around 30 minutes after the failover occurred, the failover controller also reported a capacitor failure, which pushed the SAN into write-through mode (which is slow, because writes are no longer cached).

    We have replaced the active controller’s capacitor board with a known-good board and triggered the failover back to the active controller.

    Unfortunately, we keep only a single spare in stock, so we will order replacements tomorrow.

    Apologies for the double disruption this evening if your VM is hosted on this SAN.
