What is split-brain and how do I recover?
Last modified date:
2025-05-28
Applicable Products
- QuTS hero h5.3.0 or later
- High Availability Manager 1.0 or later
Definition and Cause
In a high-availability (HA) cluster, split-brain occurs when the two nodes lose communication with each other but both remain operational, and each node assumes the active node role. This may cause data inconsistency or corrupt the shared storage, because both nodes may attempt to take control of shared resources at the same time.
Common causes of split-brain include:
- Network disconnection between the nodes in the cluster
- Failure of the heartbeat connection
- Unstable or inconsistent network paths
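The definition above can be expressed as a simple condition. The following Python sketch is for illustration only: the `Role`, `Node`, and `is_split_brain` names are hypothetical and are not part of QuTS hero or High Availability Manager.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


@dataclass
class Node:
    # Hypothetical model of one cluster node, for illustration only.
    name: str
    role: Role
    operational: bool = True


def is_split_brain(a: Node, b: Node) -> bool:
    """Return True when both nodes are running and both claim the active role."""
    both_running = a.operational and b.operational
    both_active = a.role is Role.ACTIVE and b.role is Role.ACTIVE
    return both_running and both_active


# Healthy cluster: one active node and one passive node.
healthy = is_split_brain(Node("A", Role.ACTIVE), Node("B", Role.PASSIVE))

# After a heartbeat failure, node B also assumed the active role.
diverged = is_split_brain(Node("A", Role.ACTIVE), Node("B", Role.ACTIVE))
```

In this sketch `healthy` is `False` and `diverged` is `True`. A node that has actually failed (`operational=False`) does not count toward split-brain, which matches a normal failover.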
Solution
- Fix the network connection between the nodes.
First, check and restore the network connection between the two nodes (for example, the heartbeat connection, switches, and network settings).
Only after the connection is restored can the system proceed to verify the cluster status.
- Let the system automatically detect the split-brain status.
- Once the nodes reestablish communication, the system exchanges status information between the two nodes.
- If both nodes have assumed the active node role, the system identifies it as a split-brain condition.
- To prevent data corruption, the system stops most services (such as SMB, iSCSI) and displays an error message indicating that split-brain has occurred.
- Recover from split-brain via High Availability Manager.
- Open High Availability Manager.
- Click Recover from Split-Brain to launch the recovery wizard.
In the wizard, you can choose one of the following recovery options:
- Option 1: Preserve data on one node only
Select the node to keep, and the other node will be wiped and reset as the passive node. The system will then resynchronize the HA cluster.
This option is suitable when you clearly know which node has the correct data and want to restore the cluster quickly.
- Option 2: Preserve data on both nodes
If both nodes contain important data, the system allows one node to resume services first, while the other node is removed from the cluster.
After verifying and reconciling the data, you can manually rejoin the removed node to the cluster.
- Optional: Minimize future split-brain by enabling a quorum server.
If the nodes disconnect from each other but remain connected to the network, a quorum server can still monitor the individual nodes and relay their statuses to each other. This helps reduce the chance of split-brain.
You can configure a quorum server by going to High Availability Manager > Settings > Failover Policy > Quorum Server.
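The automatic detection step and the role of a quorum server can be illustrated with a short Python sketch. This is a simplified model and not the actual High Availability Manager logic. The function names and the `PROTECTED_SERVICES` list are hypothetical. The takeover rule used when no quorum information is available is also an assumption made for this example.

```python
from typing import Optional

# Services the article names as examples of what is stopped on split-brain.
PROTECTED_SERVICES = ["SMB", "iSCSI"]


def on_reconnect(local_role: str, peer_role: str) -> list[str]:
    """Return the services to stop after the nodes exchange status.

    Hypothetical helper: if both nodes report "active", the condition is
    treated as split-brain and the protected services are stopped.
    """
    if local_role == "active" and peer_role == "active":
        return list(PROTECTED_SERVICES)
    return []


def passive_should_take_over(heartbeat_ok: bool,
                             peer_alive_per_quorum: Optional[bool]) -> bool:
    """Decide whether the passive node should assume the active role.

    peer_alive_per_quorum is None when no quorum server is configured
    or the quorum server cannot be reached.
    """
    if heartbeat_ok:
        return False  # the peer is reachable directly, so nothing changes
    if peer_alive_per_quorum is True:
        return False  # the quorum server reports the active node is running
    # Assumption for this sketch: the peer is down or its status is unknown.
    return True
```

For example, `passive_should_take_over(False, None)` returns `True`, which is the case that can lead to split-brain when only the heartbeat link has failed. With a quorum server relaying that the active node is still running, `passive_should_take_over(False, True)` returns `False`, and the passive node keeps its role.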