What is split-brain and how do I recover?

Last modified date: 2025-05-28

Applicable Products

QuTS hero h5.3.0 or later
High Availability Manager 1.0 or later

Definition and Cause

In a high-availability (HA) cluster, split-brain occurs when the two nodes lose communication with each other but both remain operational, and each assumes the active node role. This can cause data inconsistency or corrupt shared storage, because both nodes may try to take control of shared resources at the same time.

Common causes of split-brain include:

Network disconnection between the nodes in the cluster
Failure of the heartbeat connection
Unstable or inconsistent network paths

Solution

Fix the network connection between the nodes.
First, check and restore the network connection between the two nodes (for example, the heartbeat connection, switches, and network settings).
The system can verify the cluster status only after the connection is restored.
Let the system automatically detect the split-brain status.
1. Once the nodes reestablish communication, the system exchanges status information between the two nodes.
2. If both nodes have assumed the active node role, the system identifies it as a split-brain condition.
3. To prevent data corruption, the system stops most services (such as SMB and iSCSI) and displays an error message indicating that split-brain has occurred.
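
The detection check described in these steps can be sketched in Python. This is only an illustrative model, not the actual QuTS hero implementation: the names `Role`, `NodeStatus`, and `is_split_brain` are hypothetical, and the sketch assumes each node reports exactly one role once communication is restored.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    """Role a node believes it holds (hypothetical model)."""
    ACTIVE = "active"
    PASSIVE = "passive"


@dataclass
class NodeStatus:
    """Status a node reports to its peer after reconnecting (hypothetical)."""
    name: str
    role: Role


def is_split_brain(node_a: NodeStatus, node_b: NodeStatus) -> bool:
    """Return True when both nodes claim the active role."""
    return node_a.role is Role.ACTIVE and node_b.role is Role.ACTIVE


if __name__ == "__main__":
    a = NodeStatus("node-a", Role.ACTIVE)
    b = NodeStatus("node-b", Role.ACTIVE)
    if is_split_brain(a, b):
        # In the real system, this is the point where most services are
        # stopped and an error message is shown.
        print("Split-brain detected: both nodes are active")
```

In this model, a healthy cluster has one active and one passive node, so the check returns True only for the conflicting case.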
Recover from split-brain via High Availability Manager.
1. Open High Availability Manager.
2. Click Recover from Split-Brain to launch the recovery wizard.
  In the wizard, you can choose one of the following recovery options:
  - Option 1: Preserve data on one node only
    Select the node to keep, and the other node will be wiped and reset as the passive node. The system will then resynchronize the HA cluster.
    This option is suitable when you clearly know which node has the correct data and want to restore the cluster quickly.
  - Option 2: Preserve data on both nodes
    If both nodes contain important data, the system allows one node to resume services first, while the other node is removed from the cluster.
    After verifying and reconciling the data, you can manually rejoin the removed node to the cluster.
Optional: Reduce the risk of future split-brain by enabling a quorum server.
If the nodes disconnect from each other but remain connected to the network, a quorum server can still monitor the individual nodes and relay their statuses to each other. This helps reduce the chance of split-brain.
You can configure a quorum server by going to High Availability Manager > Settings > Failover Policy > Quorum Server.
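
To illustrate why a quorum server helps, the following Python sketch models how a passive node might decide whether to take over. It is a simplified assumption, not the documented failover policy of High Availability Manager: the function `should_take_over` and its parameters are hypothetical.

```python
from typing import Optional


def should_take_over(heartbeat_ok: bool, quorum_sees_peer: Optional[bool]) -> bool:
    """Decide whether a passive node should assume the active role.

    heartbeat_ok: True if the node can still reach its peer directly.
    quorum_sees_peer: True or False as relayed by the quorum server,
        or None when no quorum server is configured or reachable.
    This decision rule is an assumption made for illustration.
    """
    if heartbeat_ok:
        # The peer is reachable directly, so no failover is needed.
        return False
    if quorum_sees_peer is True:
        # The peer is still running; taking over would cause split-brain.
        return False
    # No evidence that the peer is alive: assume it has failed.
    return True


if __name__ == "__main__":
    # Without a quorum server, a lost heartbeat alone triggers takeover,
    # even if the peer is actually still active.
    print(should_take_over(heartbeat_ok=False, quorum_sees_peer=None))
    # With a quorum server reporting the peer as alive, takeover is avoided.
    print(should_take_over(heartbeat_ok=False, quorum_sees_peer=True))
```

In this model, the quorum server supplies the extra information that lets a node tell a failed peer apart from a broken link between the two nodes.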
