Applicable Products
- QuTS hero h5.3.0 or later
- High Availability Manager 1.0 or later
Definition and Cause
In a high-availability (HA) cluster, split-brain occurs when the two nodes lose communication with each other but keep running independently, and each node assumes the active node role. This can cause data inconsistency or corrupt shared storage, because both nodes may try to take control of shared resources at the same time.
Common causes of split-brain include:
- Network disconnection between the nodes in the cluster
- Failure of the heartbeat connection
- Unstable or inconsistent network paths
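To make the definition concrete, the following is a minimal Python sketch of the condition described above. It is an illustration only: the `Role`, `Node`, `on_heartbeat_lost`, and `is_split_brain` names are hypothetical and are not part of QuTS hero or High Availability Manager, and the failover rule is a simplifying assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


@dataclass
class Node:
    name: str
    role: Role

    def on_heartbeat_lost(self) -> None:
        # Assumed failover rule (hypothetical): a node that can no longer
        # hear its peer presumes the peer is down and takes the active role.
        self.role = Role.ACTIVE


def is_split_brain(a: Node, b: Node) -> bool:
    # Split-brain: both nodes are running and both claim the active role.
    return a.role is Role.ACTIVE and b.role is Role.ACTIVE


node_a = Node("node-a", Role.ACTIVE)
node_b = Node("node-b", Role.PASSIVE)

# The heartbeat connection fails, but both nodes keep running.
node_a.on_heartbeat_lost()
node_b.on_heartbeat_lost()

print(is_split_brain(node_a, node_b))  # True
```

In this sketch neither node has failed. The lost heartbeat alone leads both nodes to claim the active role, which is why restoring the connection is the first step of recovery.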
Solution
- Fix the network connection between the nodes.
First check and restore the network connection between the two nodes (for example, the heartbeat connection, switches, network settings).
Only after the connection is restored can the system proceed to verify the cluster status.
- Let the system automatically detect the split-brain status.
- Once the nodes reestablish communication, the system exchanges status information between the two nodes.
- If both nodes have assumed the active node role, the system identifies it as a split-brain condition.
- To prevent data corruption, the system stops most services (such as SMB and iSCSI) and displays an error message indicating that split-brain has occurred.
- Recover from split-brain via High Availability Manager.
- Open High Availability Manager.
- Click Recover from Split-Brain to launch the recovery wizard.
In the wizard, you can choose one of the following recovery options:
- Option 1: Preserve data on one node only
Select the node to keep, and the other node will be wiped and reset as the passive node. The system will then resynchronize the HA cluster.
This option is suitable when you clearly know which node has the correct data and want to restore the cluster quickly.
- Option 2: Preserve data on both nodes
If both nodes contain important data, the system allows one node to resume services first, while the other node is removed from the cluster.
After verifying and reconciling the data, you can manually rejoin the removed node to the cluster.
- Optional: Reduce the risk of future split-brain by enabling a quorum server.
If the nodes disconnect from each other but remain connected to the network, a quorum server can still monitor the individual nodes and relay their statuses to each other. This helps reduce the chance of split-brain.
You can configure a quorum server by going to High Availability Manager > Settings > Failover Policy > Quorum Server.
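The role of a quorum server can be illustrated with a short Python sketch. This is a simplified model under stated assumptions, not the actual High Availability Manager logic: the function name `should_take_over` and its three parameters are hypothetical, and the rule that an isolated node stays passive is an assumption made for the example.

```python
def should_take_over(
    peer_heartbeat_ok: bool,
    quorum_reachable: bool,
    quorum_sees_peer: bool,
) -> bool:
    """Decide whether a passive node should promote itself (hypothetical rule)."""
    if peer_heartbeat_ok:
        # The peer answers directly, so no failover is needed.
        return False
    if not quorum_reachable:
        # Assumption: an isolated node cannot tell a peer failure from its
        # own network fault, so it stays passive.
        return False
    # The quorum server relays the peer's status: promote only when the
    # quorum server also reports that the peer is down.
    return not quorum_sees_peer


# Heartbeat lost, but the quorum server still sees the peer: stay passive.
print(should_take_over(False, True, True))   # False
# Heartbeat lost and the quorum server also reports the peer down: take over.
print(should_take_over(False, True, False))  # True
```

The first call is the case the quorum server is meant to catch: the nodes cannot reach each other, but the relayed status shows the peer is still running, so the passive node does not promote itself.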