Automatic Failover

A node or group can be failed over automatically, when it either becomes unavailable or experiences continuous disk-access problems.

Automatic Failover Overview

Automatic Failover — or auto-failover — can be configured to fail over a node or group automatically: no immediate administrator intervention is required. Specifically, the Cluster Manager autonomously detects and verifies that the node or group is unavailable, and then initiates the hard failover process. Auto-failover does not fix or identify problems that may have occurred. Once appropriate fixes have been applied to the cluster by the administrator, a rebalance is required. Automatic failover is always hard failover.

Failover Events

Auto-failover occurs in response to failover events. Events are of three kinds:

Node failure. A server-node within the cluster is unavailable (due to a network failure, out-of-memory problem, or other node-specific issue).
Disk read/write failure. Attempts to read from or write to disk on a particular node have resulted in a significant rate of failure, for longer than a specified time-period. The node is removed by auto-failover, even though the node continues to be contactable.
Group failure. An administrator-defined group of server-nodes within the cluster is unavailable (perhaps due to a network or power failure that has affected an individual, physical rack of machines, or a specific subnet).

Auto-Failover Constraints

Couchbase Server applies the following constraints to auto-failover’s handling of events:

Auto-failover occurs only on the occurrence of one event at a time. If multiple events occur concurrently, or if a second event follows the first prior to auto-failover’s completion, auto-failover is not triggered.
Auto-failover occurs sequentially up to an administrator-specified maximum number of events. After this maximum number of auto-failovers has been triggered, no further auto-failover occurs, until the count is manually reset by the administrator.
Auto-failover is never triggered by Couchbase Server when data-loss might result. Therefore, even a single event may not be responded to; and an administrator-specified maximum number of events may not be reached.
Auto-failover is never triggered when the resulting number of available nodes within the cluster would not constitute a majority of the number available before the problem-occurrence. For example, if one of three nodes or groups were to go offline, auto-failover could be triggered; but if one of two nodes or groups were to go offline, auto-failover could not be triggered. (This prevents the situation where functioning nodes or groups, finding they cannot intercommunicate, fail each other over, and then attempt to continue independently of one another, each representing itself as the cluster.)

Auto-failover can be enabled only when the cluster contains a minimum of three nodes; and should be configured for groups only when a minimum of three groups have been defined. Additionally, auto-failover should be configured only when the cluster contains sufficient resources to handle all possible results: workload-intensity on remaining nodes may increase significantly. Auto-failover is for intra-cluster use only: it does not work with Cross Datacenter Replication.

Auto-failover may take significantly longer if the non-available node is that on which the orchestrator is running, since time-outs must occur before available nodes can elect a new orchestrator-node and thereby continue.

See Email Alerts, for details on configuring email alerts related to failover.

See Server Group Awareness, for information on server groups and group failover.

Configuring Auto-Failover

Auto-failover is configured by means of parameters that include the following.

Timeout. The number of seconds that must elapse, after a node or group has become unavailable, before auto-failover is triggered. The default is 120.
Maximum count. The maximum number of failover events that can occur sequentially and be handled by auto-failover. The maximum-allowed value is 3, the default is 1. This parameter is available in Enterprise Edition only: in Community Edition, the maximum number of failover events that can occur sequentially and be handled by auto-failover is always 1.
Count. The number of failover events that have occurred. The default value is 0. The value is incremented by 1 for every automatic-failover event that occurs, up to the defined maximum count: beyond this point, no further automatic failover can be triggered until the count is reset to 0 through administrator-intervention.
Enablement of disk-related automatic failover; with corresponding time-period. Whether automatic failover is enabled to handle continuous read-write failures. If it is enabled, a number of seconds can also be specified: this is the length of a constantly recurring time-period against which failure-continuity on a particular node is evaluated. If at least 60% of the most recently elapsed instance of the time-period has consisted of continuous failure, failover is automatically triggered. The default value for the enablement of disk-related automatic failover is false. The default value for the corresponding time-period is 120. This parameter is available in Enterprise Edition only.
Group failover enablement. Whether or not groups should be failed over. A group failover is considered to be a single event, even if many nodes are included in the group. The default value is false. This parameter is available in Enterprise Edition only.

For more detailed information, see the documentation provided for specifying Node Availability with Couchbase Web Console UI, for Managing Auto-Failover with the REST API, and cli:cbcli/couchbase-cli-setting-autofailover.adoc with the CLI.