Failover

Failover is the process whereby a node can be removed from a Couchbase cluster.

Failover Types

There are two basic types of failover: graceful and hard.

  • Graceful: The ability to remove a Data Service node from the cluster proactively, in an orderly and controlled fashion. This is an online operation, requiring zero downtime. When the node to be removed contains the Data Service, the process promotes replica vBuckets on the remaining cluster-nodes to active status, and the active vBuckets on the affected node to dead. Throughout the process, the cluster maintains all 1024 active vBuckets for each bucket.

    Graceful failover can only be used on nodes that run the Data Service. If controlled removal of a non-Data Service node is required, removal should be used.

  • Hard: The ability to drop a node from the cluster reactively, because the node has become unavailable or unstable. When the lost node was running the Data Service, since active vBuckets have been lost, the process promotes replica vBuckets on the remaining cluster-nodes to active status, until 1024 active vBuckets again exist for each bucket.

    Note that if hard failover is applied to an available node running the Data Service, ongoing writes and replications may be interrupted: therefore, in such circumstances, it may be better to Remove a Node and Rebalance.

Graceful failover must be manually initiated. Hard failover can be manually initiated. Hard failover can also be initiated automatically by Couchbase Server: this is known as automatic failover. The Cluster Manager detects the unavailability of a node, and duly initiates a hard failover, without administrator intervention.

Note that when a node is flagged for failover (as opposed to removal), replica vBuckets are lost when rebalance occurs, following node-removal. By contrast, removal creates new copies of replica vBuckets that would otherwise be lost (thereby creating greater competition for the memory resources of the remaining nodes).

Detecting Node-Failure

Hard failover is performed after a node has failed. It can be initiated either by administrative intervention, or through automatic failover.

When automatic failover is used, the Cluster Manager handles both the detection of failure, and the initiation of hard failover, without administrative intervention: however, the Cluster Manager does not identify the cause of failure. Following failover, administrator-intervention is required, to identify and fix problems, and to initiate rebalance, whereby the cluster is returned to a healthy state.

If manual failover is to be used, administrative intervention is required to detect that a failure has occurred. This can be achieved either by assigning an administrator to monitor the cluster; or by creating an externally based monitoring system that uses the Couchbase REST API to monitor the cluster, detect problems, and either provide notifications, or itself trigger failover. Such a system might be designed to take into account system or network components beyond the scope of Couchbase Server.

Node Removal

Node removal uses the rebalance process to remove a node from a cluster in a controlled fashion. It creates on the remaining nodes new copies of replica vBucket that would otherwise be lost when the selected node is taken offline. See Remove a Node and Rebalance.