Index Partitioning
Index Partitioning enables you to increase aggregate query performance by dividing and spreading a large index of documents across multiple nodes, horizontally scaling out an index as needed. The system partitions the index across a number of index nodes using a hash partitioning strategy in a way that is transparent to queries.
(Introduced in Couchbase Server 5.5 Enterprise Edition)
Benefits of a partitioned index include:
-
The ability to scale out horizontally as the index size increases.
-
Transparency to queries, requiring no change to existing queries.
-
Reduction of query latency for large aggregated queries since each partition can be scanned in parallel.
-
Provides a low-latency range query while allowing indexes to be scaled out as needed.
Create Partitioned Index

CREATE INDEX idx_name ON bucket_name ( field_name [ , field_name2 ] * ) PARTITION BY HASH partition_key_expr [ WITH { with_expr [ , with_expr2 ] * } ];
Arguments
idx_name
-
Identifier representing the name of your index. The partitioned index name must unique within a bucket.
bucket_name
-
String representing the bucket to be indexed.
If there is a hyphen (-) inside the bucket name, backticks (` `) are needed around the bucket name. ( field_name )
-
String of one or more field names (comma-separated) to be included in the index.
The field name(s) must be enclosed inside parentheses. ( partition_key_expr )
-
One or more fields or an expression of one or more fields representing the partition keys.
This expression must be enclosed inside parentheses. WITH { }
-
[Optional] The reserved word to be used with one or more of the following arguments.
with_expr
-
Required if
WITH {}
is present. The with_expr must contain at least one of the following arguments."num_partition": partition_int
-
The integer partition_int defines the number of partitions to divide into. The default value is 8. For details and examples, see Number of Partitions.
"nodes": ["###.###.###.###:####", ...]
-
The node list to restrict the set of nodes available for placement, separated by commas. For details and examples, see Partition Placement.
"defer_build": true | false
-
When
defer_build
is set to true, the index creation operation queues the task for building the index but immediately pauses the building of the index of type GSI. Index building requires an expensive scan operation and deferring building of an index with multiple indexes can optimize the expensive scan operation. Administrators can defer building multiple indexes and use the BUILD INDEX statement to build multiple indexes efficiently with one efficient scan of the bucket data. Whendefer_build
is set to false, the CREATE INDEX operation queues the task for building the index and immediately kicks off the building of the index of type GSI. "num_replica": num_replica_num
-
The number of replicas of the partitioned index to create. The index service will automatically distribute these indexes amongst the index nodes in the cluster for load balancing and high availability purposes. When creating an index with replicas in this manner, the index service attempts to distribute the replicas based on the server groups in use in the cluster where possible. If num_replica_num is not less than the number of index nodes in the cluster, then the index creation will fail.
"secKeySize": sec_int
-
The average length of the combined index keys. For details and examples, see Sizing Hints.
"docKeySize": doc_int
-
The average length of the document key. For details and examples, see Sizing Hints.
"arrSize": arr_int
-
The average length of the aray fields. For details and examples, see Sizing Hints.
Partition Keys
Partition keys are made up of one or more terms, with each term being the document key, a document field, or an expression of document key or field. The partition keys are hashed to generate a partition ID for each document. The partition ID is then used to identify the partition in which the document’s index keys would reside.
The partition keys should be immutable, that is, its values shouldn’t change once the document is created. For example, the 'travel-sample' field type almost never changes and is therefore a good candidate for partition key. If the partition keys have changed, then the corresponding document should be deleted and recreated with the new partition keys.
Each term in the partition keys can be any JSON data type: number, string, boolean, array, object, or NULL.
If a term in the partition keys is missing in the document, the term will have a N1QL MISSING value.
Partition keys do not support N1QL array expressions, e.g. ARRAY
... FOR
... IN
.
The following table lists some examples of partition keys.
Partition Type | Example |
---|---|
The document key. |
CREATE INDEX idx ON `travel-sample`(country, airline, id) PARTITION BY HASH(META().id); |
Any single or multiple immutable field in the defined index. |
CREATE INDEX idx ON `travel-sample`(sourceairport,destinationairport, stops, airline, id) PARTITION BY HASH(sourceairport,destinationairport); |
Any single or multiple immutable non-leading field in the defined index. |
CREATE INDEX idx ON `travel-sample`(airline, sourceairport, destinationairport, stops, id) PARTITION BY HASH(sourceairport, destinationairport); |
Any single or multiple immutable document field not defined in the index. |
CREATE INDEX idx ON `travel-sample` (sourceairport, stops, airline, id) PARTITION BY HASH (sourceairport, destinationairport) |
A function on the index fields, such as |
CREATE INDEX idx ON `travel-sample`(LOWER(sourceairport), LOWER(destinationairport), stops, airline, id) PARTITION BY HASH(LOWER(sourceairport), LOWER(destinationairport)); |
A complex expression on the index fields combining functions and operators. |
CREATE INDEX idx ON `travel-sample`(POSITION(meta().id,'__')+2, destinationairport, sourceairport, stops, airline, id) PARTITION BY HASH(POSITION(meta().id,'__')+2)); |
Using Document Keys as Partition Key
The simplest way to create a partitioned index is to use the document key as the partition key.
CREATE INDEX idx_pe1 ON `travel-sample`(country, airline, id) PARTITION BY HASH(META().id); SELECT airline, id FROM `travel-sample` WHERE country="United States" ORDER BY airline;
With meta().id
as the partition key, the index keys are evenly distributed among all the partitions.
Every query will gather the qualifying index keys from all the partitions.
Choosing Partition Keys for Range Query
An application has the option to choose the partition key that can minimize latency on a range query for a partitioned index.
For example, let’s say a query has an equality predicate based on the field sourceairport
and destinationairport
.
If the index is also partitioned by the index keys on sourceairport
and destinationairport
, then the query will only need to read a single partition for the given pair of sourceairport
and destinationairport
.
In this case, the application can maintain a low query latency while allowing the partitioned index to scale out as needed.
# Lookup all airlines with non-stop flights from SFO to JFK CREATE INDEX idx_pe2 ON `travel-sample` (sourceairport, destinationairport, stops, airline, id) PARTITION BY HASH (sourceairport, destinationairport); SELECT airline, id FROM `travel-sample` WHERE sourceairport="SFO" AND destinationairport="JFK" AND stops == 0 ORDER BY airline;
The partition keys do not have to be the leading index keys in order to select qualifying partitions.
As long as the leading index keys are provided along with the partition keys in the predicate, the query engine can still select the qualifying partitions for index scan.
The following example scans a single partition with a given pair of sourceairport
and destinationairport
.
# Lookup all non-stop flights from SFO to JFK for the given airlines CREATE INDEX idx_pe3 ON `travel-sample` (airline, sourceairport, destinationairport, stops, id) PARTITION BY HASH (sourceairport, destinationairport); SELECT airline, id FROM `travel-sample` WHERE airline in ["UA", "AA"] AND sourceairport="SFO" AND destinationairport="JFK" AND stops == 0 ORDER BY airline;
If the partition keys are based on a N1QL expression, then the query predicate should use the same expression for selecting qualifying partitions.
# Case-insensitive lookup for all airlines with non-stop flights from SFO to JFK CREATE INDEX idx_pe4 ON `travel-sample` (LOWER(sourceairport), LOWER(destinationairport), stops, airline, id) PARTITION BY HASH (LOWER(sourceairport), LOWER(destinationairport)) SELECT airline, id FROM `travel-sample` WHERE LOWER(sourceairport)="sfo" AND LOWER(destinationairport)="jfk" AND stops == 0 ORDER BY airline
As with equality predicate in the previous examples, the query engine can select qualifying partitions using an IN clause with matching partitioned keys.
The following example scans at most three partitions with sourceairport "SFO"
, "SJC"
, or "OAK"
.
# Lookup for all airlines with non-stop flights from SFO, SJC, or OAK to JFK CREATE INDEX idx_pe5 ON `travel-sample` (sourceairport, destinationairport, stops, airline, id) PARTITION BY HASH (sourceairport, destinationairport); SELECT airline, id FROM `travel-sample` WHERE sourceairport in ["SFO", "SJC", "OAK"] AND destinationairport="JFK" AND stops == 0 ORDER BY airline;
As shown in the previous examples, in order to allow the query engine to select qualifying partitions, the partition keys must be present as an equality predicate in the query.
The following query only has an equality predicate on sourceairport
and hence will not be able to select the qualifying partitions without destinationairport
.
Consequently, this query will gather qualifying index keys from all partitions.
# Lookup all airlines with non-stop flights from SFO CREATE INDEX idx_pe6 ON `travel-sample` (sourceairport, destinationairport, stops, airline, id) PARTITION BY HASH (sourceairport, destinationairport); SELECT airline, id FROM `travel-sample` WHERE sourceairport="SFO" AND stops == 0 ORDER BY airline;
Similarly, the following query gathers qualifying index keys from all partitions as destinationairport IS NOT MISSING
is not an equality predicate.
# Lookup all airlines with non-stop flights from SFO CREATE INDEX idx_pe7 ON `travel-sample` (sourceairport, destinationairport, stops, airline, id) PARTITION BY HASH (sourceairport, destinationairport); SELECT airline, id FROM `travel-sample` WHERE sourceairport="SFO" AND destinationport is not missing AND stops == 0 ORDER BY airline;
For the query engine to select qualifying partitions, the partition keys must also be a part of the index keys.
The following index always gathers keys from all partitions as destinationairport
is not an index key.
# Lookup all airlines with flights from SFO to JFK CREATE INDEX idx_pe8 ON `travel-sample` (sourceairport, stops, airline, id) PARTITION BY HASH (sourceairport, destinationairport); SELECT airline, id FROM `travel-sample` WHERE sourceairport="SFO" AND destinationairport="JFK" ORDER BY airline;
When choosing partition keys other than the document key, the size of each partition can potentially be subjected to data skew of the chosen partition keys. For example, for the index in the following example, the partitions containing the major airlines would have more entries since more index keys would end up hashing into the same partition.
CREATE INDEX idx ON `travel-sample`(airline, destinationairport, sourceairport) PARTITION BY HASH(airline);
During index rebalancing, the rebalancer takes into account the data skew among the partitions using runtime statistics. It tries to even out resource utilization across the index service nodes by moving the partitions across the nodes when possible.
Choosing Partition Keys for Aggregate Query
As with a range query, when an index is partitioned by document key, an aggregate query can gather the qualifying index keys from all the partitions before performing aggregation in the query engine. Whenever aggregate pushdown optimization is allowed, the query engine will push down "partial aggregate" calculation to each partition. The query engine then computes the final aggregate result from the partial aggregates across all the partitions.
# Find number of fights out of SFO for every destination across all airlines CREATE INDEX idx_pe9 ON `travel-sample` (sourceairport, destinationairport, stops, airline, id, ARRAY_COUNT(schedule)) PARTITION BY HASH (meta().id) where type="route"; SELECT sourceairport, destinationairport, SUM(ARRAY_COUNT(schedule)) FROM `travel-sample` WHERE sourceairport = "SFO" AND type = "route" GROUP BY sourceairport, destinationairport;
The choice of partition keys can also improve aggregate query performance when the query engine can push down the "full aggregate" calculation to the index node. In this case, the query engine does not have to recompute the final aggregate result from the index nodes. In addition, certain pushdown optimizations can only be enabled when a full aggregate result is expected from the index node. To enable a full aggregate computation, the index must be created with the following requirements:
-
The expressions in the GROUP BY clause must match the partition keys.
-
The expressions in the GROUP BY clause must match the leading index keys.
-
The partition keys must match the leading index keys.
# Find number of fights out of SFO for every destination across all airlines CREATE INDEX idx_pe10 ON `travel-sample` (sourceairport, destinationairport, stops, airline, id, ARRAY_COUNT(schedule)) PARTITION BY HASH (sourceairport, destinationairport) where type="route"; SELECT sourceairport, destinationairport, SUM(ARRAY_COUNT(schedule)) FROM `travel-sample` WHERE sourceairport = "SFO" AND type = "route" GROUP BY sourceairport, destinationairport;
Number of Partitions
The number of index partitions is fixed when the index is created. By default, each index will have 8 partitions. The Administrator can override the number of partitions at index creation time.
CREATE INDEX idx_pe11 ON `travel-sample`(airline, sourceairport, destinationairport) PARTITION BY HASH(airline) WITH {"num_partition":16};
Partition Placement
When a partitioned index is created, the partitions are created across available index nodes. During placement of the new index, the index service assumes that each partition has an equal size and places the partitions according to the availability of resources on each node. For example, if an index node has more available free memory than the other nodes, it will assign more partitions to this index node. If the index has a replica, then the replica partition will not be placed onto the same node.
Alternatively, you can specify the node list to restrict the set of nodes available for placement by using a command similar to the following example.
CREATE INDEX idx_pe12 ON `travel-sample`(airline, sourceairport, destinationairport) PARTITION BY KEY(airline) WITH {"nodes":["127.0.0.1:9001", "127.0.0.1:9002"]};
You can optionally provide sizing hints too. Given the sizing hints, the planner uses a formula to estimate the memory and CPU usage of the index. Based on the estimated memory and CPU usage, the planner tries to place the partitions according to the free resources available to each index node.
Optional Sizing Hint | Description | Example |
---|---|---|
secKeySize |
The average length of the combined index keys |
20 |
docKeySize |
The average length of the document key meta().id |
20 |
arrSize |
The average length of the array field. Non-array fields will be ignored. |
10 |
To provide sizing estimation, you can use a command similar to the following examples.
CREATE INDEX idx_pe13 ON `travel-sample`(airline, sourceairport, destinationairport) PARTITION BY HASH (airline) WITH {"secKeySize":20, "docKeySize":20};
CREATE INDEX idx_pe14 ON `travel-sample`(airline, sourceairport, schedule) PARTITION BY HASH (airline) WITH {"secKeySize":20, "docKeySize":20, "arrSize": 100};
Partition Replica
A partitioned index can be created with multiple replicas to ensure indexes are online despite node failure. if there are multiple server groups in a cluster, replica partitions will be spread out to each server group whenever possible. If one of the server groups is offline, the remaining replica partitions will be available to serve all queries. Every index replica is available to serve queries. Therefore, index replicas can also be used to load rebalancing of query requests.
CREATE INDEX idx_pe15 ON `travel-sample`(airline, sourceairport, schedule) PARTITION BY HASH (airline) WITH {"num_replica":2};
When an index node fails, any in-flight query requests (serviced by the failed node) will fail since the partial results are already being processed. Any new query requests requiring the lost partition are then serviced by the partitions in the replica.
Rebalancing
When new index nodes are added or removed from the cluster, the rebalance operation attempts to move the index partitions across available index nodes in order to balance resource consumptions. At the time of rebalancing, the rebalance operation gathers statistics from each index. These statistics are fed to an optimization algorithm to determine the possible placement of each partition in order to minimize the variation of resource consumption across index nodes.
The rebalancer will only attempt to balance resource consumption on a best try basis. There are situations where the resource consumption cannot be fully balanced. For example:
-
The index service will not try to move the index if the cost to move an index across nodes is too high.
-
A cluster has a mix of non-partitioned indexes and partitioned indexes.
-
There is data skew in the partitions.
Repairing Failed Partitions
When an index node fails, the index partitions on that node will be lost. The lost partitions can be recovered or repaired when:
-
The failed node is delta-recovered.
-
The failed node is rebalanced out of the cluster. The lost partitions on that node can be repaired/rebuilt in other index nodes whenever possible. The lost partitions cannot be repaired when the number of remaining nodes is less than or equal to the number of index replicas.
Performance Considerations
Max_parallelism
Along with aggregate pushdown optimization, an application can further enhance the aggregate query performance by computing aggregation in parallel for each partition in the index service.
This can be achieved by specifying the parameter max_parallelism
when issuing a query.
The value for max_parallelism
should match the number of partitions of the index Note than when this is enabled, the index service uses more CPU and memory since the query traffic is increased according to the value set in the parameter max_parallelism
.
OFFSET Pushdown
When there are more than one qualifying partitions involved in a range query, the query engine will not push down the OFFSET clause to the index service.
Without partition elimination, a partitioned index will have higher overhead for queries with a large OFFSET value.
Alternatively, applications can use keyset
based pagination with partitioned index to achieve good pagination query performance, detailed in this blog Database Pagination: Using OFFSET and Keyset in N1QL.
For aggregate queries, the query engine will pushdown the OFFSET clause whenever full aggregate result is expected and there is only 1 qualifying partition involved in the query.
LIMIT Pushdown
When there are more than one qualifying partitions involved in a range query, the query engine will pushdown the LIMIT clause by rewriting it to be the sum of values in the LIMIT clause and OFFSET clause.
For aggregate queries, the query engine will pushdown the LIMIT clause whenever a full aggregate result is expected. When there are more than one qualifying partitions involved in an aggregate query, the query engine will pushdown the LIMIT clause by rewriting it to be the sum of values in the LIMIT clause and OFFSET clause.