Index Partitioning

Index Partitioning enables you to increase aggregate query performance by dividing and spreading a large index of documents across multiple nodes, horizontally scaling out an index as needed. The system partitions the index across a number of index nodes using a hash partitioning strategy in a way that is transparent to queries.

(Introduced in Couchbase Server 5.5 Enterprise Edition)

Benefits of a partitioned index include:

The ability to scale out horizontally as the index size increases.
Transparency to queries, requiring no change to existing queries.
Reduction of query latency for large aggregated queries since each partition can be scanned in parallel.
Provides a low-latency range query while allowing indexes to be scaled out as needed.

Create Partitioned Index

Figure 1. Syntax

CREATE INDEX idx_name ON bucket_name ( field_name [ , field_name2 ] * )
    PARTITION BY HASH partition_key_expr
    [ WITH { with_expr [ , with_expr2 ] * } ];

Arguments

idx_name

Identifier representing the name of your index. The partitioned index name must unique within a bucket.

bucket_name

String representing the bucket to be indexed.

If there is a hyphen (-) inside the bucket name, backticks (` `) are needed around the bucket name.

( field_name )

String of one or more field names (comma-separated) to be included in the index.

The field name(s) must be enclosed inside parentheses.

( partition_key_expr )

One or more fields or an expression of one or more fields representing the partition keys.

This expression must be enclosed inside parentheses.

WITH { }

[Optional] The reserved word to be used with one or more of the following arguments.

with_expr

Required if WITH {} is present. The with_expr must contain at least one of the following arguments.

"num_partition": partition_int: The integer partition_int defines the number of partitions to divide into. The default value is 8. For details and examples, see Number of Partitions.
"nodes": ["###.###.###.###:####", ...]: The node list to restrict the set of nodes available for placement, separated by commas. For details and examples, see Partition Placement.
"defer_build": true | false: When defer_build is set to true, the index creation operation queues the task for building the index but immediately pauses the building of the index of type GSI. Index building requires an expensive scan operation and deferring building of an index with multiple indexes can optimize the expensive scan operation. Administrators can defer building multiple indexes and use the BUILD INDEX statement to build multiple indexes efficiently with one efficient scan of the bucket data. When defer_build is set to false, the CREATE INDEX operation queues the task for building the index and immediately kicks off the building of the index of type GSI.
"num_replica": num_replica_num: The number of replicas of the partitioned index to create. The index service will automatically distribute these indexes amongst the index nodes in the cluster for load balancing and high availability purposes. When creating an index with replicas in this manner, the index service attempts to distribute the replicas based on the server groups in use in the cluster where possible. If num_replica_num is not less than the number of index nodes in the cluster, then the index creation will fail.
"secKeySize": sec_int: The average length of the combined index keys. For details and examples, see Sizing Hints.
"docKeySize": doc_int: The average length of the document key. For details and examples, see Sizing Hints.
"arrSize": arr_int: The average length of the aray fields. For details and examples, see Sizing Hints.

Partition Keys

Partition keys are made up of one or more terms, with each term being the document key, a document field, or an expression of document key or field. The partition keys are hashed to generate a partition ID for each document. The partition ID is then used to identify the partition in which the document’s index keys would reside.

The partition keys should be immutable, that is, its values shouldn’t change once the document is created. For example, the 'travel-sample' field type almost never changes and is therefore a good candidate for partition key. If the partition keys have changed, then the corresponding document should be deleted and recreated with the new partition keys.

Each term in the partition keys can be any JSON data type: number, string, boolean, array, object, or NULL. If a term in the partition keys is missing in the document, the term will have a N1QL MISSING value. Partition keys do not support N1QL array expressions, e.g. ARRAY ... FOR ... IN.

The following table lists some examples of partition keys.

Partition Type Example

Partition Type	Example
The document key.	CREATE INDEX idx ON `travel-sample`(country, airline, id) PARTITION BY HASH(META().id);
Any single or multiple immutable field in the defined index.	CREATE INDEX idx ON `travel-sample`(sourceairport,destinationairport, stops, airline, id) PARTITION BY HASH(sourceairport,destinationairport);
Any single or multiple immutable non-leading field in the defined index.	CREATE INDEX idx ON `travel-sample`(airline, sourceairport, destinationairport, stops, id) PARTITION BY HASH(sourceairport, destinationairport);
Any single or multiple immutable document field not defined in the index.	CREATE INDEX idx ON `travel-sample` (sourceairport, stops, airline, id) PARTITION BY HASH (sourceairport, destinationairport)
A function on the index fields, such as `LOWER(), LEAST(), GREATEST(), SUBSTR()`, etc.	CREATE INDEX idx ON `travel-sample`(LOWER(sourceairport), LOWER(destinationairport), stops, airline, id) PARTITION BY HASH(LOWER(sourceairport), LOWER(destinationairport));
A complex expression on the index fields combining functions and operators.	CREATE INDEX idx ON `travel-sample`(POSITION(meta().id,'__')+2, destinationairport, sourceairport, stops, airline, id) PARTITION BY HASH(POSITION(meta().id,'__')+2));

The document key.

CREATE INDEX idx ON `travel-sample`(country, airline, id)
 PARTITION BY HASH(META().id);

Any single or multiple immutable field in the defined index.

CREATE INDEX idx ON `travel-sample`(sourceairport,destinationairport, stops, airline, id)
 PARTITION BY HASH(sourceairport,destinationairport);

Any single or multiple immutable non-leading field in the defined index.

CREATE INDEX idx ON `travel-sample`(airline, sourceairport, destinationairport, stops, id)
 PARTITION BY HASH(sourceairport, destinationairport);

Any single or multiple immutable document field not defined in the index.

CREATE INDEX idx ON `travel-sample` (sourceairport, stops, airline, id)
 PARTITION BY HASH (sourceairport, destinationairport)

A function on the index fields, such as LOWER(), LEAST(), GREATEST(), SUBSTR(), etc.

CREATE INDEX idx ON `travel-sample`(LOWER(sourceairport), LOWER(destinationairport), stops, airline, id)
 PARTITION BY HASH(LOWER(sourceairport), LOWER(destinationairport));

A complex expression on the index fields combining functions and operators.

CREATE INDEX idx ON `travel-sample`(POSITION(meta().id,'__')+2, destinationairport, sourceairport, stops, airline, id)
 PARTITION BY HASH(POSITION(meta().id,'__')+2));

Using Document Keys as Partition Key

The simplest way to create a partitioned index is to use the document key as the partition key.

Example 1. Create a partitioned index with partition key being the document key

CREATE INDEX idx_pe1 ON `travel-sample`(country, airline, id)
 PARTITION BY HASH(META().id);

SELECT airline, id
FROM `travel-sample`
WHERE country="United States"
ORDER BY airline;

With meta().id as the partition key, the index keys are evenly distributed among all the partitions. Every query will gather the qualifying index keys from all the partitions.

Choosing Partition Keys for Range Query

An application has the option to choose the partition key that can minimize latency on a range query for a partitioned index. For example, let’s say a query has an equality predicate based on the field sourceairport and destinationairport. If the index is also partitioned by the index keys on sourceairport and destinationairport, then the query will only need to read a single partition for the given pair of sourceairport and destinationairport. In this case, the application can maintain a low query latency while allowing the partitioned index to scale out as needed.

Example 2. Create a partitioned index with partition keys matching query equality predicate

# Lookup all airlines with non-stop flights from SFO to JFK
CREATE INDEX idx_pe2 ON `travel-sample` (sourceairport, destinationairport, stops, airline, id)
 PARTITION BY HASH (sourceairport, destinationairport);

SELECT airline, id
FROM `travel-sample`
WHERE sourceairport="SFO" AND
destinationairport="JFK" AND
stops == 0
ORDER BY airline;

The partition keys do not have to be the leading index keys in order to select qualifying partitions. As long as the leading index keys are provided along with the partition keys in the predicate, the query engine can still select the qualifying partitions for index scan. The following example scans a single partition with a given pair of sourceairport and destinationairport.

Example 3. Create a partitioned index with partition keys being non-leading index keys

# Lookup all non-stop flights from SFO to JFK for the given airlines
CREATE INDEX idx_pe3 ON `travel-sample` (airline, sourceairport, destinationairport, stops, id)
 PARTITION BY HASH (sourceairport, destinationairport);

SELECT airline, id
FROM `travel-sample`
WHERE airline in ["UA", "AA"] AND
sourceairport="SFO" AND
destinationairport="JFK" AND
stops == 0
ORDER BY airline;

If the partition keys are based on a N1QL expression, then the query predicate should use the same expression for selecting qualifying partitions.

Example 4. Create a partitioned index with partition keys as expressions

# Case-insensitive lookup for all airlines with non-stop flights from SFO to JFK
CREATE INDEX idx_pe4 ON `travel-sample` (LOWER(sourceairport), LOWER(destinationairport), stops, airline, id)
 PARTITION BY HASH (LOWER(sourceairport), LOWER(destinationairport))

SELECT airline, id
FROM `travel-sample`
WHERE LOWER(sourceairport)="sfo" AND
LOWER(destinationairport)="jfk" AND
stops == 0
ORDER BY airline

As with equality predicate in the previous examples, the query engine can select qualifying partitions using an IN clause with matching partitioned keys. The following example scans at most three partitions with sourceairport "SFO", "SJC", or "OAK".

Example 5. Create a partitioned index with partition keys matching query IN clause

# Lookup for all airlines with non-stop flights from SFO, SJC, or OAK to JFK
CREATE INDEX idx_pe5 ON `travel-sample` (sourceairport, destinationairport, stops, airline, id)
 PARTITION BY HASH (sourceairport, destinationairport);

SELECT airline, id
FROM `travel-sample`
WHERE sourceairport in ["SFO", "SJC", "OAK"] AND
destinationairport="JFK" AND
stops == 0
ORDER BY airline;

As shown in the previous examples, in order to allow the query engine to select qualifying partitions, the partition keys must be present as an equality predicate in the query. The following query only has an equality predicate on sourceairport and hence will not be able to select the qualifying partitions without destinationairport. Consequently, this query will gather qualifying index keys from all partitions.

Example 6. Create a partitioned index with non-matching query equality predicate

# Lookup all airlines with non-stop flights from SFO
CREATE INDEX idx_pe6 ON `travel-sample` (sourceairport, destinationairport, stops, airline, id)
 PARTITION BY HASH (sourceairport, destinationairport);

SELECT airline, id
FROM `travel-sample`
WHERE sourceairport="SFO" AND
stops == 0
ORDER BY airline;

Similarly, the following query gathers qualifying index keys from all partitions as destinationairport IS NOT MISSING is not an equality predicate.

Example 7. Create a partitioned index with query non-equality predicate

# Lookup all airlines with non-stop flights from SFO
CREATE INDEX idx_pe7 ON `travel-sample` (sourceairport, destinationairport, stops, airline, id)
 PARTITION BY HASH (sourceairport, destinationairport);

SELECT airline, id
FROM `travel-sample`
WHERE sourceairport="SFO" AND
destinationport is not missing AND
stops == 0
ORDER BY airline;

For the query engine to select qualifying partitions, the partition keys must also be a part of the index keys. The following index always gathers keys from all partitions as destinationairport is not an index key.

Example 8. Create a partitioned index with partition keys not being index keys

# Lookup all airlines with flights from SFO to JFK
CREATE INDEX idx_pe8 ON `travel-sample` (sourceairport, stops, airline, id)
 PARTITION BY HASH (sourceairport, destinationairport);

SELECT airline, id
FROM `travel-sample`
WHERE sourceairport="SFO" AND
destinationairport="JFK"
ORDER BY airline;

When choosing partition keys other than the document key, the size of each partition can potentially be subjected to data skew of the chosen partition keys. For example, for the index in the following example, the partitions containing the major airlines would have more entries since more index keys would end up hashing into the same partition.

CREATE INDEX idx ON `travel-sample`(airline, destinationairport, sourceairport)
 PARTITION BY HASH(airline);

During index rebalancing, the rebalancer takes into account the data skew among the partitions using runtime statistics. It tries to even out resource utilization across the index service nodes by moving the partitions across the nodes when possible.

Choosing Partition Keys for Aggregate Query

As with a range query, when an index is partitioned by document key, an aggregate query can gather the qualifying index keys from all the partitions before performing aggregation in the query engine. Whenever aggregate pushdown optimization is allowed, the query engine will push down "partial aggregate" calculation to each partition. The query engine then computes the final aggregate result from the partial aggregates across all the partitions.

Example 9. Create a partitioned index with partition key being document key

# Find number of fights out of SFO for every destination across all airlines
CREATE INDEX idx_pe9 ON `travel-sample` (sourceairport, destinationairport, stops, airline, id, ARRAY_COUNT(schedule))
 PARTITION BY HASH (meta().id) where type="route";

SELECT sourceairport, destinationairport, SUM(ARRAY_COUNT(schedule))
FROM `travel-sample`
WHERE sourceairport = "SFO"
AND type = "route"
GROUP BY sourceairport, destinationairport;

The choice of partition keys can also improve aggregate query performance when the query engine can push down the "full aggregate" calculation to the index node. In this case, the query engine does not have to recompute the final aggregate result from the index nodes. In addition, certain pushdown optimizations can only be enabled when a full aggregate result is expected from the index node. To enable a full aggregate computation, the index must be created with the following requirements:

The expressions in the GROUP BY clause must match the partition keys.
The expressions in the GROUP BY clause must match the leading index keys.
The partition keys must match the leading index keys.

Example 10. Create a partitioned index with the partition keys for full aggregate pushdown

# Find number of fights out of SFO for every destination across all airlines
CREATE INDEX idx_pe10 ON `travel-sample` (sourceairport, destinationairport, stops, airline, id, ARRAY_COUNT(schedule))
 PARTITION BY HASH (sourceairport, destinationairport) where type="route";

SELECT sourceairport, destinationairport, SUM(ARRAY_COUNT(schedule))
FROM `travel-sample`
WHERE sourceairport = "SFO"
AND type = "route"
GROUP BY sourceairport, destinationairport;

Number of Partitions

The number of index partitions is fixed when the index is created. By default, each index will have 8 partitions. The Administrator can override the number of partitions at index creation time.

Example 11. Create a partitioned index with 16 partitions

CREATE INDEX idx_pe11 ON `travel-sample`(airline, sourceairport, destinationairport)
 PARTITION BY HASH(airline) WITH {"num_partition":16};

Partition Placement

When a partitioned index is created, the partitions are created across available index nodes. During placement of the new index, the index service assumes that each partition has an equal size and places the partitions according to the availability of resources on each node. For example, if an index node has more available free memory than the other nodes, it will assign more partitions to this index node. If the index has a replica, then the replica partition will not be placed onto the same node.

Alternatively, you can specify the node list to restrict the set of nodes available for placement by using a command similar to the following example.

Example 12. Create a partitioned index on specific ports of a node

CREATE INDEX idx_pe12 ON `travel-sample`(airline, sourceairport, destinationairport)
 PARTITION BY KEY(airline) WITH {"nodes":["127.0.0.1:9001", "127.0.0.1:9002"]};

You can optionally provide sizing hints too. Given the sizing hints, the planner uses a formula to estimate the memory and CPU usage of the index. Based on the estimated memory and CPU usage, the planner tries to place the partitions according to the free resources available to each index node.

Table 1. Sizing Hints
Optional Sizing Hint	Description	Example
secKeySize	The average length of the combined index keys	20
docKeySize	The average length of the document key meta().id	20
arrSize	The average length of the array field. Non-array fields will be ignored.	10

To provide sizing estimation, you can use a command similar to the following examples.

Example 13. Create a partitioned index with specific key sizes

CREATE INDEX idx_pe13 ON `travel-sample`(airline, sourceairport, destinationairport)
 PARTITION BY HASH (airline) WITH {"secKeySize":20, "docKeySize":20};

Example 14. Create a partitioned index with specific key and array sizes

CREATE INDEX idx_pe14 ON `travel-sample`(airline, sourceairport, schedule)
 PARTITION BY HASH (airline) WITH {"secKeySize":20, "docKeySize":20, "arrSize": 100};

Partition Replica

A partitioned index can be created with multiple replicas to ensure indexes are online despite node failure. if there are multiple server groups in a cluster, replica partitions will be spread out to each server group whenever possible. If one of the server groups is offline, the remaining replica partitions will be available to serve all queries. Every index replica is available to serve queries. Therefore, index replicas can also be used to load rebalancing of query requests.

Example 15. Create an index with replica

CREATE INDEX idx_pe15 ON `travel-sample`(airline, sourceairport, schedule)
 PARTITION BY HASH (airline) WITH {"num_replica":2};

When an index node fails, any in-flight query requests (serviced by the failed node) will fail since the partial results are already being processed. Any new query requests requiring the lost partition are then serviced by the partitions in the replica.

Rebalancing

When new index nodes are added or removed from the cluster, the rebalance operation attempts to move the index partitions across available index nodes in order to balance resource consumptions. At the time of rebalancing, the rebalance operation gathers statistics from each index. These statistics are fed to an optimization algorithm to determine the possible placement of each partition in order to minimize the variation of resource consumption across index nodes.

The rebalancer will only attempt to balance resource consumption on a best try basis. There are situations where the resource consumption cannot be fully balanced. For example:

The index service will not try to move the index if the cost to move an index across nodes is too high.
A cluster has a mix of non-partitioned indexes and partitioned indexes.
There is data skew in the partitions.

Repairing Failed Partitions

When an index node fails, the index partitions on that node will be lost. The lost partitions can be recovered or repaired when:

The failed node is delta-recovered.
The failed node is rebalanced out of the cluster. The lost partitions on that node can be repaired/rebuilt in other index nodes whenever possible. The lost partitions cannot be repaired when the number of remaining nodes is less than or equal to the number of index replicas.

Performance Considerations

Max_parallelism

Along with aggregate pushdown optimization, an application can further enhance the aggregate query performance by computing aggregation in parallel for each partition in the index service. This can be achieved by specifying the parameter max_parallelism when issuing a query. The value for max_parallelism should match the number of partitions of the index Note than when this is enabled, the index service uses more CPU and memory since the query traffic is increased according to the value set in the parameter max_parallelism.

OFFSET Pushdown

When there are more than one qualifying partitions involved in a range query, the query engine will not push down the OFFSET clause to the index service. Without partition elimination, a partitioned index will have higher overhead for queries with a large OFFSET value. Alternatively, applications can use keyset based pagination with partitioned index to achieve good pagination query performance, detailed in this blog Database Pagination: Using OFFSET and Keyset in N1QL.

For aggregate queries, the query engine will pushdown the OFFSET clause whenever full aggregate result is expected and there is only 1 qualifying partition involved in the query.

LIMIT Pushdown

When there are more than one qualifying partitions involved in a range query, the query engine will pushdown the LIMIT clause by rewriting it to be the sum of values in the LIMIT clause and OFFSET clause.

For aggregate queries, the query engine will pushdown the LIMIT clause whenever a full aggregate result is expected. When there are more than one qualifying partitions involved in an aggregate query, the query engine will pushdown the LIMIT clause by rewriting it to be the sum of values in the LIMIT clause and OFFSET clause.

DISTINCT Aggregate Pushdown

The query engine will not pushdown distinct aggregate calculation to the index node unless full aggregate result is expected.