Token Functions
Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Token functions are not limited to string input since they work with generic JSON objects and documents.
If any arguments to any of the following functions are MISSING then the result is also MISSING (i.e.
no result is returned).
Similarly, if any of the arguments passed to the functions are NULL or are of the wrong type (e.g.
an integer instead of a string), then NULL is returned as the result.
|
CONTAINS_TOKEN(input_obj, token_expr [, options ])
Since Couchbase Server 4.6.2
- Description
-
Checks whether or not the specified search token
token_expr
is a sub-string of the input objectinput_obj
. - Arguments
-
- input_obj
-
Any JSON object, or any valid expression which evaluates to a string, to search within.
- token_expr
-
A token string, or any valid expression which evaluates to a string, that is being searched for.
- options
-
An optional JSON object to control tokenization. Within options:
-
names is a boolean to include object names (default: true)
-
case is either "lower" or "upper" for case folding (default: no change to the original text)
-
specials is a boolean to include strings with special characters, such as email addresses and URLs (default: false)
-
split is a boolean to split string values into words (default: true)
-
trim is a boolean to trim spaces around unsplit string values (default: true)
-
- Return Value
-
A boolean, representing whether the search expression exists within the input object.
This returns
true
if the sub-string exists within the input string, otherwisefalse
is returned. - Example 1:
-
Finding hotels with the word "Inn" within their name.
SELECT name FROM `travel-sample` WHERE type = "hotel" AND CONTAINS_TOKEN(name, "inn",{"specials":true}) LIMIT 4; Results: [ { "name": "Sportsman Inn" }, { "name": "Keefer's Inn" }, { "name": "Quality Inn King City Hotel" }, { "name": "Premier Inn, Albert Dock" } ]
CONTAINS_TOKEN_LIKE(input_obj, token_expr [, options ])
Since Couchbase Server 4.6.2
- Description
-
Checks whether or not the specified search token
token_expr
is a sub-string of the input objectinput_obj
. - Arguments
-
- input_obj
-
Any JSON object, or any valid expression which evaluates to a string, to search within.
- token_expr
-
A token string, or any valid expression which evaluates to a string, that is being searched for.
- options
-
An optional JSON object to control tokenization. Within options:
-
names is a boolean to include object names (default: true)
-
case is either "lower" or "upper" for case folding (default: no change to the original text)
-
specials is a boolean to include strings with special characters, such as email addresses and URLs (default: false)
-
split is a boolean to split string values into words (default: true)
-
trim is a boolean to trim spaces around unsplit string values (default: true)
-
- Return Value
-
A boolean, representing whether the search expression exists within the input object.
This returns
true
if the sub-string exists within the input string, otherwisefalse
is returned. - Example 2:
-
Finding email addresses of UK hotels.
SELECT email FROM `travel-sample` WHERE type = "hotel" AND CONTAINS_TOKEN_LIKE(email, "%uk",{"specials":true}) LIMIT 4; [ { "email": "glencoe@syha.org.uk" }, { "email": "owner@hillhousellanrhidian.co.uk" }, { "email": "julia@number38thegower.co.uk" }, { "email": "stay@holiday-harlech.co.uk" } ]
CONTAINS_TOKEN_REGEXP(input_obj, token_expr [, options ])
Since Couchbase Server 4.6.2
- Description
-
Checks whether or not the specified search token
token_expr
is a sub-string of the input objectinput_obj
. - Arguments
-
- input_obj
-
Any JSON object, or any valid expression which evaluates to a string, to search within.
- token_expr
-
A token string, or any valid expression which evaluates to a string, that is being searched for.
- options
-
An optional JSON object to control tokenization. Within options:
-
names is a boolean to include object names (default: true)
-
case is either "lower" or "upper" for case folding (default: no change to the original text)
-
specials is a boolean to include strings with special characters, such as email addresses and URLs (default: false)
-
split is a boolean to split string values into words (default: true)
-
trim is a boolean to trim spaces around unsplit string values (default: true)
-
- Return Value
-
A boolean, representing whether the search expression exists within the input object.
This returns
true
if the sub-string exists within the input string, otherwisefalse
is returned. - Example 3:
-
Finding hotels with the word "Inn" within their name.
SELECT name FROM `travel-sample` WHERE type = "hotel" AND CONTAINS_TOKEN_REGEXP(name, "In+.*",{"specials":true}) LIMIT 4; Results: [ { "name": "Sportsman Inn" }, { "name": "Inveraray Youth Hostel" }, { "name": "Inverness Youth Hostel" }, { "name": "Indian Cove Campground" } ]
HAS_TOKEN(input_obj, token_expr [, options ])
Since Couchbase Server 4.6.2
Alias for CONTAINS_TOKEN().
TOKENS(input_string [, options ])
- Description
-
This function tokenizes (i.e. breaks up into meaningful segments) the given input string based on specified delimiters, and other options. It recursively enumerates all tokens in a JSON value and returns an array of values (JSON atomic values) as the result.
- Arguments
-
- input_string
-
A valid JSON object, this can be anything: constant literal, simple JSON value, JSON key name or the whole document itself.
The following table lists the rules for each JSON type:
JSON Type Return Value MISSING
[]
NULL
[NULL]
false
[false]
true
[true]
number
[number]
string
SPLIT(string)
array
FLATTEN(TOKENS(element) for each element in array
(Concatenation of element tokens)
object
For each name-value pair, name+TOKENS(value)
- options
-
A JSON object indicating the options passed to the
TOKENS()
function. Options can take the following options, and each invocation ofTOKENS()
can choose one or more of the options:-
{"name": true}: Optional. Valid values are
true
orfalse
. By default, this is set to true, andTOKENS()
will include field names. You can choose to not include field names by setting this option tofalse
. -
{"case":"lower"}: Optional. Valid values are
lower
orupper
. Default is neither, that is, it returns the case of the original data. Use this option to specify the case sensitivity. -
{"specials": true}:Optional. Use this option to preserve strings with specials characters, such as email addresses, URLs, and hyphenated phone numbers. The default value is
false
.The specials
options preserves special characters except at the end of a word.
-
- Return Value
-
An array of strings containing all of the tokens obtained from the input string.
- Examples
-
SELECT TOKENS( ['jim@abc.com, kim@abc.com, http://abc.com/, 408-555-1212'], { 'specials': false }); Results: [ { "$1": [ "kim", "408", "com", "abc", "jim", "1212", "555", "http" ] } ] SELECT TOKENS(['jim@abc.com, kim@abc.com, http://abc.com/, 408-555-1212'], { 'specials': true }); Results: [ { "$1": [ "jim", "408", "1212", "jim@abc.com", "http", "abc", "408-555-1212", "kim@abc.com", "kim", "555", "http://abc.com", "com" ] } ]
In the following example,
TOKENS()
converts all of the URL data into UPPER case and also adds the full URL to the delimited words.SELECT TOKENS(url) AS defaulttoken, TOKENS(url, {"specials":true, "case":"UPPER"}) AS specialtoken FROM `travel-sample` h WHERE h.type = 'hotel' LIMIT 1; Results: [ { "defaulttoken": [ "uk", "www", "http", "yha", "org" ], "specialtoken": [ "ORG", "WWW", "HTTP://WWW.YHA.ORG.UK", "YHA", "UK", "HTTP" ] } ]
You can also use
{"case":"lower"}
or{"case":"upper"}
to have case sensitive search. Index creation and querying can use this and other parameters in combination. These parameters should be passed within the query predicates as well. The parameters and values have to match exactly for N1QL to pick up and use the index correctly. The following example shows how you can create an index and use it your application.CREATE INDEX idx_url_upper_special on `travel-sample`( DISTINCT ARRAY v for v in tokens(url, {"specials":true, "case":"UPPER"}) END ) where type = 'hotel';
SELECT name, address, url FROM `travel-sample` h WHERE ANY v in tokens(url, {"specials":true, "case":"UPPER"}) SATISFIES v = "HTTP://WWW.YHA.ORG.UK" END AND h.type = 'hotel' ; Results: [ { "address": "Capstone Road, ME7 3JE", "name": "Medway Youth Hostel", "url": "http://www.yha.org.uk" } ]