NGram Search

The NGram Index used in SARIT was built to solve a number of specific problems in a database of Chinese text and was not designed to support a full-fledged search system. It is, however, well suited to searching Sanskrit text.

Using the NGram Index is simple: the string you enter as your search term is matched wherever it occurs in the database. Some Regular Expression syntax can be used, though the NGram Index does not implement Regular Expression syntax in full.

Search with the NGram Index will be developed further to allow, for example, several search expressions to be combined in a single search using boolean logic.

In the following, quotation marks serve only to mark off the characters under discussion and should not be typed as part of the search expression.

Where a character is referred to, note that any combining mark that follows a base character is a character in its own right, so that what appears on screen as a single graph may consist of two (or more) "characters".
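The point about combining marks can be illustrated with a minimal Python sketch. It assumes that a "character" corresponds to a Unicode code point; whether the SARIT texts store a given letter in precomposed or decomposed form is not stated here, and the helper name count_characters is purely illustrative.

```python
import unicodedata

# The same visible letter, a with macron, in two encodings:
# one precomposed code point (NFC) versus base letter plus combining macron (NFD).
composed = unicodedata.normalize("NFC", "a\u0304")
decomposed = unicodedata.normalize("NFD", composed)

def count_characters(text: str) -> int:
    """Count Unicode code points, the unit assumed here for "character"."""
    return len(text)

print(count_characters(composed))    # 1
print(count_characters(decomposed))  # 2
```

If the text is stored in decomposed form, a single "." would match only the base letter and a second "." would be needed for the combining mark.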

Quantifiers

A full stop, ".", not followed by a quantifier, matches a single arbitrary character, including a space.

A full stop, ".", immediately followed by a single question mark, "?", matches either no characters or one character.

A full stop, ".", immediately followed by a single asterisk, "*", matches zero or more characters.

A full stop, ".", immediately followed by a single plus sign, "+", matches one or more characters.

These quantifiers only work with "." and cannot follow other characters. They cannot be combined to make "non-greedy" quantifiers, a concept which does not apply to searches in a database.

A full stop, ".", immediately followed by a sequence of characters that matches the regular expression {[0-9]+,[0-9]+}, matches a bounded number of characters: no fewer than the number given by the digits before the comma and no more than the number given by the digits after the comma. For example, ".{2,4}" matches two, three or four characters. The second number cannot be omitted.
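The quantifiers described above behave much like their counterparts in ordinary regular expressions. The following sketch uses Python's re module as a stand-in, on the assumption that its handling of ".", ".?", ".*", ".+" and ".{m,n}" agrees with the NGram Index for these simple cases; it is not the NGram Index itself, and the sample content and the helper name found are hypothetical.

```python
import re

def found(pattern: str, content: str) -> bool:
    """Return True if the pattern occurs anywhere in the content."""
    return re.search(pattern, content) is not None

content = "dharma karma"  # hypothetical element content

print(found("dhar.a", content))            # True: "." stands for "m"
print(found("dharm.?a", content))          # True: ".?" matches no character
print(found("d.*k", content))              # True: ".*" spans "harma "
print(found("dharma.+karma", content))     # True: ".+" matches the space
print(found("dharma.{1,3}arma", content))  # True: two characters, " k", lie between
print(found("dharma.{3,5}arma", content))  # False: the gap is only two characters
```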

Character Sets

An expression "[…]" matches a single character, namely any of the characters enclosed by the brackets.

The string enclosed by the brackets cannot be empty; a "]" that immediately follows the opening bracket is therefore treated as a member of the set rather than as the closing bracket. Thus, "[][?]" matches any one of the three characters "[", "]" and "?".

Character sets can be used to search for variant renderings of a word, e.g. with or without an accent.
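As a rough illustration of character sets, the sketch below again uses Python's re module as a stand-in for the NGram Index. It assumes that the accented vowel is stored as a single precomposed character (U+0101); if the text used a base letter plus a combining mark, the set would match only the base letter. The sample words and the helper name found are hypothetical.

```python
import re

def found(pattern: str, content: str) -> bool:
    """Return True if the pattern occurs anywhere in the content."""
    return re.search(pattern, content) is not None

# Match the word with either a plain "a" or a precomposed a with macron.
variant = "[a\u0101]tman"

print(found(variant, "atman"))        # True
print(found(variant, "\u0101tman"))   # True
print(found(variant, "itman"))        # False

# "]" placed first inside the brackets is an ordinary member of the set.
print(found("[][?]", "a]b"))          # True
print(found("[][?]", "abc"))          # False
```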

Anchoring

A circumflex accent, "^", at the start of the search string matches the start of the element content. A dollar sign, "$", at the end of the search string matches the end of the element content.
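Anchoring can be sketched in the same way. The example treats each string in a list as the content of one element and uses Python's re module as a stand-in for the NGram Index; the element contents and the function name search are hypothetical.

```python
import re

# Each string stands for the text content of one element (hypothetical data).
elements = ["dharma karma", "karma dharma"]

def search(pattern: str, contents: list[str]) -> list[str]:
    """Return the contents in which the pattern is found."""
    return [c for c in contents if re.search(pattern, c)]

print(search("dharma", elements))   # ['dharma karma', 'karma dharma']
print(search("^dharma", elements))  # ['dharma karma']
print(search("dharma$", elements))  # ['karma dharma']
```

Without an anchor the term is found in both elements; with "^" or "$" only the element that begins or ends with the term is returned.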

Escaping

One can remove the special meaning of any character mentioned above by preceding it with a backslash. Between brackets these characters stand for themselves. Thus, "[[?*\]" matches any one of the four characters "[", "?", "*" and "\".
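When a search term should be taken entirely literally, the backslash rule can be applied mechanically. The helper below is a hypothetical sketch, not part of SARIT: it assumes the position-dependent rules of this section, namely that ".", "[" and the backslash are always special outside brackets, that "^" and "$" are special only at the very start and end of the search string, and that quantifier characters are special only directly after an unescaped ".".

```python
def escape_literal(term: str) -> str:
    """Backslash-escape a term so that each character is searched for literally."""
    out = []
    last = len(term) - 1
    for i, ch in enumerate(term):
        needs_escape = (
            ch in ".[\\"                  # always special outside brackets
            or (ch == "^" and i == 0)     # anchor only at the very start
            or (ch == "$" and i == last)  # anchor only at the very end
        )
        out.append("\\" + ch if needs_escape else ch)
    return "".join(out)

print(escape_literal("a.b"))    # a\.b
print(escape_literal("^a$"))    # \^a\$
print(escape_literal("a^b$c"))  # a^b$c
```

Once the "." is escaped, a following "?", "*" or "+" no longer acts as a quantifier, so those characters are left unescaped.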

"?", "*", "+" and character sequences matching the regular expression {[0-9]+,[0-9]+} not immediately preceded by an unescaped period, ".", stand for themselves. "^" and "$" not at the very beginning or end of the search string, respectively, stand for themselves.