Lucene Search

The SARIT app offers a number of search options, which also make quite advanced searches possible. Lucene full-text search is based on the open-source search engine Apache Lucene and brings into play all the options that Lucene provides.

Search Strategies

There are searches – and searches.

  • One can search by simply filling in words in the query field and pressing Return (or clicking the magnifying glass).
  • One can use the standard Lucene search options. These consist of marking which words you would like to occur in hits and which words must (or must not) occur, of stringing together words with AND, OR and NOT, grouping words with parentheses, and so on. Quite complicated searches can be made with these options.
  • One can use Lucene wildcards - ? for any single character and * for any sequence of zero, one or more characters.
  • Even more advanced, one can use regular expressions, the most precise (and complicated) way of searching for text strings.

All of these search options can be used in combination, so one can, for example, freely mix plain, wildcard and regular expression searches.

These search strategies will be presented below, after a brief description of the way hits are displayed.

Hitlist and Search Relevance

When searching for books in a library catalogue, one can usually choose to have the hits displayed according to relevance or according to author, title or suchlike. In the SARIT app, hits from Lucene searches are displayed only according to relevance, that is, according to a "score" computed for each hit. The scoring is quite complicated in itself, but basically, the more often your search terms occur in a hit within your search scope and the less common they are in the index (that is, in the SARIT works), the higher the score the hit gets and the more prominent it is in the hitlist.
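The principle can be illustrated with a toy calculation in Python. This is only a sketch of the general idea (frequent in the passage, rare in the collection); it is not Lucene's actual scoring formula, and the function name toy_score and the sample passages are made up for the example.

```python
import math

def toy_score(query_terms, passage, all_passages):
    """Toy relevance score: term frequency weighted by rarity in the collection."""
    words = passage.lower().split()
    score = 0.0
    for term in query_terms:
        tf = words.count(term)  # how often the term occurs in this passage
        df = sum(term in p.lower().split() for p in all_passages)  # passages containing it
        if tf == 0 or df == 0:
            continue
        idf = math.log(1 + len(all_passages) / df)  # rarer terms weigh more
        score += tf * idf
    return score

# Made-up sample passages, for illustration only.
passages = [
    "fillet of a fenny snake",
    "the snake and the snake charmer",
    "in the cauldron boil and bake",
]

# Order the passages as a hitlist, highest score first.
ranked = sorted(
    passages,
    key=lambda p: toy_score(["snake", "fillet"], p, passages),
    reverse=True,
)
```

Here the first passage ranks highest because it contains both terms, one of which (fillet) is rare in the collection; the passage containing neither term scores zero.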

Standard Lucene Syntax

In the following, examples are drawn from imaginary searches in Shakespeare's works. It is intended that someone knowledgeable in Sanskrit convert the examples to something more relevant to SARIT.

If you simply fill in some words in the query field, and do not use any of the special Lucene search operators, you are saying that you would like to see as many of the words in the search scope as possible.

With Lucene standard syntax, there are two ways you can impose restrictions on the co-occurrence of search terms: you can either prefix words with + and - or use the boolean terms AND, OR and NOT (written in upper-case). In both cases, you can additionally group your search expressions using parentheses.

The option of using + and - is better suited to a search which orders hits according to score. Here you let words stand as they are (without + or -) if you would like all of them to occur in hits but also want a passage displayed as a hit if only one of them is present. If you prefix one or more words with +, they must occur in a hit, and if you prefix them with -, they must not occur in a hit. If you search for snake fillet, you get a lot of hits with either snake or fillet and perhaps some with both. If you search for snake +fillet, all your hits will contain fillet, but they may or may not contain snake. If you search for snake -fillet, you get hits with snake, but only if they do not contain fillet.

If you use AND, OR and NOT, the logic is rather different. If you search for snake AND fillet, you get hits with both snake and fillet and none with only one of them. This corresponds to +snake +fillet. If you search for snake OR fillet, this is the same as simply searching for snake fillet. If you search for snake NOT fillet, this equals searching for snake -fillet.
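The difference between plain words, + words and - words can be modelled in a few lines of Python. This is a simplified sketch that only decides whether a passage qualifies as a hit (it ignores scoring), and the function is_hit and its parameter names are made up for the illustration; they are not part of Lucene.

```python
def is_hit(passage_words, must=(), must_not=(), should=()):
    """Decide whether a passage qualifies as a hit under +/- style constraints."""
    words = set(passage_words)
    if any(term in words for term in must_not):  # -term: must not occur
        return False
    if must:                                     # +term: every one must occur
        return all(term in words for term in must)
    # No + terms at all: at least one of the plain terms has to occur.
    return any(term in words for term in should)

both = "fillet of a fenny snake".split()
snake_only = "a snake in the grass".split()
fillet_only = "a fillet of beef".split()

# snake fillet      -> plain terms only
plain = [is_hit(p, should=("snake", "fillet")) for p in (both, snake_only, fillet_only)]
# snake +fillet     -> fillet required, snake optional
plus = [is_hit(p, must=("fillet",), should=("snake",)) for p in (both, snake_only, fillet_only)]
# snake -fillet     -> snake wanted, fillet forbidden
minus = [is_hit(p, must_not=("fillet",), should=("snake",)) for p in (both, snake_only, fillet_only)]
# snake AND fillet  -> same as +snake +fillet
both_required = [is_hit(p, must=("snake", "fillet")) for p in (both, snake_only, fillet_only)]
```

In this model the optional words no longer decide whether a passage is a hit once a + word is present; in the real search they still influence the score.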

Searches can acquire higher complexity through the use of parentheses. Here the use of AND, OR and NOT may come more naturally. Say you want to find passages where the word fillet occurs but where also at least one of the words snake, deer, bird occurs. You can express this by (snake OR deer OR bird) AND fillet. An AND enforces "must occur" on both sides, so both one of the animals and the word fillet have to occur in the hits. Say (for some reason) you do not wish the words pricket and mouse to occur in your hits: you then extend your search expression with NOT (pricket OR mouse), giving (snake OR deer OR bird) AND fillet NOT (pricket OR mouse).

If you simply search for pricket OR deer AND fillet, you will (because the AND rubs off to the left) search for passages where deer and fillet must occur, while pricket is optional but will be marked as a hit where it occurs. You can enforce a certain logic on your query by grouping with parentheses.

If you search for (snake OR deer) AND fillet you are saying that one or both of snake and deer must occur, as must fillet.

If you search for snake OR (deer AND fillet), you would like to retrieve hits where snake occurs and you would like to retrieve hits where deer and fillet go together. In practice this means that you will get a lot of snake-only hits.

You can also nest parentheses, e.g. (snake OR (deer AND fillet)) NOT pricket will remove the hits with pricket from snake OR (deer AND fillet).
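The effect of grouping can be checked with ordinary boolean expressions in Python. The sketch below is only an analogy: Python's and, or and not follow Python's own precedence rules, not Lucene's, so every group is parenthesised explicitly. The function names are made up for the example.

```python
def left_grouped(words):
    """(snake OR deer) AND fillet"""
    w = set(words)
    return ("snake" in w or "deer" in w) and "fillet" in w

def right_grouped(words):
    """snake OR (deer AND fillet)"""
    w = set(words)
    return "snake" in w or ("deer" in w and "fillet" in w)

def nested(words):
    """(snake OR (deer AND fillet)) NOT pricket"""
    return right_grouped(words) and "pricket" not in set(words)

# A passage with snake alone is a hit only for the second grouping.
snake_alone = ["snake"]
difference = (left_grouped(snake_alone), right_grouped(snake_alone))
```

The same three words thus give different hit sets depending only on where the parentheses are placed.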

As you can see, the options are many. And as if this were not enough, there is also regex, and regex syntax combined with standard syntax!

Phrase Search

A phrase search searches for a series of words in sequence. You activate phrase search by enclosing the words in quotation marks. If you want to search for all occurrences of "fenny snake", you should input the search expression "fenny snake". This is the way searches are performed in word processing documents, except that here punctuation is disregarded. To find "Fire burn, and cauldron bubble", you do not have to input the comma (but it does no harm - it is removed automatically from your search expression).
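How a phrase search disregards punctuation can be sketched in Python as follows. The tokenize function is a simplified stand-in for what the search engine does to text before indexing; that it lower-cases the words is an assumption made for the example, and both function names are hypothetical.

```python
import re

def tokenize(text):
    """Lower-case the text and keep only runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

def phrase_match(phrase, text):
    """True if the words of `phrase` occur consecutively and in order in `text`."""
    needle, haystack = tokenize(phrase), tokenize(text)
    n = len(needle)
    return n > 0 and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )

line = "Double, double toil and trouble; Fire burn, and cauldron bubble."
with_comma = phrase_match("Fire burn, and cauldron bubble", line)
without_comma = phrase_match("Fire burn and cauldron bubble", line)
```

Because the comma is stripped from both the phrase and the text, the two searches behave identically.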

Proximity Search

A proximity search allows you to find words that occur in the order given, but not necessarily directly after one another: they only have to occur within a certain distance of each other. You put the search engine in proximity search mode by enclosing the words in quotation marks and adding a tilde ~ after the closing quotation mark, immediately followed by the number of words you wish to allow between the search words. If you search for "Fire bubble"~3 you will thus find "Fire burn, and cauldron bubble", but if you only allow two words to intervene, "Fire bubble"~2, you will not. More than two search words are allowed: you can retrieve "I pray thee, stay with us: go not to Wittenberg" with "pray stay Wittenberg"~6.

If you wish to search for words within a certain proximity, but regardless of the order they occur in, you have to use an OR expression: "slumber beware"~5 OR "beware slumber"~5.
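The two-word case can be modelled in Python as below. This is a simplified sketch of the behaviour described here, not Lucene's own matching algorithm; the function name within and its parameters are made up for the example.

```python
import re

def within(first, second, max_between, text):
    """True if `second` follows `first` with at most `max_between` words in between."""
    words = re.findall(r"\w+", text.lower())
    first, second = first.lower(), second.lower()
    for i, word in enumerate(words):
        if word != first:
            continue
        # Look at the next max_between + 1 words after this occurrence of `first`.
        window = words[i + 1 : i + 2 + max_between]
        if second in window:
            return True
    return False

line = "Fire burn, and cauldron bubble"
three_allowed = within("Fire", "bubble", 3, line)  # like "Fire bubble"~3
two_allowed = within("Fire", "bubble", 2, line)    # like "Fire bubble"~2

# Either order, as with the OR expression: try both directions.
either_order = within("bubble", "Fire", 3, line) or within("Fire", "bubble", 3, line)
```

Three words stand between "Fire" and "bubble", so the first check succeeds and the second does not.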

Fuzzy Search

"Fuzzy Search" needs a little explanation. If you take a word, like "snake", you can make changes and additions to it. One change would thus give you "spake", "slave", "snare" and "snakes". If you make one more edits based on this, you can easily see that a lot of words can be generated. Since this search is very time-consuming, the maximum number of "edits" you can make is 2. To activate fuzzy search, you add a tilde ~ after the word in question, snake~2.

Wildcard Expressions

Wildcard Expressions allow you to mask individual characters or sequences of characters inside words. Say, to switch examples, you wish to find all occurrences of "test" and of "text". Now, you could search for text test (or text OR test), and this would give you what you wanted, but you can also use the wildcard ? and search for te?t instead. This will find all words that consist of the two characters "te", plus one character which can be anything, and end with the character "t". It would also find "teat".

In addition to ?, which requires exactly one character, you can also use *, which stands for any character zero, one or more times. If you search for te*t, you are therefore likely to find many more words, among them "test" and "text", but also "tempest", "testament" and "tent".

You do not have to signal in any way that you are performing a wildcard search: just including ? or * is enough.

A wildcard character cannot occur first in a search expression, so you cannot find "Hamlet" with ?amlet or *let. If you wish to perform searches of this kind, you should use a regular expression.
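Python's standard fnmatch module happens to use the same two wildcards, so it can serve to try out patterns. This is only an analogy: fnmatch also gives square brackets a special meaning and has no restriction on leading wildcards, and the word list below is made up for the example.

```python
from fnmatch import fnmatchcase

vocabulary = ["test", "text", "teat", "tent", "tempest", "testament", "hamlet"]

def wildcard_hits(pattern, words):
    """Words matching a pattern in which ? is one character and * any run of characters."""
    return [w for w in words if fnmatchcase(w, pattern)]

single = wildcard_hits("te?t", vocabulary)  # exactly one character between "te" and "t"
many = wildcard_hits("te*t", vocabulary)    # any number of characters in between
```

Every word found with te?t is also found with te*t, but not the other way round.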

Regular Expressions

Regular Expressions are also known as "regex" or "regexp". They are a very powerful tool for searching text (and for replacing text, but this is not relevant here). Lucene only supports a restricted range of regex operators, but they should be sufficient for most uses.

You put the search engine into regex mode by enclosing your search term with slashes. So you would search e.g. for /.{3}let/ to find "Hamlet" (but also, e.g. "fillet").
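Regex patterns can be tried out with Python's re module, as in the sketch below. Python's regex dialect is richer than Lucene's, so this is an approximation; re.fullmatch is used because a Lucene regular expression has to match a whole word, not just part of it. The enclosing slashes are left out, the helper regex_hits is hypothetical, and the lower-cased word list is an assumption made for the example.

```python
import re

def regex_hits(pattern, words):
    """Words matched in full by the pattern (written without the enclosing slashes)."""
    compiled = re.compile(pattern)
    return [w for w in words if compiled.fullmatch(w)]

words = ["hamlet", "fillet", "let", "bracelet", "letter"]
hits = regex_hits(".{3}let", words)  # three characters of any kind, then "let"
```

Only words of exactly six letters ending in "let" are found; "bracelet" has too many letters before "let" and "letter" does not end in it.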

Match any character

The period . can be used to represent any character (this is the same as the ? wildcard).

In order to retrieve the string "snake", the following expressions can be used:

  • /s.ake/
  • /.nak./

One-or-more

The plus sign "+" can be used to repeat the preceding shortest pattern one or more times.

In order to retrieve the string "deer", the following expression can be used:

  • /de+r/

Zero-or-more

The asterisk * can be used to match the preceding shortest pattern zero or more times. Note that this applies to what comes before the asterisk, whereas the wildcard * stands for characters in itself (a wildcard * amounts to a regex .*).

In order to retrieve both the strings "weed" and "wed" (but not "welcomed" or "westward", for which you would need /we.*d/, since the asterisk here only repeats the preceding "e"), the following expression can be used:

  • /we*d/

Zero-or-one

The question mark ? makes the preceding shortest pattern optional. It matches zero or one times. Note that in Lucene wildcard searches, ? stands for a character in itself; in regex searches the question mark quantifies the immediately preceding character (or pattern).

In order to retrieve the strings "weed" and "wed", the following expression can be used:

  • /wee?d/

Min-to-max

Curly brackets {} can be used to specify a minimum and (optionally) a maximum number of times the preceding shortest pattern can repeat. The allowed forms are:

{5} repeat exactly 5 times
{2,5} repeat at least twice and at most 5 times
{2,} repeat at least twice

In order to retrieve the string "weed", any of the following expressions can be used:

  • /we{2}d/
  • /we{2,}d/
  • /we{2,5}d/
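The quantifiers described so far (+, *, ? and curly brackets) can be compared side by side in Python. As before, Python's re module stands in for Lucene's regex engine, with re.fullmatch imitating the whole-word matching; the table of test words is made up for the illustration.

```python
import re

def matches(pattern, word):
    """True if the whole word matches the pattern."""
    return re.fullmatch(pattern, word) is not None

five_e = "w" + "e" * 5 + "d"
six_e = "w" + "e" * 6 + "d"

# pattern -> (words it should match, words it should not match)
checks = {
    "de+r":     (["der", "deer"], ["dr"]),
    "we*d":     (["wd", "wed", "weed"], ["welcomed"]),
    "we.*d":    (["wed", "weed", "welcomed", "westward"], ["wd"]),
    "wee?d":    (["wed", "weed"], ["weeed"]),
    "we{2}d":   (["weed"], ["wed", "weeed"]),
    "we{2,}d":  (["weed", "weeed", six_e], ["wed"]),
    "we{2,5}d": (["weed", five_e], ["wed", six_e]),
}

results = {
    pattern: all(matches(pattern, w) for w in yes)
    and not any(matches(pattern, w) for w in no)
    for pattern, (yes, no) in checks.items()
}
```

The second and third rows show the difference between repeating the letter "e" (e*) and repeating any character (.*).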

Grouping

Parentheses () can be used to form sub-patterns. The quantity operators listed above operate on the shortest previous pattern, which can be a group.

In order to retrieve the string "weed", any of the following expressions can be used:

  • /w(..)+d/
  • /w(ee)*d/
  • /w(ee)?d/

Alternation

The pipe symbol | acts as an OR operator. The match will succeed if the pattern on either the left-hand side or the right-hand side matches. This is of course equivalent to the OR operator in standard Lucene syntax.

In order to retrieve the strings "proportions" and "preparations", the following expression can be used:

  • /(prepara|propor)tions/
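Grouping and alternation can be tried out in the same way. The sketch again uses Python's re.fullmatch as an approximation of Lucene's whole-word regex matching, and the helper name full is made up for the example.

```python
import re

def full(pattern, word):
    """True if the whole word matches the pattern."""
    return re.fullmatch(pattern, word) is not None

# A quantifier after a group repeats the whole group, not just one character.
pairs = [full("w(..)+d", w) for w in ("weed", "weeded", "wed")]
repeated = [full("w(ee)*d", w) for w in ("wd", "weed", "weeeed", "wed")]
optional = [full("w(ee)?d", w) for w in ("wd", "weed", "weeeed")]

# Alternation inside a group: either beginning, followed by the shared ending.
alternation = [
    full("(prepara|propor)tions", w)
    for w in ("preparations", "proportions", "portions")
]
```

In /w(..)+d/ the group stands for two characters, so only words with an even number of characters between "w" and "d" are found.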

Character classes

Character classes are very important, since they allow you to mask variation with more control than that offered by wildcards. You can thus use them to find words even though they are written differently, e.g. have either "e" or "o" in a certain position or have either "a" or "e" in a certain position.

Ranges of potential characters may be represented as character classes by enclosing them in square brackets []. A leading caret ^ negates the character class, that is, all characters other than the ones following are signified.

The allowed forms are:

[abc] 'a' or 'b' or 'c'
[a-c] 'a' to 'c', i.e. 'a' or 'b' or 'c'
[-abc] '-' or 'a' or 'b' or 'c'
[abc\-] 'a' or 'b' or 'c' or '-'
[^abc] any character except 'a' or 'b' or 'c'
[^a-c] any character except 'a' or 'b' or 'c'
[^-abc] any character except '-' or 'a' or 'b' or 'c'

Note that the dash - indicates a range of characters, unless it is the first character or if it is escaped with a backslash.

The caret ^ negates the following characters.

In order to retrieve the string "weed", any of the following expressions could be used:

  • /w[uiaeo]+d/
  • /w[uiaeo]*d/
  • /we[uiaeo]?d/
  • /w[a-u]*ed/
  • /we[^uiao]d/
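The forms of character classes can be checked in Python as well. The sketch uses Python's re.fullmatch in place of Lucene's regex engine, so it is an approximation; the helper name full and the chosen test characters are made up for the example.

```python
import re

def full(pattern, word):
    """True if the whole word matches the pattern."""
    return re.fullmatch(pattern, word) is not None

# class -> characters it should accept
class_examples = {
    r"[abc]":   "abc",
    r"[a-c]":   "abc",
    r"[-abc]":  "-abc",   # a leading dash is an ordinary character
    r"[abc\-]": "abc-",   # so is a dash escaped with a backslash
}
accepted = all(
    full(pattern, ch) for pattern, chars in class_examples.items() for ch in chars
)

# A negated class rejects the listed characters and accepts any other one.
rejected = not any(full(r"[^-abc]", ch) for ch in "-abc") and full(r"[^-abc]", "d")

# All five expressions from the list above find "weed".
weed_patterns = ["w[uiaeo]+d", "w[uiaeo]*d", "we[uiaeo]?d", "w[a-u]*ed", "we[^uiao]d"]
all_find_weed = all(full(p, "weed") for p in weed_patterns)
```

Note that [a-c] rejects a dash while [-abc] accepts it: the position of the dash decides whether it marks a range or stands for itself.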

The possibilities are enormous.

There are plenty of regex tutorials. A good one can be found at regular-expressions.info.

The exact definition of the regex possibilities in Lucene can be found in a Lucene Java doc.