Indexes and Searching

In contrast to search in a word processing document, search in a database goes through an index. Just as in the index of a book, terms are extracted from the text and arranged in a list where each item is keyed to their positions in the text. When searching a database, the search terms are matched against the index and the parts of the text that contain the terms are then displayed.

Sanskrit can be written in various acripts; in SARIT, Devanagari script and IAST romanisation are used. To the database, the fact that different scripts are used does not matter — they are treated in the same way, as Unicode characters. However, no matter whether Devanagari or romanisation is used, there are differences in the way texts are segmented, that is, how texts are divided by the use of spaces (and other characters) into words. This is of great importance when searching SARIT texts, for one kind of index suits texts that are divided (more or less) into words and another kind suits texts that (more or less) are written as one long string of characters. The first of these, the word index, is the Lucene Full-Text index, and the second, the string index, is the NGram index. Whereas the the Lucene Full-Text index contains words, the NGram Index consists of overlapping chunks of text three characters long (tri-grams).

Different kinds of indexes mean different search options. With a Lucene Full-Text index, you can, e.g., search for contexts in which a number of words are close to each other, but with an NGram index, this is not easily possible. With an NGram index, you can use wildcards, but you basically search for what is there, that is, what you input as your search term is what you will retrieve. The important thing to note here is that if you use the NGram index, you will retrieve instances in which your search string occurs inside longer strings in the text, whereas if you use the Lucene Full-Text index (and do not use wildcards), you will find only instances in the text that are identical to your search string.

All SARIT texts are indexed in both ways and you choose which index to use. Since not all texts divide the text into words, you should choose the index that is most suited for your text. This is not done automatically.

All the texts are marked up according to the TEI Guidelines. A text marked up in this way basically divides into a header describing the text edition and the text edition proper. The header, being written mostly in English, will have standard word divisions, so here the Lucene Full-Text index and its associated search options will be profitable to use, but the way word division is implemented in the text proper differs from text to text.

However, though there are these differences, if you are looking simply for a string match of a single search term, and do not mind if the string occurs as part of a larger string, use the defaults and search using the NGram index, optionally with the regex methods described in NGram Search. This is easiest and very fast. If the text is divided into words, you can use the Lucene Full-Text index and its many options. Now, to complicate matters, the Lucene Full-Text index has quite advanced regex options, described in Lucene Regex Search, and one could, if one wished to do so, use these options instead of the NGram Index, but the drawback here is that this would make searches quite slow and resource-demanding. Choose your index wisely!

Word division is a matter of degree. Among the SARIT texts, only the Arthaśāstra can be said to be divided into words in a way which makes the use of the Lucene Full-Text index immediately suitable. For information about the remaining texts, see Word Division. With these texts, you should as a rule use the NGram Index.

Search Targets

A search is performed in a number of fields – to be more exact, in indexes based on certain fields. This contrasts with searches in a word processing document which are made in a document as a whole, as one long string of characters or words.

However, the word "fields" is not adequate. If you search in a bibliography, you might search for a term inside the title field or for an author in the author field, but with texts it is different. Here you might have a name occurring in the middle of a sentence, and this name is wrapped up a name tag, like this <name>Kamalaśīla</name>, so here the name "field" occurs in the middle of some plain text. The correct word to use for this is "element" – text is contained in elements, just as the text string "Kamalaśīla" is enclosed in the "name" element.

Elements come in different kinds: one kind is used to represent the major structural divisions of the text and another is used to wrap around e.g. names inside text in the manner just described. The structural divisions of a text forms a hierarchy, just as a book contains a number of chapters, which contain a number of sections, which contain a number of paragraphs – which then contain the actual text, with or without any inline elements.

SARIT texts are of different kinds, so the structural elements they use are not completely the same. However, all texts divide into a number of divisions, called "div"s. In the table of contents, the top three divs of each text are displayed. Some texts have that number of structural levels, some less and some more, but only three levels are displayed in order to speed up display in the case of texts with very fine-grained structure.

When you search for the co-occurrence of two words, say "svabhāva" and "lakṣaṇa", you do not search in the text as a whole, but you get a hit only if the two words occur in the same text-containing element. A text-containing element might here be a paragraph or a line group. The number of hits you get is not the number of times the search term(s) occur in the text as a whole, but the number of search targets that contain the search term(s) – it may well be that the same paragraph contains several instances of "lakṣaṇa", but you only find the paragraph once, in which case you only have one hit, but several matches.

The number of hits in a text is not the number of times the search term occurs in the text as a whole – it is the number of times a search target in the text (e.g. a paragraph or a div) contains one or more instances of the search term, so if five paragraphs contain your search term, you will get five hits when you search on paragraphs, but only one hit if you are searching on sections and the five paragraphs are contained within one and the same section. Note that each and every instance of the search term(s) is displayed in the hitlist – only the number of hits is affected by the search context.

The search scope here described is the default "narrow" search scope. It searches in the text-containing elements immediately below the structural elements, that is, it searches in paragraphs and suchlike. If you choose, you can search more broadly, in the divisions above these text-containing elements. This means that you search in the lowest "div"s. If you do so, the co-occurrence of two words can span several paragraphs and you may search for words which are within, say, 10 words of each other, though they belong to different paragraphs. Both kind of search scopes are useful for certain tasks, so you should also choose your search scope carefully!

In the index, no matter whether it is a Lucene Full-Text index or an NGram index, index items are lower-cased. This means that there is no way to differentiate between, say, "Hamlet" (the proper name) and "hamlet" (a village), but it also means that you do not have to remember whether the word you search for occurs first in a sentence and therefore is in upper-case. The search terms you input are lower-cased, upper-casing search strings has no effect, but note that some of operators used with Lucene Full-Text search ("AND", "OR", "NOT") have to be upper-case.

With the Lucene Full-Text index you search for words — spaces and punctuation and so on cannot be searched for, only the words themselves. All your input is stripped of punctuation, so you can just as well spare yourself the effort to input them – whether you search for "To be, or not to be" or "to be or not to be" is all the same.

One may ask: if spaces cannot be searched for, what are phrase searches? Don't phrases consist of words with spaces and punctuation in between? Yes, but when you search for a phrase with the Lucene Full-Text index, this is not the same as when you search for a phrase in a word processing document – with Lucene, you actually search for a sequence of words which have no words in between them, so everything is about words after all (and a phrase search is actually a proximity search – more about that later).

To be continued ….