Online edition c 2009 Cambridge UP

Online Edition C 2009 Cambridge Up-Free PDF

  • Date:24 Jun 2020
  • Views:9
  • Downloads:0
  • Pages:18
  • Size:217.51 KB

Share Pdf : Online Edition C 2009 Cambridge Up

Download and Preview : Online Edition C 2009 Cambridge Up

Report CopyRight/DMCA Form For : Online Edition C 2009 Cambridge Up


DRAFT April 1 2009 Cambridge University Press Feedback welcome 49. 3 Dictionaries and tolerant, In Chapters 1 and 2 we developed the ideas underlying inverted indexes. for handling Boolean and proximity queries Here we develop techniques. that are robust to typographical errors in the query as well as alternative. spellings In Section 3 1 we develop data structures that help the search. for terms in the vocabulary in an inverted index In Section 3 2 we study. WILDCARD QUERY the idea of a wildcard query a query such as a e i o u which seeks doc. uments containing any term that includes all the five vowels in sequence. The symbol indicates any possibly empty string of characters Users pose. such queries to a search engine when they are uncertain about how to spell. a query term or seek documents containing variants of a query term for in. stance the query automat would seek documents containing any of the terms. automatic automation and automated, We then turn to other forms of imprecisely posed queries focusing on. spelling errors in Section 3 3 Users make spelling errors either by accident. or because the term they are searching for e g Herman has no unambiguous. spelling in the collection We detail a number of techniques for correcting. spelling errors in queries one term at a time as well as for an entire string. of query terms Finally in Section 3 4 we study a method for seeking vo. cabulary terms that are phonetically close to the query term s This can be. especially useful in cases like the Herman example where the user may not. know how a proper name is spelled in documents in the collection. Because we will develop many variants of inverted indexes in this chapter. we will use sometimes the phrase standard inverted index to mean the inverted. index developed in Chapters 1 and 2 in which each vocabulary term has a. postings list with the documents in the collection. 3 1 Search structures for dictionaries, Given an inverted index and a query our first task is to determine whether. each query term exists in the vocabulary and if so identify the pointer to the. Online edition c 2009 Cambridge UP,50 3 Dictionaries and tolerant retrieval. corresponding postings This vocabulary lookup operation uses a classical. data structure called the dictionary and has two broad classes of solutions. hashing and search trees In the literature of data structures the entries in. the vocabulary in our case terms are often referred to as keys The choice. of solution hashing or search trees is governed by a number of questions. 1 How many keys are we likely to have 2 Is the number likely to remain. static or change a lot and in the case of changes are we likely to only have. new keys inserted or to also have some keys in the dictionary be deleted 3. What are the relative frequencies with which various keys will be accessed. Hashing has been used for dictionary lookup in some search engines Each. vocabulary term key is hashed into an integer over a large enough space. that hash collisions are unlikely collisions if any are resolved by auxiliary. structures that can demand care to maintain 1 At query time we hash each. query term separately and following a pointer to the corresponding post. ings taking into account any logic for resolving hash collisions There is no. easy way to find minor variants of a query term such as the accented and. non accented versions of a word like resume since these could be hashed to. very different integers In particular we cannot seek for instance all terms. beginning with the prefix automat an operation that we will require below. in Section 3 2 Finally in a setting such as the Web where the size of the. vocabulary keeps growing a hash function designed for current needs may. not suffice in a few years time, Search trees overcome many of these issues for instance they permit us.
to enumerate all vocabulary terms beginning with automat The best known. BINARY TREE search tree is the binary tree in which each internal node has two children. The search for a term begins at the root of the tree Each internal node in. cluding the root represents a binary test based on whose outcome the search. proceeds to one of the two sub trees below that node Figure 3 1 gives an ex. ample of a binary search tree used for a dictionary Efficient search with a. number of comparisons that is O log M hinges on the tree being balanced. the numbers of terms under the two sub trees of any node are either equal. or differ by one The principal issue here is that of rebalancing as terms are. inserted into or deleted from the binary search tree it needs to be rebalanced. so that the balance property is maintained, To mitigate rebalancing one approach is to allow the number of sub trees. under an internal node to vary in a fixed interval A search tree commonly. B TREE used for a dictionary is the B tree a search tree in which every internal node. has a number of children in the interval a b where a and b are appropriate. positive integers Figure 3 2 shows an example with a 2 and b 4 Each. branch under an internal node again represents a test for a range of char. 1 So called perfect hash functions are designed to preclude collisions but are rather more com. plicated both to implement and to compute,Online edition c 2009 Cambridge UP. 3 2 Wildcard queries 51, Figure 3 1 A binary search tree In this example the branch at the root partitions. vocabulary terms into two subtrees those whose first letter is between a and m and. acter sequences as in the binary tree example of Figure 3 1 A B tree may. be viewed as collapsing multiple levels of the binary tree into one this. is especially advantageous when some of the dictionary is disk resident in. which case this collapsing serves the function of pre fetching imminent bi. nary tests In such cases the integers a and b are determined by the sizes of. disk blocks Section 3 5 contains pointers to further background on search. trees and B trees, It should be noted that unlike hashing search trees demand that the char. acters used in the document collection have a prescribed ordering for in. stance the 26 letters of the English alphabet are always listed in the specific. order A through Z Some Asian languages such as Chinese do not always. have a unique ordering although by now all languages including Chinese. and Japanese have adopted a standard ordering system for their character. 3 2 Wildcard queries, Wildcard queries are used in any of the following situations 1 the user.
is uncertain of the spelling of a query term e g Sydney vs Sidney which. Online edition c 2009 Cambridge UP,52 3 Dictionaries and tolerant retrieval. Figure 3 2 A B tree In this example every internal node has between 2 and 4. leads to the wildcard query S dney 2 the user is aware of multiple vari. ants of spelling a term and consciously seeks documents containing any of. the variants e g color vs colour 3 the user seeks documents containing. variants of a term that would be caught by stemming but is unsure whether. the search engine performs stemming e g judicial vs judiciary leading to the. wildcard query judicia 4 the user is uncertain of the correct rendition of a. foreign word or phrase e g the query Universit Stuttgart. WILDCARD QUERY A query such as mon is known as a trailing wildcard query because the. symbol occurs only once at the end of the search string A search tree on. the dictionary is a convenient way of handling trailing wildcard queries we. walk down the tree following the symbols m o and n in turn at which point. we can enumerate the set W of terms in the dictionary with the prefix mon. Finally we use W lookups on the standard inverted index to retrieve all. documents containing any term in W, But what about wildcard queries in which the symbol is not constrained. to be at the end of the search string Before handling this general case we. mention a slight generalization of trailing wildcard queries First consider. leading wildcard queries or queries of the form mon Consider a reverse B tree. on the dictionary one in which each root to leaf path of the B tree corre. sponds to a term in the dictionary written backwards thus the term lemon. would in the B tree be represented by the path root n o m e l A walk down. the reverse B tree then enumerates all terms R in the vocabulary with a given. In fact using a regular B tree together with a reverse B tree we can handle. an even more general case wildcard queries in which there is a single sym. bol such as se mon To do this we use the regular B tree to enumerate the set. W of dictionary terms beginning with the prefix se then the reverse B tree to. Online edition c 2009 Cambridge UP,3 2 Wildcard queries 53. enumerate the set R of terms ending with the suffix mon Next we take the. intersection W R of these two sets to arrive at the set of terms that begin. with the prefix se and end with the suffix mon Finally we use the standard. inverted index to retrieve all documents containing any terms in this inter. section We can thus handle wildcard queries that contain a single symbol. using two B trees the normal B tree and a reverse B tree. 3 2 1 General wildcard queries, We now study two techniques for handling general wildcard queries Both. techniques share a common strategy express the given wildcard query qw as. a Boolean query Q on a specially constructed index such that the answer to. Q is a superset of the set of vocabulary terms matching qw Then we check. each term in the answer to Q against qw discarding those vocabulary terms. that do not match qw At this point we have the vocabulary terms matching. qw and can resort to the standard inverted index,Permuterm indexes.
PERMUTERM INDEX Our first special index for general wildcard queries is the permuterm index. a form of inverted index First we introduce a special symbol into our. character set to mark the end of a term Thus the term hello is shown here as. the augmented term hello Next we construct a permuterm index in which. the various rotations of each term augmented with all link to the original. vocabulary term Figure 3 3 gives an example of such a permuterm index. entry for the term hello, We refer to the set of rotated terms in the permuterm index as the per. muterm vocabulary, How does this index help us with wildcard queries Consider the wildcard. query m n The key is to rotate such a wildcard query so that the symbol. appears at the end of the string thus the rotated wildcard query becomes. n m Next we look up this string in the permuterm index where seeking. n m via a search tree leads to rotations of among others the terms man. Now that the permuterm index enables us to identify the original vocab. ulary terms matching a wildcard query we look up these terms in the stan. dard inverted index to retrieve matching documents We can thus handle. any wildcard query with a single symbol But what about a query such as. fi mo er In this case we first enumerate the terms in the dictionary that are. in the permuterm index of er fi Not all such dictionary terms will have. the string mo in the middle we filter these out by exhaustive enumera. tion checking each candidate to see if it contains mo In this example the. term fishmonger would survive this filtering but filibuster would not We then. Online edition c 2009 Cambridge UP,54 3 Dictionaries and tolerant retrieval. Figure 3 3 A portion of a permuterm index, run the surviving terms through the standard inverted index for document. retrieval One disadvantage of the permuterm index is that its dictionary. becomes quite large including as it does all rotations of each term. Notice the close interplay between the B tree and the permuterm index. above Indeed it suggests that the structure should perhaps be viewed as. a permuterm B tree However we follow traditional terminology here in. describing the permuterm index as distinct from the B tree that allows us to. select the rotations with a given prefix,3 2 2 k gram indexes for wildcard queries.
Whereas the permuterm index is simple it can lead to a considerable blowup. from the number of rotations per term for a dictionary of English terms this. can represent an almost ten fold space increase We now present a second. technique known as the k gram index for processing wildcard queries We. will also use k gram indexes in Section 3 3 4 A k gram is a sequence of k. characters Thus cas ast and stl are all 3 grams occurring in the term castle. We use a special character to denote the beginning or end of a term so the. full set of 3 grams generated for castle is ca cas ast stl tle le. k GRAM INDEX In a k gram index the dictionary contains all k grams that occur in any term. in the vocabulary Each postings list points from a k gram to all vocabulary. Online edition c 2009 Cambridge UP,3 2 Wildcard queries 55. etr beetroot metric petrify retrieval, Figure 3 4 Example of a postings list in a 3 gram index Here the 3 gram etr is. illustrated Matching vocabulary terms are lexicographically ordered in the postings. terms containing that k gram For instance the 3 gram etr would point to vo. cabulary terms such as metric and retrieval An example is given in Figure 3 4. How does such an index help us with wildcard queries Consider the. wildcard query re ve We are seeking documents containing any term that. begins with re and ends with ve Accordingly we run the Boolean q. 2009 Cambridge UP 50 3 Dictionaries and tolerant retrieval corresponding postings This vocabulary lookup operation uses a classical data structure called the dictionary and has two broad classes of solutions hashing and search trees In the literature of data structures the entries in the vocabulary in our case terms are often referred

Related Books