The standard way that Solr builds the index is with an inverted index. This style builds a list of terms found in all the documents in the index and next to each term is a list of documents that the term appears in (as well as how many times the term appears in that document). This makes search very fast - since users search by terms, having a ready list of term-to-document values makes the query process faster.
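The term-to-document list described above can be sketched in a few lines. This is only an illustration of the idea, not Lucene's on-disk format; the function name `build_inverted_index`, the whitespace tokenizer, and the sample documents are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: number of times the term occurs in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Naive tokenization: lowercase and split on whitespace (an assumption).
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return dict(index)


docs = [
    "solr builds an index",
    "an inverted index is fast",
    "solr solr search",
]
index = build_inverted_index(docs)

# Searching by term is a single dictionary lookup.
print(index["solr"])   # {0: 1, 2: 2}
print(index["index"])  # {0: 1, 1: 1}
```

Answering "which documents contain this term" is cheap here, but answering "which terms does document 2 contain" would require scanning every entry, which is the weakness the next paragraph describes.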
For other features that we now commonly associate with search, such as sorting, faceting, and highlighting, this approach is not very efficient. The faceting engine, for example, must look up each term that appears in each document that will make up the result set and pull the document IDs in order to build the facet list. In Solr, this is maintained in memory, and can be slow to load (depending on the number of documents, terms, etc.).
In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
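To see why a document-to-value mapping helps faceting, consider the following minimal sketch. It assumes the column is a plain list indexed by internal document ID; the names `category_by_doc` and `facet_counts` and the sample data are hypothetical, and real DocValues are compressed rather than stored as a Python list.

```python
from collections import Counter

# Hypothetical column: position = internal document ID, value = field value.
category_by_doc = ["book", "music", "book", "film", "book"]


def facet_counts(column, matching_doc_ids):
    """Count field values over a result set with one lookup per matching doc."""
    return Counter(column[doc_id] for doc_id in matching_doc_ids)


hits = [0, 2, 3]  # document IDs returned by some query
counts = facet_counts(category_by_doc, hits)
print(dict(counts))  # {'book': 2, 'film': 1}

# Sorting the same result set by field value uses the same direct lookups.
print(sorted(hits, key=lambda doc_id: category_by_doc[doc_id]))  # [0, 2, 3]
```

The work is proportional to the number of hits, not to the number of terms in the field.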
From day one, Lucene has provided a solid inverted index data structure and the ability to store text and binary chunks in stored fields. In a typical use case, the inverted index is used to retrieve and score documents matching one or more terms. Once the matching documents have been scored, stored fields are loaded for the top N documents for display purposes. So far so good! However, the retrieval process is essentially limited to the information available in the inverted index, such as term frequency, boosts, and normalization factors. So what if you need custom information to score or filter documents? Stored fields are designed for bulk reads, meaning they perform best when you load all of a document's data, whereas document retrieval needs more fine-grained access.
Lucene provides a RAM-resident FieldCache that is built from the inverted index the first time the FieldCache for a specific field is requested, or when the index is reopened. Internally we call this process un-inverting the field, since the inverted index is a value-to-document mapping and the FieldCache is a document-to-value data structure. For simplicity, think of an array indexed by Lucene's internal document ID. When the FieldCache is loaded, Lucene iterates over all terms in a field, parses the term values, and fills the array slots based on the document IDs associated with each term. Figure 1 illustrates the process.
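The un-inverting step can be sketched roughly as follows. This is a simplified model, not Lucene's FieldCache code: the function `uninvert`, its parameters, and the sample postings are assumptions, and the field is taken to be single-valued with integer terms.

```python
def uninvert(postings, max_doc, parse=int, missing=0):
    """Build a doc-ID-indexed array from a term -> [doc IDs] mapping."""
    values = [missing] * max_doc
    for term, doc_ids in postings.items():
        value = parse(term)          # parse the term text once per term
        for doc_id in doc_ids:
            values[doc_id] = value   # fill the slot for each doc holding the term
    return values


# Hypothetical postings for a "price" field: term text -> documents containing it.
price_postings = {"10": [0, 3], "25": [1], "40": [2]}
cache = uninvert(price_postings, max_doc=5)

print(cache)     # [10, 25, 40, 10, 0]
print(cache[3])  # 10
```

Building the array touches every term and every posting, which is why the first request for a field is slow, while each later read is a plain array index.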
FieldCache serves its purpose very well, since accessing a value is basically a constant-time array lookup. There are special cases where other data structures are used in FieldCache, but those are out of scope for this post.
Excerpted from: http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
Lucene has four underlying types that a docvalues field can have. Currently Solr uses three of these:
NUMERIC, a single-valued numeric type. For example, consider three documents with these values:
    doc[0] = 1005
    doc[1] = 1006
    doc[2] = 1005
In this example the field would use around 1 bit per document, since that is all that is needed.
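One way to arrive at the 1-bit figure is to store each value as an offset from the smallest value in the field. The sketch below assumes that encoding purely for illustration; Lucene's actual compression is more involved, and the function name `bits_per_value` is hypothetical.

```python
def bits_per_value(values):
    """Return (minimum, offsets from the minimum, bits needed per offset)."""
    lowest = min(values)
    deltas = [v - lowest for v in values]
    # int.bit_length() gives the number of bits needed for the largest offset.
    return lowest, deltas, max(deltas).bit_length()


lowest, deltas, bits = bits_per_value([1005, 1006, 1005])
print(lowest, deltas, bits)  # 1005 [0, 1, 0] 1
```

Only the offsets 0 and 1 need to be stored per document, so one bit each is enough once the minimum 1005 is recorded separately.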
SORTED, a single-valued string type. For example, consider three documents with these values:
    doc[0] = "aardvark"
    doc[1] = "beaver"
    doc[2] = "aardvark"
Value "aardvark" will be assigned ordinal 0, and "beaver" 1, creating these two data structures:
    doc[0] = 0
    doc[1] = 1
    doc[2] = 0
    term[0] = "aardvark"
    term[1] = "beaver"
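The ordinal assignment for a single-valued string field can be reproduced with a small dictionary-encoding sketch. The function name `encode_sorted` is hypothetical, and plain Python lists stand in for the compressed structures Lucene would write.

```python
def encode_sorted(values):
    """Return (per-document ordinals, sorted term dictionary)."""
    terms = sorted(set(values))                      # unique values in sorted order
    ordinal = {term: i for i, term in enumerate(terms)}
    return [ordinal[v] for v in values], terms


doc_ords, terms = encode_sorted(["aardvark", "beaver", "aardvark"])
print(doc_ords)  # [0, 1, 0]
print(terms)     # ['aardvark', 'beaver']

# Reading a document's value back is two array lookups.
print(terms[doc_ords[2]])  # aardvark
```

Because ordinals follow the sorted order of the terms, documents can be sorted by comparing the small integers alone, without touching the strings.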
SORTED_SET, a multi-valued string type. For example, consider three documents with these values:
    doc[0] = "cat", "aardvark", "beaver", "aardvark"
    doc[1] =
    doc[2] = "cat"
Value "aardvark" will be assigned ordinal 0, "beaver" 1, and "cat" 2, creating these two data structures:
    doc[0] = [0, 1, 2]
    doc[1] = []
    doc[2] = [2]
    term[0] = "aardvark"
    term[1] = "beaver"
    term[2] = "cat"
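The multi-valued case works the same way, except that each document keeps a sorted set of ordinals, so duplicates within a document collapse and an empty document gets an empty list. As before, this is a minimal sketch with a hypothetical function name (`encode_sorted_set`), not Lucene's implementation.

```python
def encode_sorted_set(docs):
    """Return (sorted, de-duplicated ordinals per document, term dictionary)."""
    terms = sorted({value for values in docs for value in values})
    ordinal = {term: i for i, term in enumerate(terms)}
    doc_ords = [sorted({ordinal[v] for v in values}) for values in docs]
    return doc_ords, terms


docs = [
    ["cat", "aardvark", "beaver", "aardvark"],
    [],
    ["cat"],
]
doc_ords, terms = encode_sorted_set(docs)
print(doc_ords)  # [[0, 1, 2], [], [2]]
print(terms)     # ['aardvark', 'beaver', 'cat']
```

Note that the original value order and the repeated "aardvark" in doc[0] are not recoverable from this encoding.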