Highlighting fields in memory-based Lucene indexes

tags: #Highlighting #Indexing #Lucene
categories: Data Java Lucene
published: June 24, 2013
reading time: 2 minutes

I'm using Lucene more and more these days and digging deeper into a few subjects. Today I'm going to talk about how to handle the new highlighting features available with Lucene 4.1.

One of the main achievements of this new version is the great PostingsHighlighter. Michael McCandless wrote a great piece about it, “A new Lucene highlighter is born”, and I encourage you to read it if you want to get serious about highlighting with Lucene :).

Now let's say you want to use it on a MemoryIndex. The MemoryIndex is arguably the best in-memory index type, handling more than ~500k queries/s and offering the “perfect” reset() method, so that would be great, right? But it stays a nice dream: the MemoryIndex doesn't store anything about the raw data, and the PostingsHighlighter needs the stored field value to build its highlighted passages, so... we need a plan B.

Plan B is to use the old-fashioned, but still useful, RAMDirectory, which behaves like a normal “Directory”-based index and lets you store the data you need on the field to match. Here is an example of how to use it:

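[code language="java"]
import java.util.Arrays;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.postingshighlight.PostingsHighlighter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// the statements below go inside a method that declares "throws Exception"
final int MAX_DOCS = 10;
final String FIELD_NAME = "text";
final Directory index = new RAMDirectory();
final StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);

IndexWriterConfig writerConfig = new IndexWriterConfig(Version.LUCENE_41, analyzer);
IndexWriter writer = new IndexWriter(index, writerConfig);
// create document
Document document = new Document();
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true); // it needs to be stored to be properly highlighted
type.setTokenized(true);
type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); // necessary for PostingsHighlighter
document.add(new Field(FIELD_NAME, "this an example of text that must be highlighted", type));
// add it to the index
writer.addDocument(document);
writer.commit();
writer.close();

// search the freshly written in-memory index
Query query = new QueryParser(Version.LUCENE_41, FIELD_NAME, analyzer).parse("example");
DirectoryReader directoryReader = DirectoryReader.open(index);

IndexSearcher searcher = new IndexSearcher(directoryReader);
PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, MAX_DOCS);
// one highlighted snippet per hit, in the same order as topDocs
String[] strings = highlighter.highlight(FIELD_NAME, query, searcher, topDocs);
System.out.println(Arrays.toString(strings));
// expected output, with the default formatter wrapping the match in <b> tags:
// [this an <b>example</b> of text that must be highlighted]
directoryReader.close();
[/code]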

I'm honestly considering using both indexes right now: querying the MemoryIndex heavily, and using the RAMDirectory only when I know there's a match and I need the highlighting features. Maybe I'm not done digging around this hole and there's a way to make a highlighter work with the MemoryIndex, but I doubt it, both conceptually and after testing everything I could.
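Here is a minimal sketch of that two-index idea, written in plain Java so it runs without Lucene on the classpath. Everything in it is an assumption made for illustration: the `TwoStageHighlighter` class and its `containsTerm` / `boldTerm` helpers are hypothetical stand-ins, not Lucene API. In a real setup the cheap check would be something like a MemoryIndex search returning a score above zero, and the costly path would be the RAMDirectory plus PostingsHighlighter code shown earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.BiPredicate;

/** Hypothetical helper: run a cheap match test first, highlight only on a hit. */
class TwoStageHighlighter {
    // stand-in for the fast MemoryIndex match check (assumption, not Lucene API)
    private final BiPredicate<String, String> fastMatcher;
    // stand-in for the slower RAMDirectory + PostingsHighlighter path (assumption)
    private final BiFunction<String, String, String> slowHighlighter;
    private int highlightCalls = 0;

    TwoStageHighlighter(BiPredicate<String, String> fastMatcher,
                        BiFunction<String, String, String> slowHighlighter) {
        this.fastMatcher = fastMatcher;
        this.slowHighlighter = slowHighlighter;
    }

    /** Returns the highlighted text, or empty when the cheap check finds no match. */
    Optional<String> highlightIfMatch(String text, String term) {
        if (!fastMatcher.test(text, term)) {
            return Optional.empty(); // no match: the costly path is never touched
        }
        highlightCalls++;
        return Optional.of(slowHighlighter.apply(text, term));
    }

    /** How many times the costly highlighting path actually ran. */
    int highlightCalls() {
        return highlightCalls;
    }

    /** Naive matcher: true if any whitespace-separated token equals the term. */
    static boolean containsTerm(String text, String term) {
        for (String token : text.split("\\s+")) {
            if (token.equalsIgnoreCase(term)) {
                return true;
            }
        }
        return false;
    }

    /** Naive highlighter: wraps every token equal to the term in <b> tags. */
    static String boldTerm(String text, String term) {
        List<String> out = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            out.add(token.equalsIgnoreCase(term) ? "<b>" + token + "</b>" : token);
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        TwoStageHighlighter h = new TwoStageHighlighter(
                TwoStageHighlighter::containsTerm, TwoStageHighlighter::boldTerm);
        System.out.println(h.highlightIfMatch(
                "this an example of text that must be highlighted", "example")
                .orElse("(no match)"));
        System.out.println(h.highlightIfMatch("nothing to see here", "example")
                .orElse("(no match)"));
        System.out.println("highlight calls: " + h.highlightCalls());
    }
}
```

The point of the design is only the control flow: documents that fail the cheap check never reach the highlighting path, so the cost of the second index is paid per match rather than per document.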

If you know a way to make a highlighter work with the MemoryIndex, tell me 🙂

Vale