How to test and understand custom analyzers in Lucene

I've begun to work more and more with Apache Lucene, the great “low-level” library created by Doug Cutting. For those of you who may not know it, Lucene is the indexing and searching library used by great enterprise search servers like Apache Solr and Elasticsearch.

When you start to index and search data, most of the time you need to create a filtering and cleaning pipeline to transform your raw text data into something more indexable and slightly more standardized. Such a pipeline may include lowercasing, folding to ASCII or even stemming (transforming “eating” into “eat”, for instance). Defining such a pipeline means defining an Analyzer in Lucene-world, and while creating a new/custom one is a very easy process, tweaking it to your needs is another thing and needs thorough testing.
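For instance, here is a minimal sketch of such a pipeline written against the Lucene 4.x API. The class name MyPipelineAnalyzer is mine, and the filters it chains live in the lucene-analyzers-common artifact, which you would need on your classpath:

[code language="java"]
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class MyPipelineAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // split the raw text into word tokens
        Tokenizer source = new StandardTokenizer(Version.LUCENE_41, reader);
        // lowercase, fold accented characters to their ASCII equivalent, then stem
        TokenStream result = new LowerCaseFilter(Version.LUCENE_41, source);
        result = new ASCIIFoldingFilter(result);
        result = new PorterStemFilter(result);
        return new TokenStreamComponents(source, result);
    }
}
[/code]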

Today's article is precisely about that: how to test your own analyzer, or how to write a simple test case against one of Lucene's built-in analyzers, to allow you to better understand what they do and why they do it.

Luckily for us, with the latest version of Apache Lucene (4.1), we're not left on our own: Lucene comes with a test framework we can rely on. It needs a few tricks to work, though, so here we go:

You need tests, right? So we need to add the org.apache.lucene:lucene-test-framework Maven artifact. But not so fast: the test-framework must be declared before lucene-core, even though they live in completely different scopes, and you need at least Maven 2.0.9, because earlier versions don't make the classpath respect the dependency declaration order (what a beautiful world...):

[code language="xml"]
<!-- must be declared before lucene-core because of classpath ordering -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-test-framework</artifactId>
    <version>${lucene.version}</version>
    <scope>test</scope>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>${lucene.version}</version>
</dependency>
[/code]

Now, if you want to create a JUnit test for the behaviour of an analyzer, you have access to a new base class you can extend: BaseTokenStreamTestCase. But the joy of it all is not just being able to write “public class MyWonderfulTestCase extends BaseTokenStreamTestCase” and clap your hands; extending it gives you access to a brand new set of assertions (by the way, you need to run your tests with assertions enabled, passing the -ea parameter as a VM argument; see the Surefire snippet right after this list):

  • assertTokenStreamContents: checks the output of a token stream you build yourself, which lets you choose the field the analyzer is invoked on (otherwise a “dummy” field name gets passed to the analyzer);
  • assertAnalyzesTo: doesn't let you specify the field on which you're testing, but it has a simpler syntax.
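
If you run your tests through Maven, one way to pass that -ea flag is the Surefire plugin's argLine parameter; a minimal sketch of the relevant plugin entry:

[code language="xml"]
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <configuration>
        <!-- enable JVM assertions, needed by the Lucene test framework -->
        <argLine>-ea</argLine>
    </configuration>
</plugin>
[/code]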

And here is an example of it all in action:

[code language="java"]
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.junit.Test;

public class KeywordAnalyzerTest extends BaseTokenStreamTestCase {

    @Test
    public void shouldNotAlterKeywordAnalyzed() throws IOException {
        Analyzer myKeywordAnalyzer = new KeywordAnalyzer();

        // check the token stream built on an explicit field name
        assertTokenStreamContents(
            myKeywordAnalyzer.tokenStream("my_keyword_field", new StringReader("ISO8859-1 and all that jazz")),
            new String[] { "ISO8859-1 and all that jazz" });

        // same check with a simpler syntax, on the implicit "dummy" field
        assertAnalyzesTo(myKeywordAnalyzer, "ISO8859-1 and all that jazz", new String[] {
            "ISO8859-1 and all that jazz" // a single token output, as expected from the KeywordAnalyzer
        });
    }
}
[/code]
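
And to come back to custom analyzers, the exact same assertions work on your own pipeline. Assuming the hypothetical MyPipelineAnalyzer sketched earlier, such a test could check that lowercasing, ASCII folding and stemming all took place:

[code language="java"]
@Test
public void shouldLowercaseFoldAndStem() throws IOException {
    Analyzer myPipelineAnalyzer = new MyPipelineAnalyzer();

    assertAnalyzesTo(myPipelineAnalyzer, "Éating crêpes", new String[] {
        "eat", "crepe" // lowercased, folded to ASCII and Porter-stemmed
    });
}
[/code]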

Hope it will help you make your search engines more reliable :),

Vale