Java Lucene Scala

Beware of Lucene DAO when constructing queries

When using Lucene in Java or Scala, you may be tempted to skip the QueryParser and use the “DAO” (for lack of a better term) to construct queries using the classes provided. It is generally a best practice to use DAOs and such abstractions when available over raw query compilation for a variety of reasons, foremost being security (implicit injection protection) and query syntax integrity.

However, you may experience perplexing, incorrect result sets with your Lucene query if the following circumstances are true:

  • Your index is written with an analyzer other than the default StandardAnalyzer (e.g. EnglishAnalyzer or any of the plethora of others).
  • Your query is a boolean query with number of OR (aka SHOULD) clauses where ≥ 2.
  • Your query requires a minimum m number of boolean clauses should match where m ≥ n.

Ordinary query, incorrect results

Here is a simple example of a query that exhibits the latter two circumstances above as built entirely with the DAO (code examples henceforth using Scala for brevity):

This query in plain English means “find documents that contain at least 2 of the terms thanksobama, and barack.”

Now imagine an index of documents as follows written with EnglishAnalyzer (or some other non-StandardAnalyzer):

Running the query on the above documents should yield 2 hits—documents #1 and #2. However, you will receive 0 results in Lucene 3.5.0 (and possibly other versions; did not check).

Unfortunately for me, I was stuck with Lucene 3.5.0 in this particular codebase. Luckily I found a way to sidestep the bug by avoiding the DAO for at least part of the query construction.

Same query, but without DAO (and working now!)

Surprise, surprise, this works! Documents #1 and #2 from before will match as expected.

Note on protecting against query injection

If you must use QueryParser.parse  as in the case above, you should also make it a habit to use  QueryParser.escape (a static method) on the string you pass to the parse method (e.g. myQueryParser.parse(QueryParser.escape("potentially dangerous user input")) ). The reasons are beyond the scope of this post; just Google “query injection” and pick one of the endless writings on that.