When using Lucene in Java or Scala, you may be tempted to skip the QueryParser and use the “DAO” (for lack of a better term) to construct queries using the classes provided. It is generally a best practice to use DAOs and such abstractions when available over raw query compilation for a variety of reasons, foremost being security (implicit injection protection) and query syntax integrity.
However, you may experience perplexing, incorrect result sets with your Lucene query if the following circumstances are true:
- Your index is written with an analyzer other than the default StandardAnalyzer (e.g. EnglishAnalyzer or any of the plethora of others).
- Your query is a boolean query with n OR (aka SHOULD) clauses, where n ≥ 2.
- Your query requires that at least m of those clauses match, where 2 ≤ m ≤ n.
Ordinary query, incorrect results
Here is a simple example of a query that exhibits the latter two circumstances above as built entirely with the DAO (code examples henceforth using Scala for brevity):
    import org.apache.lucene.index.Term
    import org.apache.lucene.search.{BooleanClause, BooleanQuery, TermQuery}

    // Boolean query with OR clauses
    val q = new BooleanQuery
    q.add(new TermQuery(new Term("articleTitle", "thanks")), BooleanClause.Occur.SHOULD)
    q.add(new TermQuery(new Term("articleTitle", "obama")), BooleanClause.Occur.SHOULD)
    q.add(new TermQuery(new Term("articleTitle", "barack")), BooleanClause.Occur.SHOULD)

    // Match at least 2 of the clauses
    q.setMinimumNumberShouldMatch(2)
This query in plain English means “find documents that contain at least 2 of the terms thanks, obama, and barack.”
Now imagine an index of documents as follows written with EnglishAnalyzer (or some other non-StandardAnalyzer):
    doc  articleTitle
    =================
    1    president jails congress; thanks, obama!
    2    obama thanks al qaeda for joining in the fight against isis
    3    obama to produce presidential library consisting entirely of ebooks
Running the query on the above documents should yield 2 hits: documents #1 and #2. However, you will receive 0 results in Lucene 3.5.0 (and possibly in other versions; I did not check).
Unfortunately for me, I was stuck with Lucene 3.5.0 in this particular codebase. Luckily I found a way to sidestep the bug by avoiding the DAO for at least part of the query construction.
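A likely explanation for the empty result: TermQuery looks up its term exactly as given, while an analyzer such as EnglishAnalyzer stems tokens at index time, so "thanks" is stored as "thank" and the raw term never matches. The toy model below is a rough sketch of that mismatch. It is not Lucene code; `toyStem`, `indexTerms`, and `search` are hypothetical names, and the "stemmer" is a crude stand-in for real stemming.

```scala
// Toy model only: this is NOT Lucene code. `toyStem` is a hypothetical stand-in
// for the stemming that an analyzer such as EnglishAnalyzer performs at index time.
object MinShouldMatchSketch {
  // Crude "stemmer": lower-case, strip punctuation, drop one trailing "s".
  def toyStem(token: String): String = {
    val cleaned = token.toLowerCase.filter(_.isLetterOrDigit)
    if (cleaned.length > 3 && cleaned.endsWith("s")) cleaned.dropRight(1) else cleaned
  }

  // Index side: every title is tokenized and stemmed before it is stored.
  def indexTerms(title: String): Set[String] =
    title.split("\\s+").map(toyStem).filter(_.nonEmpty).toSet

  val docs: Map[Int, Set[String]] = Map(
    1 -> indexTerms("president jails congress; thanks, obama!"),
    2 -> indexTerms("obama thanks al qaeda for joining in the fight against isis"),
    3 -> indexTerms("obama to produce presidential library consisting entirely of ebooks")
  )

  // Ids of documents whose indexed terms contain at least `min` of the query terms.
  def search(queryTerms: Seq[String], min: Int): Set[Int] =
    docs.collect { case (id, terms) if queryTerms.count(terms.contains) >= min => id }.toSet

  val rawTerms = Seq("thanks", "obama", "barack") // like TermQuery: used verbatim
  val analyzedTerms = rawTerms.map(toyStem)       // like QueryParser: analyzed first
}
```

In this model, `search(rawTerms, 2)` finds nothing because only "obama" exists in the index in its raw form, while `search(analyzedTerms, 2)` finds documents 1 and 2.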
Same query, but without DAO (and working now!)
    import org.apache.lucene.analysis.en.EnglishAnalyzer
    import org.apache.lucene.queryParser.QueryParser
    import org.apache.lucene.search.{BooleanClause, BooleanQuery}
    import org.apache.lucene.util.Version

    // Use the same analyzer that was used to write the index
    val analyzer = new EnglishAnalyzer(Version.LUCENE_35)
    val qp = new QueryParser(Version.LUCENE_35, "articleTitle", analyzer)

    val q = new BooleanQuery
    q.add(qp.parse("thanks"), BooleanClause.Occur.SHOULD)
    q.add(qp.parse("obama"), BooleanClause.Occur.SHOULD)
    q.add(qp.parse("barack"), BooleanClause.Occur.SHOULD)

    // Match at least 2 of the terms
    q.setMinimumNumberShouldMatch(2)
Surprise, surprise, this works! Documents #1 and #2 from before will match as expected.
Note on protecting against query injection
If you must use QueryParser.parse as in the case above, you should also make it a habit to use QueryParser.escape (a static method) on the string you pass to the parse method (e.g. myQueryParser.parse(QueryParser.escape("potentially dangerous user input")) ). The reasons are beyond the scope of this post; just Google “query injection” and pick one of the endless writings on that.
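Putting the escaping advice together with the query from earlier might look roughly like the sketch below. `SafeShouldQuery` and `build` are hypothetical names of my own, not part of Lucene; the sketch assumes Lucene 3.5 on the classpath and that each element of `userTerms` is a single word. Input validation (for example, empty strings) is left out.

```scala
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.en.EnglishAnalyzer
import org.apache.lucene.queryParser.QueryParser
import org.apache.lucene.search.{BooleanClause, BooleanQuery}
import org.apache.lucene.util.Version

object SafeShouldQuery {
  // Hypothetical helper (not part of Lucene): adds one escaped SHOULD clause
  // per user-supplied term and requires at least `minMatch` of them to match.
  def build(field: String, userTerms: Seq[String], minMatch: Int, analyzer: Analyzer): BooleanQuery = {
    val qp = new QueryParser(Version.LUCENE_35, field, analyzer)
    val q = new BooleanQuery
    userTerms.foreach { term =>
      // escape backslash-escapes query syntax characters such as ':' or '*',
      // so the parser treats them as literal text rather than operators
      q.add(qp.parse(QueryParser.escape(term)), BooleanClause.Occur.SHOULD)
    }
    q.setMinimumNumberShouldMatch(minMatch)
    q
  }
}
```

A call such as `SafeShouldQuery.build("articleTitle", Seq("thanks", "obama", "barack"), 2, new EnglishAnalyzer(Version.LUCENE_35))` should produce a query equivalent to the hand-built one above, with each term escaped before it reaches the parser.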