Upgrade to Lucene 8 #586

jpountz · 2019-03-14T08:09:05Z

I just focused on making tests pass for now. I made two changes that might be
controversial:

I replaced the Axiomatic similarity with Lucene's, assuming that it had been
created because you didn't know that Lucene had an Axiomatic similarity. I can
easily undo this part of the change.

I had to replace DFR's PL2 with another similarity: new optimizations in Lucene
8 (more on this below) require that scores do not decrease when the term freq
increases or when the field length decreases, which was not possible with model P.
So I switched to I(n)L2 instead in SearchArgs.
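
The monotonicity requirement can be illustrated with a small standalone check. This is only a sketch and not Lucene code: the class name, the grid-based test, and the simplified BM25-style formula with its constants are all assumptions made for illustration.

```java
import java.util.function.DoubleBinaryOperator;

public class MonotonicityCheck {
    // Toy BM25-style term score. k1, b and avgLen are illustrative constants (assumptions).
    static double bm25(double freq, double len) {
        double k1 = 1.2, b = 0.75, avgLen = 100.0;
        return freq * (k1 + 1) / (freq + k1 * (1 - b + b * len / avgLen));
    }

    // Returns true if, on the tested grid, the score never decreases when freq grows
    // and never increases when the field length grows.
    static boolean isMonotonic(DoubleBinaryOperator score, int maxFreq, int maxLen) {
        for (int f = 1; f <= maxFreq; f++) {
            for (int l = 1; l <= maxLen; l++) {
                double s = score.applyAsDouble(f, l);
                if (f < maxFreq && score.applyAsDouble(f + 1, l) < s) return false;
                if (l < maxLen && score.applyAsDouble(f, l + 1) > s) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The toy BM25 formula satisfies the constraint on this grid.
        System.out.println(isMonotonic(MonotonicityCheck::bm25, 50, 500));
        // A made-up score that drops as freq grows violates it.
        System.out.println(isMonotonic((f, l) -> -f, 5, 5));
    }
}
```

A check like this only samples a finite grid, so it can reveal a violation but cannot prove that a similarity is monotonic everywhere.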

The main release highlight of Lucene 8 is that it optimizes query execution for the
case where users only care about top hits, not hit counts. It does so by indexing scoring
impacts alongside skip data and implementing block-max WAND (S. Ding, T. Suel,
Faster top-k document retrieval using block-max indexes, in: SIGIR, 2011). This is
expected to make retrieval more efficient.
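
To make the idea concrete, here is a minimal sketch of block-level skipping for a single term. It is a toy and not Lucene's implementation: the class, the in-memory layout of scores, and the method names are hypothetical, and real block-max WAND additionally combines per-term upper bounds across several query clauses.

```java
import java.util.PriorityQueue;

public class BlockMaxSketch {
    // Computes the maximum score of each block; a real index would store this at index time.
    static double[] blockMaxima(double[][] blocks) {
        double[] max = new double[blocks.length];
        for (int i = 0; i < blocks.length; i++) {
            for (double s : blocks[i]) max[i] = Math.max(max[i], s);
        }
        return max;
    }

    // Collects the k best scores into topK (a min-heap) and returns how many postings were scored.
    static int collectTopK(double[][] blocks, double[] blockMax, int k, PriorityQueue<Double> topK) {
        int scored = 0;
        for (int i = 0; i < blocks.length; i++) {
            // Once k hits are collected, a block whose best possible score cannot beat
            // the current k-th best hit is skipped without scoring any of its postings.
            if (topK.size() == k && blockMax[i] <= topK.peek()) continue;
            for (double s : blocks[i]) {
                scored++;
                if (topK.size() < k) {
                    topK.add(s);
                } else if (s > topK.peek()) {
                    topK.poll();
                    topK.add(s);
                }
            }
        }
        return scored;
    }

    public static void main(String[] args) {
        double[][] blocks = {{3.0, 1.0}, {0.5, 0.2}, {2.5, 4.0}, {0.1, 0.3}};
        PriorityQueue<Double> topK = new PriorityQueue<>();
        int scored = collectTopK(blocks, blockMaxima(blocks), 2, topK);
        System.out.println("scored " + scored + " of 8 postings, top hits: " + topK);
    }
}
```

Because skipped postings are never visited, the total number of matching documents is no longer known exactly, which is why this kind of optimization only applies when hit counts are not needed.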

Another change that might be interesting to this project is the new FeatureField,
which makes it easy to integrate static features into the score. It is also efficient,
as it is well integrated with Lucene's block-max WAND support.

lintool · 2019-03-15T00:22:35Z

Hi @jpountz thanks for your contributions!

For a complete upgrade, we'll need to run all regression tests to update the effectiveness scores... but this will be a great help in getting us started.

Do you know if Lucene8 can work with existing Lucene7 indexes? Or will we need to index everything from scratch again?

jpountz · 2019-03-15T07:45:13Z

Lucene 8 can read indices created by Lucene 7 indeed.

jpountz · 2019-03-26T13:15:23Z

@lintool Please let me know if there is any way that I can help.

lintool · 2019-04-03T20:45:17Z

Hi @jpountz - Thanks so much for your contributions!

We have a bunch of regression tests on various test collections that need to be updated. I've pulled your branch and started working on that:

https://github.com/castorini/Anserini/tree/lucene8

Once I fix those, I'll issue PRs against the lucene8 branch. When stable, we'll merge back into master.

Students have a bunch of papers under review dependent on the current master for repeatability, etc. We'll need to find a good time to merge (e.g., between paper deadlines) so we don't yank the rug from underneath them...

In the meantime, we can continue developing on the lucene8 branch.

Does that sound okay as a plan?

jpountz · 2019-04-04T07:45:04Z

Sure, anything that works for you works for me too.

lintool · 2019-04-05T12:54:33Z

@jpountz Hey, who implemented Block Max WAND? Can you point me to a JIRA issue? Are there any benchmarks for comparison? If no, I might be able to whip something up... Anserini is set up to do something like that fairly easily...

jpountz · 2019-04-05T13:57:31Z

@lintool This is something I worked on, with help and pointers from @jimczi, @rmuir and Stefan Pohl. This blog post gives some history if you are interested https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand. The main issues are https://issues.apache.org/jira/browse/LUCENE-4100 (labeled MAXSCORE, but we eventually implemented WAND), https://issues.apache.org/jira/browse/LUCENE-4198 (make Lucene able to index impacts), and https://issues.apache.org/jira/browse/LUCENE-8135 (implement BMW).

The only benchmarks we have for now are on a Wikipedia index. See annotation CJ on the following charts:

Term queries (leveraging indexed impacts) http://people.apache.org/~mikemccand/lucenebench/Term.html
Disjunction of frequent terms (leveraging block-max WAND) http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html
Disjunction of a frequent term with a rarer term http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html
Conjunction of two frequent terms (using block-max AND, a variant of block-max WAND which is documented at the end of the BMW paper): http://people.apache.org/~mikemccand/lucenebench/AndHighHigh.html
Conjunction of a frequent term with a rarer term (using block-max AND): http://people.apache.org/~mikemccand/lucenebench/AndHighMed.html

Unfortunately it doesn't make all queries faster. For instance, we also have nightly benchmarks for disjunctions within conjunctions, but they are not consistently faster: the outcome depends on the document frequencies of the terms involved (http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html).

If you can run something, I would be very interested in the results.

lintool · 2019-04-05T14:03:27Z

Okay, I'll throw this on my stack to look at. BTW, does Torsten know about this?

Might be of interest to you: https://dl.acm.org/citation.cfm?id=3018726
(If you're paywall'ed let me know...)

jpountz · 2019-04-05T14:17:50Z

Yes, I have sent a couple updates about this work to Torsten and Shuai. Let me read that paper. :)

lintool · 2019-04-05T14:20:04Z

Hey, can you drop me a line offline? I have a few ideas, but not for public consumption :)

jpountz · 2019-04-05T14:42:10Z

Done.

lintool · 2019-04-25T16:19:32Z

We've established lucene8 as the working branch for moving Anserini to Lucene8. Closing this PR for now, as we'll merge in the lucene8 branch at some opportune future moment in time.

jpountz added 2 commits March 14, 2019 08:48

Upgrade to Lucene 8.

2a1585f

Use Lucene's Axiomatic similarity.

3b44a7e

lintool closed this Apr 25, 2019

This was referenced Jun 8, 2019

Upgrade to Lucene8 #679

Closed

Implement additional relevance feedback models #657

Closed

crystina-z pushed a commit to crystina-z/anserini that referenced this pull request Oct 28, 2022

Add documentation of storage on tuna/orca (castorini#586)

bb31950
