Upgrade to Lucene 8 #586

Closed
wants to merge 2 commits into from

jpountz
Contributor

@jpountz jpountz commented Mar 14, 2019

I just focused on making tests pass for now. I made two changes that might be
controversial:

  • I replaced the Axiomatic similarity with Lucene's, assuming that it had been
    created because you didn't know that Lucene had an Axiomatic similarity. I can
    easily undo this part of the change.
  • I had to replace DFR's PL2 with another similarity: new optimizations in Lucene
    8 (more on this below) require that scores are non-decreasing when the term freq
    increases or when the field length decreases, which was not possible with model P.
    So I switched to I(n)L2 instead in SearchArgs.
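
A minimal sketch of the monotonicity requirement in the second bullet: it probes a scoring function on a small grid and reports whether the score ever drops when the term frequency goes up, or ever rises when the field gets longer. Everything here is illustrative. `bm25_like` is a simplified BM25-style weight with made-up parameters, and `toy_peaked` is a hypothetical scorer built to violate the property; neither is Lucene's code nor the actual PL2 formula.

```python
import math
from typing import Callable, Sequence

# (term_freq, field_length) -> score
Scorer = Callable[[int, int], float]


def bm25_like(tf: int, length: int) -> float:
    """BM25-style term weight with IDF omitted; k1, b, avg_len are made-up values."""
    k1, b, avg_len = 0.9, 0.4, 100.0
    norm = k1 * (1.0 - b + b * length / avg_len)
    return tf * (k1 + 1.0) / (tf + norm)


def toy_peaked(tf: int, length: int) -> float:
    """Hypothetical scorer whose value peaks at tf=5 and then declines."""
    return tf * math.exp(-tf / 5.0) / math.log(2.0 + length)


def satisfies_monotonicity(
    score: Scorer,
    max_tf: int = 50,
    lengths: Sequence[int] = (1, 10, 100, 1000),
) -> bool:
    """Check the two properties on a finite grid (a probe, not a proof)."""
    # Scores must not decrease when the term frequency increases.
    for length in lengths:
        for tf in range(1, max_tf):
            if score(tf + 1, length) < score(tf, length):
                return False
    # Scores must not increase when the field gets longer.
    for tf in range(1, max_tf + 1):
        for shorter, longer in zip(lengths, lengths[1:]):
            if score(tf, shorter) < score(tf, longer):
                return False
    return True


print(satisfies_monotonicity(bm25_like))   # passes on this grid
print(satisfies_monotonicity(toy_peaked))  # fails once tf goes past 5
```

A grid probe like this can only find counterexamples; it cannot show that a similarity is monotonic everywhere.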

The main release highlight of Lucene 8 is that it optimizes query execution for the
case where users only care about top hits, not hit counts. It does so by indexing scoring
impacts alongside skip data and implementing block-max WAND (S. Ding, T. Suel,
Faster top-k document retrieval using block-max indexes, in: SIGIR, 2011). This is
expected to make retrieval more efficient.
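
To make the block-max idea concrete, here is a much simplified sketch, not what Lucene implements. It leaves out WAND pivoting entirely and keeps only the upper-bound skip: doc IDs are cut into fixed ranges, each term stores the maximum score it reaches inside each range, and a range is skipped when the sum of those maxima cannot beat the current k-th best hit. The postings layout (term to `{doc_id: score}` with precomputed positive scores), `BLOCK_SIZE`, and all function names are assumptions made for illustration.

```python
import heapq
from collections import defaultdict

BLOCK_SIZE = 4  # hypothetical number of consecutive doc IDs per block


def build_block_max(postings):
    """For each term, map block number -> highest score inside that block."""
    block_max = {}
    for term, docs in postings.items():
        maxes = defaultdict(float)
        for doc_id, score in docs.items():
            block = doc_id // BLOCK_SIZE
            maxes[block] = max(maxes[block], score)
        block_max[term] = dict(maxes)
    return block_max


def top_k_exhaustive(postings, query, k):
    """Baseline: score every matching document, then sort."""
    totals = defaultdict(float)
    for term in query:
        for doc_id, score in postings.get(term, {}).items():
            totals[doc_id] += score
    return sorted(totals.items(), key=lambda hit: (-hit[1], hit[0]))[:k]


def top_k_block_max(postings, block_max, query, k):
    """Return (hits, number of documents actually scored)."""
    blocks = sorted({b for t in query for b in block_max.get(t, {})})
    heap = []  # min-heap of (score, -doc_id); heap[0] is the weakest kept hit
    scored = 0
    for block in blocks:
        upper = sum(block_max.get(t, {}).get(block, 0.0) for t in query)
        if len(heap) == k and upper <= heap[0][0]:
            continue  # nothing in this block can enter the top k
        for doc_id in range(block * BLOCK_SIZE, (block + 1) * BLOCK_SIZE):
            score = sum(postings.get(t, {}).get(doc_id, 0.0) for t in query)
            if score == 0.0:
                continue  # document matches no query term
            scored += 1
            entry = (score, -doc_id)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heapreplace(heap, entry)
    hits = sorted(((-neg_id, s) for s, neg_id in heap),
                  key=lambda hit: (-hit[1], hit[0]))
    return hits, scored


# Made-up postings: term -> {doc_id: precomputed score}
postings = {
    "lucene": {0: 1.0, 1: 3.0, 5: 0.5, 9: 0.4, 14: 2.5},
    "wand": {1: 2.0, 2: 0.2, 6: 0.3, 10: 0.1, 14: 0.2},
}
query = ["lucene", "wand"]
block_max = build_block_max(postings)
hits, scored = top_k_block_max(postings, block_max, query, k=2)
print(hits, scored)
```

On this toy data the two middle blocks are skipped, so only half of the matching documents are scored while the top 2 stay identical to the exhaustive result. Ties are broken in favour of the lower doc ID in both functions, which is why skipping on `upper <= threshold` is safe here.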

Another change that might be interesting to this project is the new FeatureField,
which makes it easy and efficient to integrate static features into the score, as it is
well integrated with Lucene's block-max WAND support.
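
As a rough illustration of how a static feature can be folded into a score, the sketch below uses a saturation function, one of the function shapes FeatureField offers. It is plain Python showing the formula only, not Lucene's API; the function names, the PageRank-like feature values, and the pivot and weight are all hypothetical.

```python
def saturation(feature_value: float, pivot: float, weight: float = 1.0) -> float:
    """weight * S / (S + pivot): grows with S but always stays below weight."""
    if feature_value < 0 or pivot <= 0:
        raise ValueError("feature_value must be >= 0 and pivot must be > 0")
    return weight * feature_value / (feature_value + pivot)


def combined_score(text_score: float, feature_value: float,
                   pivot: float, weight: float) -> float:
    """Add the bounded feature contribution to a text relevance score."""
    return text_score + saturation(feature_value, pivot, weight)


# Two documents with the same (made-up) text score but different static features.
low = combined_score(text_score=2.0, feature_value=1.0, pivot=10.0, weight=3.0)
high = combined_score(text_score=2.0, feature_value=90.0, pivot=10.0, weight=3.0)
print(low, high)
```

Because the contribution never decreases as the feature value grows and never exceeds `weight`, a maximum possible contribution is always known, which is the kind of upper bound that block-max style pruning relies on.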

@lintool
Member

lintool commented Mar 15, 2019

Hi @jpountz thanks for your contributions!

For a complete upgrade, we'll need to run all regression tests to update the effectiveness scores... but this will be a great help in getting us started.

Do you know if Lucene8 can work with existing Lucene7 indexes? Or will we need to index everything from scratch again?

@jpountz
Contributor Author

jpountz commented Mar 15, 2019

Lucene 8 can read indices created by Lucene 7 indeed.

@jpountz
Contributor Author

jpountz commented Mar 26, 2019

@lintool Please let me know if there is any way that I can help.

@lintool
Member

lintool commented Apr 3, 2019

Hi @jpountz - Thanks so much for your contributions!

We have a bunch of regression tests on various test collections that need to be updated. I've pulled your branch and started working on that:

https://github.com/castorini/Anserini/tree/lucene8

Once I fix those, I'll issue PRs against the lucene8 branch. When stable, we'll merge back into master.

Students have a bunch of papers under review dependent on the current master for repeatability, etc. We'll need to find a good time to merge (e.g., between paper deadlines) so we don't yank the rug from underneath them...

In the meantime, we can continue developing on the lucene8 branch.

Does that sound okay as a plan?

@jpountz
Contributor Author

jpountz commented Apr 4, 2019

Sure, anything that works for you works for me too.

@lintool
Member

lintool commented Apr 5, 2019

@jpountz Hey, who implemented Block Max WAND? Can you point me to a JIRA issue? Are there any benchmarks for comparison? If not, I might be able to whip something up... Anserini is set up to do something like that fairly easily...

@jpountz
Contributor Author

jpountz commented Apr 5, 2019

@lintool This is something I worked on, with help and pointers from @jimczi, @rmuir and Stefan Pohl. This blog post gives some history if you are interested https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand. The main issues are https://issues.apache.org/jira/browse/LUCENE-4100 (labeled MAXSCORE, but we eventually implemented WAND), https://issues.apache.org/jira/browse/LUCENE-4198 (make Lucene able to index impacts), and https://issues.apache.org/jira/browse/LUCENE-8135 (implement BMW).

The only benchmarks we have for now are on a Wikipedia index; you can see annotation CJ on the following charts:

Unfortunately it doesn't make all queries faster. For instance, we also have nightly benchmarks for disjunctions within conjunctions, but they are not consistently faster: it depends on the document frequencies of the involved terms (http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html).

If you can run something, I would be very interested in the results.

@lintool
Member

lintool commented Apr 5, 2019

Okay, I'll throw this on my stack to look at. BTW, does Torsten know about this?

Might be of interest to you: https://dl.acm.org/citation.cfm?id=3018726
(If you're paywall'ed let me know...)

@jpountz
Contributor Author

jpountz commented Apr 5, 2019

Yes, I have sent a couple updates about this work to Torsten and Shuai. Let me read that paper. :)

@lintool
Member

lintool commented Apr 5, 2019

Hey, can you drop me a line offline? I have a few ideas, but not for public consumption :)

@jpountz
Contributor Author

jpountz commented Apr 5, 2019

Done.

@lintool
Member

lintool commented Apr 25, 2019

We've established lucene8 as the working branch for moving Anserini to Lucene8. Closing this PR for now, as we'll merge in the lucene8 branch at some opportune future moment in time.

@lintool lintool closed this Apr 25, 2019
crystina-z pushed a commit to crystina-z/anserini that referenced this pull request Oct 28, 2022
2 participants