Upgrade to Lucene 8 #586
Hi @jpountz, thanks for your contributions! For a complete upgrade, we'll need to run all regression tests to update the effectiveness scores... but this will be a great help in getting us started. Do you know if Lucene 8 can work with existing Lucene 7 indexes? Or will we need to index everything from scratch again?
Yes, Lucene 8 can read indices created by Lucene 7.
@lintool Please let me know if there is any way that I can help.
Hi @jpountz - Thanks so much for your contributions! We have a bunch of regression tests on various test collections that need to be updated. I've pulled your branch and started working on that: https://github.com/castorini/Anserini/tree/lucene8 Once I fix those, I'll issue PRs against the Students have a bunch of papers under review dependent on the current In the meantime, we can continue developing on the Does that sound okay as a plan? |
Sure, anything that works for you works for me too.
@jpountz Hey, who implemented Block Max WAND? Can you point me to a JIRA issue? Are there any benchmarks for comparison? If not, I might be able to whip something up... Anserini is set up to do something like that fairly easily...
@lintool This is something I worked on, with help and pointers from @jimczi, @rmuir and Stefan Pohl. This blog post gives some history if you are interested https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand. The main issues are https://issues.apache.org/jira/browse/LUCENE-4100 (labeled MAXSCORE, but we eventually implemented WAND), https://issues.apache.org/jira/browse/LUCENE-4198 (make Lucene able to index impacts), and https://issues.apache.org/jira/browse/LUCENE-8135 (implement BMW). The only benchmarks we have for now are on a Wikipedia index, you can see annotation
Unfortunately it doesn't make all queries faster. For instance, we also have nightly benchmarks for disjunctions within conjunctions, but they are not consistently faster: it depends on the document frequencies of the terms involved (http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html). If you can run something, I would be very interested in the results.
Okay, I'll throw this on my stack to look at. BTW, does Torsten know about this? Might be of interest to you: https://dl.acm.org/citation.cfm?id=3018726
Yes, I have sent a couple of updates about this work to Torsten and Shuai. Let me read that paper. :)
Hey, can you drop me a line offline? I have a few ideas, but not for public consumption :)
Done.
We've established |
I just focused on making tests pass for now. I made two changes that might be
controversial:
created because you didn't know that Lucene had an Axiomatic similarity. I can
easily undo this part of the change.
8 (more on this below) require that scores are non-decreasing when the term freq
increases or when the field length decreases, which was not possible with model P.
So I switched to I(n)L2 instead in SearchArgs.
The main release highlight of Lucene 8 is that it optimized query execution for the
case that users only care about top hits, not hit counts. It does so by indexing scoring
impacts alongside skip data and implementing block-max WAND (S. Ding, T. Suel,
Faster top-k document retrieval using block-max indexes, in: SIGIR, 2011). This is
expected to make retrieval more efficient.
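
To make the idea concrete, here is a toy sketch of block-max skipping for a single postings list. It is not Lucene code and not WAND itself (WAND also combines bounds across several terms). It only shows why storing a maximum score per block lets a top-k search skip whole blocks, and why the exact hit count is then unknown. The class name BlockMaxSketch, the Result type, and the use of raw scores in place of indexed impacts are assumptions made for illustration.

```java
import java.util.PriorityQueue;

// Hypothetical illustration only; none of these names exist in Lucene.
class BlockMaxSketch {

    static final class Result {
        final float[] topScores;   // best scores, highest first
        final int postingsScored;  // postings actually visited
        Result(float[] topScores, int postingsScored) {
            this.topScores = topScores;
            this.postingsScored = postingsScored;
        }
    }

    // blocks[b] holds the scores of the postings in block b.
    static Result topK(float[][] blocks, int k) {
        if (k <= 0) {
            return new Result(new float[0], 0);
        }
        // "Index time": record the maximum score of every block.
        float[] blockMax = new float[blocks.length];
        for (int b = 0; b < blocks.length; b++) {
            float max = Float.NEGATIVE_INFINITY;
            for (float s : blocks[b]) {
                max = Math.max(max, s);
            }
            blockMax[b] = max;
        }
        // "Query time": min-heap holding the k best scores seen so far.
        PriorityQueue<Float> heap = new PriorityQueue<>();
        int scored = 0;
        for (int b = 0; b < blocks.length; b++) {
            // Skip a block that cannot beat the current k-th best score.
            if (heap.size() == k && blockMax[b] <= heap.peek()) {
                continue;
            }
            for (float s : blocks[b]) {
                scored++;
                if (heap.size() < k) {
                    heap.add(s);
                } else if (s > heap.peek()) {
                    heap.poll();
                    heap.add(s);
                }
            }
        }
        // Drain the min-heap from the back so the result is descending.
        float[] top = new float[heap.size()];
        for (int i = top.length - 1; i >= 0; i--) {
            top[i] = heap.poll();
        }
        return new Result(top, scored);
    }

    public static void main(String[] args) {
        float[][] blocks = {
            {1f, 5f, 2f}, {0.5f, 0.1f, 0.2f}, {9f, 3f, 1f}, {2f, 2f, 2f}
        };
        Result r = topK(blocks, 2);
        System.out.println("top scores: " + r.topScores[0] + ", " + r.topScores[1]);
        System.out.println("postings scored: " + r.postingsScored + " of 12");
    }
}
```

In this toy run the second and fourth blocks are never visited, so the top hits are exact while the number of matching postings is only known as a lower bound.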
Another change that might be interesting to this project is the new FeatureField,
which allows static features to be integrated into the score easily and efficiently, as it is
well integrated with Lucene's block-max WAND support.
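
As a rough illustration of how a static feature can enter the score, the sketch below implements a saturation function of the form weight * S / (S + pivot), which is my understanding of what FeatureField.newSaturationQuery computes. The code does not call Lucene. The class SaturationSketch, the method names, and the example values are hypothetical, and adding the feature score to a text score is an assumption about how the two would typically be combined.

```java
// Hypothetical illustration only; this is not the Lucene implementation.
class SaturationSketch {

    // Bounded, non-decreasing function of the feature value:
    // 0 at featureValue = 0, weight / 2 at featureValue = pivot,
    // and approaching weight as featureValue grows.
    static float saturation(float weight, float pivot, float featureValue) {
        if (pivot <= 0f || featureValue < 0f) {
            throw new IllegalArgumentException("pivot must be > 0 and featureValue >= 0");
        }
        return weight * featureValue / (featureValue + pivot);
    }

    // Assumed combination: the text score plus the feature contribution.
    static float combined(float textScore, float weight, float pivot, float featureValue) {
        return textScore + saturation(weight, pivot, featureValue);
    }

    public static void main(String[] args) {
        // Example values are made up: a static feature of 10 with pivot 10.
        System.out.println(saturation(1f, 10f, 10f));
        System.out.println(combined(2f, 1f, 10f, 10f));
    }
}
```

Because the function is non-decreasing in the feature value, the largest feature value in a block also gives the largest possible contribution for that block, which is the kind of upper bound block-max skipping relies on.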