Reproducible IR experiments with Apache Lucene

Introduction

This project analyzes the frequency distributions of query terms on the ClueWeb09B collection. The ClueWeb09B dataset consists of about 50 million English web pages collected in January and February 2009. The dataset was used by four Web Tracks (2009, 2010, 2011, and 2012) of the TREC conference.
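To make "frequency distribution of a query term" concrete, here is a minimal sketch under one simple reading of the phrase: for a given term, count how many documents contain it exactly k times. The class and method names are hypothetical, the documents are toy token lists rather than ClueWeb09B pages, and this is not the project's own code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermFrequencyDistribution {

    // For each within-document frequency k >= 1, counts the documents
    // in which `term` occurs exactly k times. Documents are token lists.
    public static Map<Integer, Integer> distribution(String term, List<List<String>> docs) {
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (List<String> doc : docs) {
            int tf = 0;
            for (String token : doc) {
                if (token.equals(term)) {
                    tf++;
                }
            }
            if (tf > 0) {
                histogram.merge(tf, 1, Integer::sum);
            }
        }
        return histogram;
    }

    public static void main(String[] args) {
        // Toy collection of four already-tokenized documents (illustrative only).
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "search", "lucene"),
                Arrays.asList("search", "engine"),
                Arrays.asList("lucene"),
                Arrays.asList("apache", "lucene", "lucene"));
        // One document has tf=1 and two documents have tf=2.
        System.out.println(distribution("lucene", docs)); // prints {1=1, 2=2}
    }
}
```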

This project uses a total of 200 queries (called topics in TREC jargon) from the TREC Web Tracks that ran from 2009 to 2012. The queries were created over four years: each year, TREC published 50 new queries together with their relevance judgments.

Apache Lucene/Solr is used as the retrieval platform. Stock Lucene/Solr ships with many ranking model implementations, including BM25, Language Models, Divergence from Randomness Models, and Information-Based Models. As explained in the write-up, the Flexible Ranking feature was added to Lucene during Google Summer of Code 2011.
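As an illustration of what one of these term-weighting models computes, the sketch below implements the textbook BM25 score of a single term in a single document, using the commonly cited defaults k1 = 1.2 and b = 0.75. The class and method names are hypothetical. It is a plain-Java approximation for intuition only and does not call Lucene: Lucene's own BM25 implementation additionally stores document lengths in a lossy encoded form, so its scores can differ slightly.

```java
public class Bm25Sketch {

    static final double K1 = 1.2;  // term-frequency saturation parameter
    static final double B = 0.75;  // document-length normalization parameter

    // idf = ln(1 + (N - n + 0.5) / (n + 0.5)),
    // where N is the number of documents and n the number containing the term.
    public static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // BM25 weight of one term in one document.
    public static double score(double tf, long docFreq, long docCount,
                               double docLen, double avgDocLen) {
        double norm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return idf(docCount, docFreq) * tf * (K1 + 1.0) / (tf + norm);
    }

    public static void main(String[] args) {
        // Same term, same collection statistics; only tf and document length vary.
        System.out.println(score(1, 5, 10, 100, 100)); // average-length doc, tf = 1
        System.out.println(score(3, 5, 10, 100, 100)); // higher tf gives a higher score
        System.out.println(score(3, 5, 10, 400, 100)); // longer doc gives a lower score
    }
}
```

With tf = 1 and a document of exactly average length, the term-frequency factor equals 1, so the score reduces to the idf alone.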

Tools

This project is a flexible framework for conducting retrieval experiments on the ClueWeb09-English corpus. The different term-weighting models provided by Lucene/Solr are compared on the 200 Web Track information needs.

Configuration parameters are supplied to the framework as a properties file. Its main input parameters include the location of the input documents; the other directories are created by the framework itself. Please see Standard Directory Layout.
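For readers unfamiliar with the format, the following minimal sketch shows how a Java program can read such a properties file with the standard java.util.Properties class. The key names docs.dir and work.dir and the paths are hypothetical placeholders, not the keys actually used in config.properties, and the class is not part of the framework.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class ConfigSketch {

    // Parses the text of a properties file (key=value lines, '#' comments).
    public static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical keys and paths, used only to illustrate the file format.
        String sample = "# location of the input documents\n"
                + "docs.dir=/data/ClueWeb09_English_1\n"
                + "work.dir=/data/work\n";
        Properties props = parse(sample);
        System.out.println(props.getProperty("docs.dir"));
        // A default value can be supplied for keys that are absent.
        System.out.println(props.getProperty("missing.key", "fallback"));
    }
}
```

In a real run the text would come from the file on disk, for example via Files.newBufferedReader, rather than from an inline string.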

The framework is distributed as a tar.gz file, which can be generated with the command mvn clean package dependency:copy-dependencies assembly:single. The tarball includes an executable script named run.sh and a config.properties file containing various parameters. When ./run.sh is invoked, simple usage information is displayed. Arguments are passed to it to run one of the tools.

Standard Directory Layout

This section documents the directory layout expected and used by this project. In general, each folder contains two top-level subfolders: KStem and KStemAnchor. These represent KStemming and AnchorText, respectively. In the folder naming convention, WT stands for Web Track, TT stands for Terabyte Track, and so on.

Dependencies

Perl: yum install perl
Bzip2: yum install bzip2
The Million Query evaluation tool statAP_MQ_eval_v4.pl requires the LWP::Simple module: yum install perl-CPAN and perl -MCPAN -e'install "LWP::Simple"'
Check where the LWP::Simple module is installed on your system and add a use lib line with that path, as shown below, just above the use LWP::Simple statement in the statAP_MQ_eval_v4.pl file.

use lib '/home/iorixxx/perl5/lib/perl5';
use LWP::Simple;

JDK 1.8 or above
Apache Maven 3.0.3 or above
Apache Lucene (Solr) 6.5.0

Author

Please feel free to contact Ahmet Arslan at [email protected] if you have any questions, comments or contributions.

Citation Policy

If you use this library for research purposes, please use the following citation:

@article{arslan2018selective,
  author = "Arslan, Ahmet and Din{\c{c}}er, Bekir Taner",
  title = "A selective approach to index term weighting for robust information retrieval based on the frequency distributions of query terms",
  journal = "Information Retrieval Journal",
  year = "2018",
  doi = "10.1007/s10791-018-9347-9",
  url = "https://link.springer.com/article/10.1007/s10791-018-9347-9"
}
