Commit
Added document for integration with MatchZoo (#587)
1 parent 3a60106, commit 57ff7e8
Showing 1 changed file with 65 additions and 0 deletions.
@@ -0,0 +1,65 @@
# Neural Information Retrieval with MatchZoo

This document describes the integration between Anserini and MatchZoo. Currently, we support two datasets: Microblog and Robust04.

## Retrieval + Rerank Pipeline

### Index Construction

**Robust04**:

```
target/appassembler/bin/IndexCollection -collection TrecCollection \
-generator JsoupGenerator -threads 16 -input /path/to/robust04 \
-index lucene-index.robust04.pos+docvectors+rawdocs \
-storePositions -storeDocvectors -storeRawDocs >& log.robust04.pos+docvectors+rawdocs
```

### Prepare Data for MatchZoo

**Initial Retrieval and Export Data for Neural IR Models**:

```bash
python src/main/python/rerank/scripts/export_robust04_dataset.py
```

**Clone MatchZoo**:

```bash
cd src/main/python/rerank/
git clone git@github.com:Victor0118/MatchZoo.git
# Enter the cloned repository before switching to the rerank branch
cd MatchZoo
git checkout rerank
```

**Prepare Word Vectors**:

1. Download the embedding from https://github.com/mmihaltz/word2vec-GoogleNews-vectors
2. Convert the embedding from the binary word2vec format into the text-based glove format and put it in `src/main/python/rerank/MatchZoo/data/robust04`

```python
from gensim.models.keyedvectors import KeyedVectors

# Load the binary word2vec vectors, then write them back out as plain text
model = KeyedVectors.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)
```
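
One detail worth checking: gensim's `save_word2vec_format` writes a `<vocab_size> <dimensions>` header as the first line of the text file, whereas GloVe-style text files have no header. Whether `gen_w2v.py` needs the header removed is an assumption here, not something this document confirms. If it does, a minimal sketch like the following could drop it. The function name `strip_word2vec_header` and its arguments are hypothetical; the later `gen_w2v.py` command reads `glove.GoogleNews-vectors-negative300.txt`, so the output could be written under that name.

```python
def strip_word2vec_header(src_path, dst_path):
    """Copy a word2vec text file, dropping its '<vocab_size> <dim>' header line.

    Hypothetical helper, not part of Anserini, MatchZoo, or gensim.
    Returns the number of vector lines written.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        header = src.readline().split()
        # A word2vec text header consists of exactly two integers.
        if len(header) != 2 or not all(tok.isdigit() for tok in header):
            raise ValueError("first line does not look like a word2vec header")
        count = 0
        for line in src:
            dst.write(line)
            count += 1
    return count
```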

**Move and Process Data**:

```bash
cd data/robust04
python prepare_mz_data.py --data_path /path/to/data --train_file /data_path/train_file --dev_file /data_path/dev_file --test_file /data_path/test_file
python gen_w2v.py glove.GoogleNews-vectors-negative300.txt word_dict.txt embed_glove_d300
# Keep columns 1 and 4 of word_stats.txt
cat word_stats.txt | cut -d ' ' -f 1,4 > embed.idf
python gen_hist4drmm.py 20 # histogram bin size
cd ../..
```
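
For readers unfamiliar with the `gen_hist4drmm.py 20` step, the sketch below illustrates the general idea of a DRMM-style matching histogram: the cosine similarities between one query term and every document term are bucketed into a fixed number of bins over [-1, 1], and the counts are log-scaled. This is an illustration under assumptions, not the MatchZoo script itself. The function names are hypothetical, and reading the argument `20` as the number of bins is a guess.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def matching_histogram(query_vec, doc_vecs, num_bins=20):
    """Log-count histogram of similarities between one query term and all document terms."""
    counts = [0] * num_bins
    for doc_vec in doc_vecs:
        # Clamp to guard against floating-point drift outside [-1, 1].
        sim = max(-1.0, min(1.0, cosine(query_vec, doc_vec)))
        # Map [-1, 1] onto bin indices 0 .. num_bins - 1; sim == 1.0 lands in the last bin.
        index = int((sim + 1.0) / 2.0 * (num_bins - 1))
        counts[index] += 1
    # Log-scale the counts; the +1 keeps empty bins at 0.
    return [math.log10(c + 1) for c in counts]
```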

**Train, Test and Evaluation**:

```bash
python matchzoo/main.py --phase train --model_file ./examples/robust04/config/drmm_robust04.config
python matchzoo/main.py --phase predict --model_file ./examples/robust04/config/drmm_robust04.config
cd ../../../../../
./eval/trec_eval.9.0.4/trec_eval src/main/resources/topics-and-qrels/qrels.robust2004.txt src/main/python/rerank/MatchZoo/data/robust04/predict.test.drmm.txt -m ndcg_cut.20 -m map -m recip_rank -m P.20,30
```
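
To make the requested measures concrete, here is a minimal sketch of what `P.20`, `recip_rank`, and `map` compute for a single topic with binary relevance. It is not a substitute for `trec_eval`, which also handles tie-breaking, graded judgments, and averaging over topics. The function names and document ids are hypothetical.

```python
def precision_at_k(ranked_doc_ids, relevant, k):
    """Fraction of the top-k ranked documents that are relevant (divides by k)."""
    return sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant) / k


def reciprocal_rank(ranked_doc_ids, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_doc_ids, relevant):
    """Mean of precision values at each relevant hit, divided by all relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical ranking and judgments for one topic
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
p_at_2 = precision_at_k(ranked, relevant, 2)
rr = reciprocal_rank(ranked, relevant)
ap = average_precision(ranked, relevant)
```

Averaging `average_precision` over all topics gives the mean average precision that `-m map` refers to.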