Support Parquet data for HNSWDenseVector #2582

valamuri2020 · 2024-08-29T22:36:40Z

Overview

Adds support for running the HNSW indexer on Parquet data. Main additions:

json_to_parquet.py
ParquetDenseVectorCollection
ParquetDenseVectorDocumentGenerator

Tested on the nfcorpus and robust04 datasets, with matching eval results.

Steps to reproduce

Env setup

conda create -n parquet && conda activate parquet
cd src/main/python/parquet
pip install -r requirements.txt

Convert data to parquet format and run indexing

Run the following commands from repo root.

Download raw data

wget 'https://www.dropbox.com/scl/fi/1qnwq7s56muwudetqxgez/bge-base-en-v1.5-robust04.tar?rlkey=sd8zt0qnopwgbel43an46bggc&dl=0' -P collections/

Create parquet data

python src/main/python/parquet/json_to_parquet.py --input collections/robust04 --output collections/robust04.parquet/ --overwrite
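
For orientation, the conversion step can be sketched roughly as follows. This is a minimal sketch, not the actual json_to_parquet.py: it assumes each input line is a JSON object with `docid` and `vector` fields (an assumption about the input layout), that `pyarrow` is installed, and the function names are hypothetical.

```python
import json
from pathlib import Path


def records_to_columns(lines):
    """Turn JSON-lines records into a column-oriented dict.

    Assumes (hypothetically) that each record has 'docid' and 'vector' keys.
    """
    columns = {"docid": [], "vector": []}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        columns["docid"].append(str(record["docid"]))
        columns["vector"].append([float(x) for x in record["vector"]])
    return columns


def write_parquet(jsonl_path, parquet_path):
    """Write one JSON-lines file to one Parquet file (requires pyarrow)."""
    # Imported lazily so the pure-Python helper above works without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    with open(jsonl_path, encoding="utf-8") as f:
        columns = records_to_columns(f)
    Path(parquet_path).parent.mkdir(parents=True, exist_ok=True)
    pq.write_table(pa.table(columns), parquet_path)
```

Storing the vector as a list-of-floats column rather than a JSON string is what makes the Parquet form more compact than the text encoding.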

Create index

bin/run.sh io.anserini.index.IndexHnswDenseVectors -collection ParquetDenseVectorCollection -input collections/robust04.parquet/ -generator ParquetDenseVectorDocumentGenerator -index indexes/parquet-robust04 -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge

Run retrieval on index

bin/run.sh io.anserini.search.SearchHnswDenseVectors \
  -index indexes/parquet-robust04/ \
  -topics tools/topics-and-qrels/topics.beir-v1.0.0-robust04.test.tsv \
  -topicReader TsvString \
  -output runs/parquet-robust04.txt \
  -generator VectorQueryGenerator -topicField title -removeQuery -threads 16 -hits 1000 -efSearch 1000 -encoder BgeBaseEn15

Run evals

bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-robust04.test.txt runs/parquet-robust04.txt
bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.beir-v1.0.0-robust04.test.txt runs/parquet-robust04.txt
bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.beir-v1.0.0-robust04.test.txt runs/parquet-robust04.txt
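
As a rough illustration of what the recall.100 and recall.1000 numbers measure, here is a simplified sketch. It is not a reimplementation of trec_eval; the function name and the dict-based inputs are assumptions made for this example. Like the `-c` flag, it averages over every judged topic, so a topic missing from the run scores zero.

```python
def recall_at_k(run, qrels, k):
    """Mean recall@k over every topic in qrels with at least one relevant doc.

    run:   {topic_id: [doc_id, ...]} ranked best-first
    qrels: {topic_id: {doc_id: relevance_grade}}
    """
    scores = []
    for topic, judgments in qrels.items():
        relevant = {doc for doc, grade in judgments.items() if grade > 0}
        if not relevant:
            continue  # no relevant docs judged for this topic
        # Topics absent from the run contribute an empty ranking (score 0).
        retrieved = set(run.get(topic, [])[:k])
        scores.append(len(relevant & retrieved) / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0
```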

lintool · 2024-08-29T22:38:40Z

pom.xml

@@ -535,6 +535,58 @@
          <artifactId>spring-boot-starter-logging</artifactId>
        </exclusion>
      </exclusions>
+      </dependency>
+    <dependency>


hrm... can we pull in parquet without pulling in all of hadoop? i.e., minimize the dependencies we drag in...

Yeah agreed, I initially tried to do it without hadoop using parquet-mr, but it was not available in any of the mirrors, and I kept getting Missing artifact org.apache.parquet:parquet-mr:jar:1.12.3.

So I had to fall back to hadoop.

what's the size of the fatjar before and after? i.e., how much are you pulling in?

maybe try suppressing the transitive inclusion of jars with exclude tags?

master: 179MB
this branch: 265 MB (after putting in exclude tags)

yikes! that's a lot of bloat for a relatively small feature...

can you poke around to see how we might make this leaner?

updated dependencies, it's around 220 MB now

lintool

Also -

Submodule tools

Update to latest in master please.

lintool · 2024-08-29T22:40:23Z

@valamuri2020 good job!

Compare against the safetensor implementation? I'm interested in collection size and also indexing time.

valamuri2020 · 2024-08-29T22:59:33Z

Also -

Submodule tools

Update to latest in master please.

I think it's already updated to the latest? This is what I see at the top of the git log, the commit hash matches what's on the tools repo.

commit 3a2b3cc5cfd915d707408aa5c4567185e9e4544f (HEAD -> master, origin/master, origin/HEAD)
Author: Ronak <[email protected]>
Date:   Wed Aug 7 14:20:18 2024 -0400

    Add RAG 24 Test Topics (#81)

arjenpdevries · 2024-08-30T10:46:12Z

I was just briefly looking at the discussion above, and was wondering whether a subset of the dependencies might be sufficient?

See here: https://github.com/apache/parquet-java?tab=readme-ov-file#add-parquet-as-a-dependency-in-maven

They list 4 separate dependencies, maybe it can work without using the hadoop one?

lintool · 2024-08-31T17:10:25Z

Also -

Submodule tools

Update to latest in master please.

I think it's already updated to the latest? This is what I see at the top of the git log, the commit hash matches what's on the tools repo.
commit 3a2b3cc5cfd915d707408aa5c4567185e9e4544f (HEAD -> master, origin/master, origin/HEAD)
Author: Ronak <[email protected]>
Date:   Wed Aug 7 14:20:18 2024 -0400

    Add RAG 24 Test Topics (#81)

Yes, this is indeed HEAD. Not sure why your diff is showing this then:

Are you missing a git submodule update somewhere?

valamuri2020 · 2024-09-01T21:51:53Z

Are you missing a git submodule update somewhere?

Nope, running that command didn't update anything.

valamuri2020 · 2024-09-01T21:54:49Z

I was just briefly looking at the discussion above, and was wondering whether a subset of the dependencies might be sufficient?

See here: https://github.com/apache/parquet-java?tab=readme-ov-file#add-parquet-as-a-dependency-in-maven

They list 4 separate dependencies, maybe it can work without using the hadoop one?

Thanks for the idea @arjenpdevries! I gave that a try, and it took some digging. Unfortunately, hadoop still seems to be tightly coupled under the hood. The top-level API uses LocalInputFile instead of HadoopFile, but everything beneath it still relies on hadoop, e.g., the configuration. I read through the PR and GitHub issues; they seem to plan to decouple the two in the future, but there is nothing low-bloat and easy to use right now.

lintool · 2024-09-03T13:25:35Z

src/main/java/io/anserini/index/generator/ParquetDenseVectorDocumentGenerator.java

+   * @return the parsed vector as an array of doubles
+   */
+
+  private float[] parseVectorFromString(String contents) {


In Document, we have

private final double[] vector;

Why do we need to parse from String?

The Document interface doesn't expose the vector directly; it exposes contents() as a String instead. JsonCollection and DenseVectorDocumentGenerator follow the same convention of exposing a String and parsing the vector in the generator, so it seems to have been in place for a while.
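
To make the convention concrete, here is a minimal Python sketch of the parse step. It assumes the contents string holds a JSON-style array such as "[0.1, 0.2, 0.3]", which is an assumption about the serialized form; the function name is hypothetical and this is not the Java method from the diff.

```python
import json


def parse_vector_from_string(contents):
    """Parse a string such as "[0.1, 0.2, 0.3]" into a list of floats."""
    values = json.loads(contents)
    if not isinstance(values, list):
        raise ValueError("expected a JSON array of numbers")
    return [float(v) for v in values]
```

The cost of this convention is a serialize-then-parse round trip for every document, which is the overhead the review question points at.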

lintool · 2024-09-09T13:29:25Z

@valamuri2020

% ls -lh target/anserini-0.37.1-SNAPSHOT-fatjar.jar 
-rw-r--r--  1 jimmylin  staff   179M 26 Aug 19:23 target/anserini-0.37.1-SNAPSHOT-fatjar.jar

What's the latest update on bloat? 220MB? Is that the best we can do? And have we tried excluding as much as we can?

valamuri2020 · 2024-09-09T15:03:42Z

Yep, 226 MB to be exact. That is pretty close to the best we can do - I went through the mvn dependency tree and removed as many of the transitive dependencies as possible.
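
For reference, the overhead relative to master can be worked out from the sizes quoted in this thread (179 MB on master, 265 MB with the first round of exclude tags, 226 MB now). The helper name below is hypothetical; the sketch only does the arithmetic.

```python
def fatjar_overhead(baseline_mb, branch_mb):
    """Return (added MB, percent increase over the baseline)."""
    added = branch_mb - baseline_mb
    return added, round(100 * added / baseline_mb, 1)


for label, size in [("with exclude tags", 265), ("latest", 226)]:
    added, pct = fatjar_overhead(179, size)
    print(f"{label}: +{added} MB ({pct}% over master)")
# prints:
# with exclude tags: +86 MB (48.0% over master)
# latest: +47 MB (26.3% over master)
```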

arjenpdevries · 2024-09-11T18:38:29Z

Perhaps... https://github.com/strategicblue/parquet-floor (I have not checked myself how good this is)

valamuri2020 and others added 6 commits August 29, 2024 18:00

json to parquet conversion working

1acfd86

initial impl for collection and generator, dependencies compiling

5e6c884

indexing and retrieval works end-to-end

5acb4eb

added requirements.txt

8a6a715

cleanup

fe317ce

cleanup

3a56a0a

lintool reviewed Aug 29, 2024

lintool requested changes Aug 29, 2024

valamuri2020 requested a review from lintool August 29, 2024 23:20

valamuri2020 mentioned this pull request Aug 29, 2024

Anserini: replace verbose json-based vector format with more compact binary encoding castorini/ura-projects#31

Open

added exclusion tags for pom.xml and removed contents field

7ae1038

reducing dependency bloat

7cbf850

lintool reviewed Sep 3, 2024

lintool approved these changes Sep 10, 2024

lintool merged commit 66d97d1 into castorini:master Sep 10, 2024
1 check passed

lintool mentioned this pull request Sep 10, 2024

HnswDensevector SafeTensor Generator #2515

Closed

lintool mentioned this pull request Sep 11, 2024

Try parquet-floor #2598

Open
