[FEA] Batched indexes #189

Open

yongchanghao opened this issue Jun 15, 2024 · 1 comment

Labels: feature request (New feature or request)

Comments

@yongchanghao

I have a use case with multiple indexes on one device, where the queries are also batched. For example, if there are N indexes, the query matrix has shape (N, Q, D) and the expected result shape is (N, Q, K).

The brute-force algorithm is easy to implement with numpy/cupy/torch, but is there a plan to support this for IVF or other algorithms? Also, is there a guide for parallelizing this process across CPU threads?
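
For reference, a minimal numpy sketch of the batched brute-force case described above (the function name, array layout, and choice of squared-L2 distance are illustrative assumptions, not a cuVS API):

```python
import numpy as np

def batched_bruteforce_knn(datasets, queries, k):
    """Batched brute-force k-NN.

    datasets: (N, M, D) -- N independent indexes of M vectors each
    queries:  (N, Q, D) -- one query batch per index
    returns:  neighbors (N, Q, k), distances (N, Q, k)
    """
    # Squared L2 distance for every (index, query, point) triple: (N, Q, M)
    d2 = (
        np.einsum("nqd,nqd->nq", queries, queries)[:, :, None]
        - 2.0 * np.einsum("nqd,nmd->nqm", queries, datasets)
        + np.einsum("nmd,nmd->nm", datasets, datasets)[:, None, :]
    )
    # Unordered top-k per query, then sort those k by distance
    idx = np.argpartition(d2, k - 1, axis=2)[:, :, :k]   # (N, Q, k)
    top = np.take_along_axis(d2, idx, axis=2)            # (N, Q, k)
    order = np.argsort(top, axis=2)
    return (np.take_along_axis(idx, order, axis=2),
            np.take_along_axis(top, order, axis=2))

# Example: 4 indexes of 1000 points each, 8 queries per index, top-10
rng = np.random.default_rng(0)
datasets = rng.standard_normal((4, 1000, 64)).astype(np.float32)
queries = rng.standard_normal((4, 8, 64)).astype(np.float32)
neighbors, distances = batched_bruteforce_knn(datasets, queries, k=10)
print(neighbors.shape, distances.shape)  # (4, 8, 10) (4, 8, 10)
```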

yongchanghao added the feature request label on Jun 15, 2024
@cjnolet (Member) commented Jul 3, 2024

@yongchanghao your use case is interesting. We have not considered this exact case, but I wonder if multi-threading the search (and using a different stream for each thread) would allow you to do this and improve performance over querying the indexes individually.

We don't have a guide specifically for parallelizing with threads, but we do have this getting started guide, which provides a starting point for navigating the various CUDA APIs that we use in cuVS. Assuming you are going to use multiple threads here, it will be important to use a unique instance of the underlying raft::device_resources for each thread.

By default, I believe we also enable default_stream_per_thread, so the calls to the different indexes should naturally overlap on the GPU (cc @divyegala to correct me if I'm wrong here).
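
As a rough illustration of the threading pattern suggested above, here is a minimal Python sketch that gives each index its own host thread and its own non-blocking CUDA stream via CuPy. The search_one_index function is a hypothetical stand-in for a per-index cuVS search call (in the C++ API you would likewise hand each thread its own raft::device_resources):

```python
import threading
import cupy as cp

def search_one_index(index_id, queries, k, results):
    # Hypothetical stand-in for a cuVS search on one index; each thread
    # launches its work on its own non-blocking stream so the N searches
    # can overlap on the GPU.
    stream = cp.cuda.Stream(non_blocking=True)
    with stream:
        # ... the real build/search calls for index `index_id` go here;
        # a dummy top-k over random scores keeps the sketch runnable.
        scores = cp.random.rand(queries.shape[0], 1000)
        results[index_id] = cp.argsort(scores, axis=1)[:, :k]
    stream.synchronize()

N, Q, D, k = 4, 8, 64, 10
queries = cp.random.rand(N, Q, D, dtype=cp.float32)
results = [None] * N

threads = [
    threading.Thread(target=search_one_index, args=(i, queries[i], k, results))
    for i in range(N)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```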

The final consideration is that the allocations and deallocations of temporary memory buffers within the algorithms can cause the whole GPU to synchronize with the host threads each time. To get around this, we use memory pools to allocate a large chunk of memory up front. We use RMM for the memory pools (also mentioned in the getting started guide), and the memory pool should also be thread-safe.
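
For completeness, a minimal sketch of setting up an RMM pool from Python using rmm.reinitialize (the pool size is an arbitrary placeholder; size it for your workload):

```python
import rmm

# Allocate a large pool up front so per-search temporary buffers are
# sub-allocated from the pool instead of triggering device-wide
# synchronizing cudaMalloc/cudaFree calls.
rmm.reinitialize(
    pool_allocator=True,
    initial_pool_size=2**30,  # 1 GiB; placeholder value
)
```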
