Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NNDescent::query is very slightly non-deterministic when run in parallel #3

Open
Wainberg opened this issue Jul 5, 2024 · 2 comments

Comments

@Wainberg
Copy link
Contributor

Wainberg commented Jul 5, 2024

Here's an example run. The non-determinism is only during parallel execution.

>>> import numpy as np
>>> from nndescent import NNDescent
>>>
>>> np.random.seed(0)
>>> X = np.random.random(size=(1000, 50)).astype(np.float32)
>>> nndescent = NNDescent(X, n_neighbors=20, seed=0, n_threads=2)
>>> a = nndescent.query(X, k=15)[0]
>>> b = nndescent.query(X, k=15)[0]
>>> np.where(a != b)
(array([193, 193, 193, 193, 193, 193, 193, 606, 606, 606, 606, 606, 606,
       752, 752, 752, 752, 752, 752, 752]), array([ 8,  9, 10, 11, 12, 13, 14,  9, 10, 11, 12, 13, 14,  8,  9, 10, 11,
       12, 13, 14]))
>>> a[193]
array([193, 828, 680, 724, 810, 476,  50, 889, 731, 613, 233, 823, 984,
       770, 122], dtype=int32)
>>> b[193]
array([193, 828, 680, 724, 810, 476,  50, 889, 613, 233, 823, 984, 770,
       122, 429], dtype=int32)

Notice how only a small number of rows in a and b are affected, and only columns 8 to the end or 9 to the end.

Fabulous library, by the way!

@brj0
Copy link
Owner

brj0 commented Jul 29, 2024

The query function is not entirely random. It begins by initializing the nearest neighbor matrix with the leaves of the search tree. If fewer than k neighbors are found, it then adds random nodes to ensure sufficient neighbors. Afterward, the nearest neighbors are iteratively refined.
See: https://github.com/brj0/nndescent/blob/main/src/nnd.h#L852-L863

@Wainberg
Copy link
Contributor Author

Ah gotcha! So the fix would be to use a separate random number generator per thread, seeded at the beginning of each call to query(). Having the ability to make the output fully deterministic is crucial for scientific applications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants