NNDescent::query is very slightly non-deterministic when run in parallel #3

Wainberg · 2024-07-05T05:25:13Z

Here's an example run. The non-determinism appears only when running in parallel (n_threads=2 below).

>>> import numpy as np
>>> from nndescent import NNDescent
>>>
>>> np.random.seed(0)
>>> X = np.random.random(size=(1000, 50)).astype(np.float32)
>>> nndescent = NNDescent(X, n_neighbors=20, seed=0, n_threads=2)
>>> a = nndescent.query(X, k=15)[0]
>>> b = nndescent.query(X, k=15)[0]
>>> np.where(a != b)
(array([193, 193, 193, 193, 193, 193, 193, 606, 606, 606, 606, 606, 606,
       752, 752, 752, 752, 752, 752, 752]), array([ 8,  9, 10, 11, 12, 13, 14,  9, 10, 11, 12, 13, 14,  8,  9, 10, 11,
       12, 13, 14]))
>>> a[193]
array([193, 828, 680, 724, 810, 476,  50, 889, 731, 613, 233, 823, 984,
       770, 122], dtype=int32)
>>> b[193]
array([193, 828, 680, 724, 810, 476,  50, 889, 613, 233, 823, 984, 770,
       122, 429], dtype=int32)

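Notice that only a few rows of a and b differ (193, 606 and 752 here), and only from column 8 or 9 to the end. In row 193, for example, neighbor 731 is missing from b, the later entries shift left by one, and 429 is appended at the end.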
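For anyone who wants to summarize this kind of mismatch on their own runs, here is a minimal sketch that reports, for each differing row, the first column where the two results disagree. The helper name first_mismatch_per_row is hypothetical (it is not part of nndescent), and the two small arrays are built from the a[193] and b[193] rows shown above rather than from a fresh query.

```python
import numpy as np

def first_mismatch_per_row(a, b):
    """Map each differing row index to the first column where a and b disagree.

    Hypothetical helper for illustration; not part of the nndescent API.
    """
    diff = a != b
    rows = np.flatnonzero(diff.any(axis=1))
    # argmax on a boolean row returns the index of the first True
    return {int(r): int(diff[r].argmax()) for r in rows}

# Rows copied from the a[193] and b[193] output in the session above
a193 = np.array([193, 828, 680, 724, 810, 476, 50, 889, 731, 613, 233, 823,
                 984, 770, 122], dtype=np.int32)
b193 = np.array([193, 828, 680, 724, 810, 476, 50, 889, 613, 233, 823, 984,
                 770, 122, 429], dtype=np.int32)

# Row 0 is identical in both arrays, row 1 differs
a = np.stack([a193, a193])
b = np.stack([a193, b193])

result = first_mismatch_per_row(a, b)
only_in_a = sorted(set(a193.tolist()) - set(b193.tolist()))
only_in_b = sorted(set(b193.tolist()) - set(a193.tolist()))
print(result, only_in_a, only_in_b)
```

On the full arrays from the session, calling first_mismatch_per_row(a, b) should list rows 193, 606 and 752 with first mismatches at columns 8 or 9, matching the np.where output.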

Fabulous library, by the way!


brj0 · 2024-07-29T08:36:01Z

The query function is not entirely random. It begins by initializing the nearest neighbor matrix with the leaves of the search tree. If fewer than k neighbors are found, it then adds random nodes to ensure sufficient neighbors. Afterward, the nearest neighbors are iteratively refined.
See: https://github.com/brj0/nndescent/blob/main/src/nnd.h#L852-L863

Wainberg · 2024-07-29T16:05:49Z

Ah gotcha! So the fix would be to use a separate random number generator per thread, seeded at the beginning of each call to query(). Having the ability to make the output fully deterministic is crucial for scientific applications.
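To make the suggestion concrete, here is a rough sketch of the idea in Python with NumPy. It is not nndescent's code (the library is C++ and would use its own random number generator), and the function name random_fill and its parameters are hypothetical. The assumption is that the query rows are split into fixed chunks, one per thread, and that each chunk gets its own generator derived from the seed on every call.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def random_fill(n_queries, k, n_points, seed, n_threads):
    """Fill an (n_queries, k) matrix with random point indices, reproducibly.

    Hypothetical stand-in for the 'add random nodes' step; not nndescent's API.
    """
    # Derive fresh, independent child seeds from the user seed on every call
    children = np.random.SeedSequence(seed).spawn(n_threads)
    # Fixed assignment of rows to chunks, independent of thread scheduling
    chunks = np.array_split(np.arange(n_queries), n_threads)
    out = np.empty((n_queries, k), dtype=np.int32)

    def work(rows, child):
        rng = np.random.default_rng(child)  # one generator per chunk
        out[rows] = rng.integers(0, n_points, size=(len(rows), k),
                                 dtype=np.int32)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(work, chunks, children))  # wait for all chunks
    return out

first = random_fill(1000, 15, 1000, seed=0, n_threads=2)
second = random_fill(1000, 15, 1000, seed=0, n_threads=2)
print(np.array_equal(first, second))
```

Because each chunk draws from its own generator, the output no longer depends on which thread runs first. In this sketch the result still depends on n_threads, since the chunking changes with it; seeding per query row instead of per chunk would remove that dependence too.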
