I recently started using PyNNDescent to create nearest neighbor graphs. It is much faster and more stable than sklearn's nearest neighbor alternative.
The problem I am trying to solve is one of scale: I generate a data set whose columns are used, pairwise, to compute nearest neighbor graphs. For a 500 x 50 matrix (the example below) that means 50*(50-1)/2 = 1225 calls to pynndescent, which takes about 9 seconds on my machine. This is very limiting because I would like to handle significantly larger data sets (e.g. N = 1000 columns instead of 50).
Are there ways to optimize this beyond a plain loop? I flattened the naive nested loop into a single loop over column pairs, but Python threading and multiprocessing haven't produced better results than the solution below.
```python
import itertools
import numpy as np
from pynndescent import NNDescent

data = np.random.normal(0, 1, (500, 50))
# all unordered pairs of columns: 50 * 49 / 2 = 1225
idx = list(itertools.combinations(range(data.shape[1]), 2))
count = len(idx)
s = np.empty(count, dtype=np.float16)
for i in range(count):
    # neighbor_graph is (indices, distances); sum the distances
    s[i] = NNDescent(data[:, idx[i]], metric='euclidean',
                     n_neighbors=5).neighbor_graph[1].sum()
```
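Process-based parallelism over the column pairs is the most direct thing to try. Below is a minimal sketch with `concurrent.futures`; for self-containment it uses an exact brute-force kNN worker in place of the `NNDescent` call, but the call pattern is the same (each worker receives one two-column slice). Whether processes actually help depends on pickling overhead and on numba's own thread pool inside pynndescent, so treat this as something to benchmark, not a guaranteed win.

```python
import itertools
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def knn_dist_sum(args):
    """Worker: exact 5-NN distance sum for one two-column slice.

    Stand-in for NNDescent(sub, ...).neighbor_graph[1].sum() so the
    sketch runs without pynndescent installed.
    """
    sub, k = args                          # sub has shape (n_samples, 2)
    d2 = ((sub[:, None, :] - sub[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)           # exclude self-distances
    part = np.partition(d2, k - 1, axis=1)[:, :k]
    return np.sqrt(part).sum()

if __name__ == "__main__":
    data = np.random.normal(0, 1, (500, 50))
    pairs = list(itertools.combinations(range(data.shape[1]), 2))
    jobs = [(data[:, p], 5) for p in pairs]
    with ProcessPoolExecutor() as ex:
        s = np.fromiter(ex.map(knn_dist_sum, jobs, chunksize=64),
                        dtype=np.float64, count=len(pairs))
    print(s.shape)
```

Note the `chunksize` argument: with 1225 small tasks, batching them reduces inter-process communication, which is often where naive multiprocessing attempts lose their advantage.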
lyelibi changed the title from "nested loop sequential or parrelization runtime" to "nested loop sequential or parallelization runtime" on Jun 3, 2022
Off the top of my head I don't see any obvious ways to improve this significantly. There is just a lot of computation work to be done, and I'm not sure there are ways around that. You would have to look into deeper algorithmic approaches to cut down on work I think.
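One algorithmic angle worth noting, since each graph here lives in only two of the columns: squared euclidean distances decompose per column, so the 500 x 500 matrix of per-column squared differences can be precomputed once per column, and the squared-distance matrix for any pair (i, j) is just the sum of two precomputed matrices. With only 500 points, exact kNN via `np.partition` on that matrix may then be competitive with 1225 approximate NNDescent runs. This is a hedged sketch, not a drop-in replacement: it computes exact neighbors, and pynndescent typically reports each point as its own first neighbor, so the sums will not match NNDescent's output exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0, 1, (500, 50))
n, m = data.shape
k = 5

# Per-column squared-difference matrices:
# col_sq[c][a, b] = (data[a, c] - data[b, c])**2
# Memory: m * n * n floats (~100 MB here); chunk over columns for larger n.
col_sq = np.empty((m, n, n))
for c in range(m):
    diff = data[:, c][:, None] - data[:, c][None, :]
    col_sq[c] = diff * diff

def pair_knn_sum(i, j, k=k):
    """Exact sum of the k smallest neighbor distances for column pair (i, j)."""
    d2 = col_sq[i] + col_sq[j]        # squared distances in the 2-D subspace
    np.fill_diagonal(d2, np.inf)      # exclude self-matches
    part = np.partition(d2, k - 1, axis=1)[:, :k]
    return np.sqrt(part).sum()
```

The precompute costs O(m * n^2) once, after which each of the m*(m-1)/2 pairs is a single matrix add plus a partial sort, with no index-building overhead per pair.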