Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

possible problem using umap_transform with nearest neighbor distance #4972

Closed
jlmelville opened this issue Aug 22, 2021 · 4 comments
Closed
Assignees
Labels
enhancement New feature or request

Comments

@jlmelville
Copy link

This may be related to #4647: I recently discovered a documentation issue with umap_transform. This might affect RunUMAP.default, where it calls umap_transform with X = NULL and nn_method = object:

if (is.list(x = object)) {
uwot::umap_transform(
X = NULL,
nn_method = object,
model = model,
n_threads = nbrOfWorkers(),
n_epochs = n.epochs,
verbose = verbose
)

I am not sure what object is. But the documentation for umap_transform was misleading. To be clear, the pre-computed nearest neighbor data that can be passed to the nn_method parameter should consist of:

  • 1 row for every item that is to be transformed. For example, if you have 50 items you want to add to an existing UMAP plot, there should be 50 rows in the nearest neighbor matrices.
  • The neighbors that are referred to are the items that generated the original UMAP model/plot (i.e. the data that was input to the umap function and is part of the model data), not the new data to be transformed.

For example, if the first row of object$idx looks like:

   [,1] [,2] [,3] [,4]
1     1    5    8   10

that means the 4 nearest neighbors to the first item you want to add to the existing embedding can be found in rows 1, 5, 8 and 10 of the matrix/data-frame passed to the umap function's X parameter, not the data you want to transform.

Another way to put this is if you want to use the nn_method parameter this way, you must:

  • Build some kind of nearest neighbor index using some training data (the same data you would pass to umap).
  • Query the training data index with your test data (this is the data you would pass to umap_transform).

Hopefully I am not belaboring the point, but it's tricky to describe the difference and the documentation was misleading. I am now fixing the documentation and adding more validation to umap_transform. I also don't actually know if there is any problem to solve here: maybe the object is correctly constructed. If not, then please be aware that:

  • The current behavior is wrong and will give bad results when using the pre-computed nearest neighbors.
  • It might be what caused the memory error in ProjectUMAP memory error? #4647.
  • The code path at
    umap_transform(
    is unaffected and should be correct.
  • Just some advance warning: as a result of adding more validation to umap_transform a new release of uwot to CRAN may cause test errors and existing workflows which used to do the wrong thing silently will start (correctly) failing. I am unlikely to even start the reverse dependency checking as part of the CRAN release workflow for a few weeks, however.

If this turns out to be a real problem, my apologies to you and your users for this error.

Finally, a small FYI about:

if (metric == 'correlation') {
warning(
"UWOT does not implement the correlation metric, using cosine instead",
call. = FALSE,
immediate. = TRUE
)
metric <- 'cosine'
}

uwot has supported the (Pearson) correlation metric since 0.1.9 (released on CRAN in November 2020).

@jlmelville jlmelville added the bug Something isn't working label Aug 22, 2021
@yuhanH
Copy link
Collaborator

yuhanH commented Sep 3, 2021

Hi, James
Thanks for checking our RunUMAP function!

if (is.list(x = object)) {
uwot::umap_transform(
X = NULL,
nn_method = object,
model = model,
n_threads = nbrOfWorkers(),
n_epochs = n.epochs,
verbose = verbose
)

The model is a UMAP model from the reference data.
The object here is a list which contains the knn neighbors index and distance. The names of these two elements are idx and dist. For example,

  • object$idx[1,1:10] looks like
 29793  9809  5072 11168  8622 33114 24179  4757 33502 14913
  • object$dist[1,1:10] looks like
0.2638129 0.2749854 0.2811905 0.2918375 0.2930307 0.3000401 0.3069038 0.3109550 0.3154789 0.3166593

They fit the input format of nn_method. In the new release of uwot. Will the input format of nn_method be changed?

Technically, we calculated object, the knn list, from our ProjectUMAP function. It is a knn neighbors between query data and reference data (original data for the umap model).

query.neighbor <- FindNeighbors(
object = reference[, reference.dims],
query = query[, query.dims],
k.param = k.param,
nn.method = nn.method,
n.trees = n.trees,
annoy.metric = annoy.metric,
cache.index = cache.index,
index = index,
return.neighbor = TRUE,
l2.norm = l2.norm
)

Also, thanks for pointing out correlation metric in uwot. We will fix this soon.

@yuhanH yuhanH added enhancement New feature or request and removed bug Something isn't working labels Sep 3, 2021
@yuhanH yuhanH self-assigned this Sep 3, 2021
@jlmelville
Copy link
Author

Thanks for looking at this @yuhanH.

In the new release of uwot. Will the input format of nn_method be changed?

No there will just be a documentation change.

Technically, we calculated object, the knn list, from our ProjectUMAP function. It is a knn neighbors between query data and reference data (original data for the umap model).

Ok that’s good to know. That is the correct data format, no changes need to be made in Seurat, and existing workflows are also giving the correct results. The concern I raised in this issue is not actually a problem.

@yuhanH
Copy link
Collaborator

yuhanH commented Sep 7, 2021

Sounds great! Thanks.

@yuhanH
Copy link
Collaborator

yuhanH commented Sep 10, 2021

Uwot correlation metric is added in the develop branch version 4.0.4.9002
Thanks!

@yuhanH yuhanH closed this as completed Sep 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants