Add files via upload

Updated parameter selection vignette
welch-lab · Dec 6, 2021 · 717b5b9 · 717b5b9
1 parent b0271e0
commit 717b5b9
Show file tree

Hide file tree

Showing 2 changed files with 145 additions and 6 deletions.
diff --git a/vignettes/Parameter_selection.Rmd b/vignettes/Parameter_selection.Rmd
@@ -0,0 +1,138 @@
+
+---
+title: "Defining Optimal Heuristic Parameters"
+author: "April Kriebel and Joshua Welch"
+date: "12/6/2021"
+output: html_document
+---
+
+
+LIGER has two main free parameters: lambda and K.  While the default values of `Lambda = 5` and `K = 30` have proven efficacy as default parameters, a user may wish to optimize these parameters for their particular analysis. In this vignette, we demonstrate how such optimal parameter selection can be performed. We use subset the integrated datasets such that the analysis time is practical for a vignette, and perform the intergration on a spatial transcriptomic dataset (osmFISH) and scRNA-seq dataset (DROPviz).
+The data pre-processing steps are identical for optimizing both parameters. 
+
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+osmfish = readRDS("C:/Users/april/OneDrive/Documents/UINMF_Final_backups/Vignette/Parameter_selection/Downsampled.osmFISH.RDS")
+rna = readRDS("C:/Users/april/OneDrive/Documents/UINMF_Final_backups/Vignette/Parameter_selection/Downsampled.DropViz.RDS")
+
+library(rliger)
+
+```
+```{r eval = FALSE}
+install.packages('devtools')
+library(devtools)
+install_github('welch-lab/liger')
+library(rliger)
+```
+
+# Step 1: Preprocessing and Normalization
+
+First, read in your datasets. For this tutorial, we use a downsampled osmFISH dataset (33 genes by 2,000 cells), and a downsampled single-cell RNA-seq dataset (29,463 genes by 2,000 cells). The datasets can be downloaded at https://www.dropbox.com/sh/y9kjoum8u469nj1/AADik2b2-Qo3os2QSWXdIAbna?dl=0 .
+
+```{r eval = FALSE}
+osmfish = readRDS("Downsampled.osmFISH.RDS")
+rna = readRDS("Downsampled.DropViz.RDS")
+```
+
+Next, create your Liger object, submitting the datasets in list format. The unshared features should not be subsetted out, or submitted separately. Rather, they should be included in the matrix submitted for that dataset. For example,the scRNA-seq data is submitted in its entirety, the unshared features are not submitted separately. This helps ensure proper normalization.
+```{r createliger, warning = FALSE, message = FALSE, results = FALSE}
+osm.liger <- createLiger(list(osmFISH = osmfish, rna = rna))
+```
+Normalize the datasets.The normalization is applied to the datasets in their entirety.
+
+```{r normalize}
+osm.liger <- normalize(osm.liger)
+```
+To include unshared features in your analysis, set the `unshared` parameter to TRUE when selecting variable genes. To select the unshared features, it is necessary to include a list of what datasets the unshared features should be included from. For instance, in this case, we wish to include the unshared features from the RNA dataset, the second dataset in our analysis. Therefore, to use the unshared features from dataset 2, use the parameter `unshared.datasets = list (1)`. If the user wishes to include unshared feature sets from both datasets, the appropriate parameter setting is `unshared.datasets = list(1,2)`. We provide an individual threshold parameter for selecting unshared features: `unshared.thresh`. If a single value is submitted, that threshold is applied when selecting unshared features for all datasets. If the user wishes to submit different thresholds to select features from each dataset, the user can specify an individual thresholds for each dataset by submitting a list of thresholds the same length as the number of datasets with unshared datasets. The variable unshared features will be stored in `[email protected]`.
+
+```{r selectGenes, warning= FALSE}
+osm.liger <- selectGenes(osm.liger, unshared = TRUE, unshared.datasets = list(2), unshared.thresh= 0.4)
+```
+
+
+The scaleNotCenter functions will scale both the shared and unshared features. The scaled unshared features will be stored 
+in `[email protected]`
+
+```{r scalenotcenter, results = FALSE}
+osm.liger <- scaleNotCenter(osm.liger)
+```
+
+The selection of optimal `K` and `Lambda` parameters will be influenced by the number of variable genes used, as well as the number of cells present in each dataset. Although we run these analysis on downsampled datasets for the ease and convenience of our users, when performing these analysis on your own data, you will want to use the full set of data. 
+
+## Selecting an appropriate K-Value
+While the default value of `K = 30` has been shown to produce excellent results in a wide variety of scenarios, the user may wish to optimize the `K` parameter for their particular analysis. To explore optimum K-values for a specific dataset integration, we calculate the alignment scores of the integration across a variety of K-values. We assume that the user has some familiarity with the LIGER pipeline; for more thorough explanations of the `optimizeALS`, `quantile_norm`, and `louvainCluster` functions, please see our other vignettes, especially http://htmlpreview.github.io/?https://github.com/welch-lab/liger/blob/master/vignettes/UINMF_vignette.html . 
+Because UINMF, as well as iNMF, are not guaranteed to converge to a global minimum, we three random seeds at each value of K in order create a more general understanding of data trends.
+
+```{r iterate_k, message = FALSE, results= FALSE}
+seed = sample(1:200, 3, replace = FALSE)
+alignment_score = list()
+for (iter in seed){
+K_values = c(10, 20, 30, 40, 50)
+for (i in K_values){
+  osm.liger <- optimizeALS(osm.liger,  lambda = 5 , use.unshared = TRUE, max.iters = 30, thresh=1e-10, k =i, rand.seed = iter)
+  osm.liger <- quantile_norm(osm.liger, ref_dataset = "rna")
+  osm.liger <- louvainCluster(osm.liger)
+  new_alignment = calcAlignment(osm.liger)
+  names(new_alignment) = paste0("Seed:", iter, "_K:",i)
+  alignment_score = append(alignment_score, new_alignment)
+}
+}
+```
+
+At the conclusion of the above loop, the list `alignment_score` has the calculated alignment scores for the integration across a variety of K values, and across three seeds.
+We want to plot the alignment score in order to observe general data trends, and will need to load `ggpubr`, `tidyr`, and `sjstats`.
+
+```{r visualizations, message = FALSE, warning = FALSE}
+library(ggpubr)
+library(sjstats)
+library(tidyr)
+align_df = data.frame(alignment_score)
+align_df = data.frame(t(align_df))
+colnames(align_df) = "Alignment_Score"
+align_df$details = rownames(align_df)
+align_df = separate(data = align_df , col = details, into = c("Seed", "K"), sep = "_")
+
+ggline(align_df, x = "K", y = "Alignment_Score", add = "mean_se", palette = "jco",lwd =2) +
+  xlab("K-Value") + ylab("Alignment Score") + ggtitle("Selecting K")
+
+```
+\
+\
+When selecting K, most researchers are looking to maximize `K`, while retaining a decent alignment score. 
+
+## Selecting an appropriate lambda value
+The default parameter of `lambda = 5` has previously worked well in a variety of analysis. However, to explore how the selection of lambda might be influencing your analysis, we can again calculate the alignment scores of the algorithm across a span of lambda values, gaining intuition on what lambda values are appropriate for a particular data analysis. 
+
+```{r iterate_lambda, message = FALSE, results = FALSE}
+seed = sample(1:200, 3, replace = FALSE)
+alignment_score = list()
+for (iter in seed){
+lambda_values = c(1,5,10,15,20)
+for (i in lambda_values){
+  osm.liger <- optimizeALS(osm.liger,  lambda = i , use.unshared = TRUE, max.iters = 30, thresh=1e-10, k =30, rand.seed = iter)
+  osm.liger <- quantile_norm(osm.liger, ref_dataset = "rna")
+  osm.liger <- louvainCluster(osm.liger)
+  new_alignment = calcAlignment(osm.liger)
+  names(new_alignment) = paste0("Seed:", iter, "_Lambda:",i)
+  alignment_score = append(alignment_score, new_alignment)
+}
+}
+```
+
+We can then visualize the algorithm's performance across a variety of lambda values.
+
+```{r visualizations_lambda, message = FALSE, warning = FALSE}
+align_df = data.frame(alignment_score)
+align_df = data.frame(t(align_df))
+colnames(align_df) = "Alignment_Score"
+align_df$details = rownames(align_df)
+align_df = separate(data = align_df , col = details, into = c("Seed", "Lambda"), sep = "_")
+
+ggline(align_df, x = "Lambda", y = "Alignment_Score", add = "mean_se", palette = "jco",lwd =2) +
+  xlab("Lambda Value") + ylab("Alignment Score") + ggtitle("Selecting Lambda")
+
+```
+\
+\
+We can see that largest increase in alignment score comes from having `lambda = 5`, which fits with our initial intuition of this being an appropriate default parameter. Further increases in alignment are more marginal, and extreme values of lambda should be chosen with care.