Update liger-vignette with new functionality

welch-lab · Jan 22, 2019 · db8bdbc · db8bdbc
1 parent d13a488
commit db8bdbc
Show file tree

Hide file tree

Showing 2 changed files with 170 additions and 26 deletions.
diff --git a/vignettes/liger-vignette.Rmd b/vignettes/liger-vignette.Rmd
@@ -1,6 +1,6 @@
 ---
 title: "Comparing and contrasting heterogeneous single cell profiles using liger"
-author: "Joshua D. Welch"
+author: "Joshua D. Welch and Velina Kozareva"
 date: "`r Sys.Date()`"
 output: rmarkdown::html_vignette
 vignette: >
@@ -16,21 +16,42 @@ knitr::opts_chunk$set(
 )
 ```
 
-## Data Preprocessing
+## Data preprocessing
 The algorithm takes a list of two or more digital gene expression (DGE) matrices as input. Genes should be in rows and cells in columns.
 Before running the factorization, we need to normalize the data to account for different numbers of UMIs per cell,
 select variable genes, and scale the data. Note that we do not center the data because nonnegative matrix factorization accepts only
-positive values. The selectGenes function performs variable gene selection on each of the datasets separately, then takes the union. Note that coresponding genes in each dataset need to have the same names (though the genes do not need to be in the same order in each dataset). For cross-species analysis, it may be convenient to convert all gene names to uppercase; you can do this using the capitalize=T option of the selectGenes function.
+positive values. The selectGenes function performs variable gene selection on each of the datasets separately, then takes the union. Note that corresponding genes in each dataset need to have the same names (though the genes do not need to be in the same order in each dataset). For cross-species analysis, it may be convenient to convert all gene names to uppercase; you can do this using the capitalize=T option of the selectGenes function.
 ```r
-dge1 = readRDS("dge1.RDS") #genes in rows, cells in columns, rownames and colnames included. Sparse matrix format is accepted.
+dge1 = readRDS("dge1.RDS") #genes in rows, cells in columns, rownames and colnames included. Sparse matrix format is
+recommended.
 dge2 = readRDS("dge2.RDS")
 ligerex = createLiger(list(name1 = dge1, name2 = dge2)) #Can also pass in more than 2 datasets
 ligerex = normalize(ligerex)
 ligerex = selectGenes(ligerex, var.thresh = 0.1)
 ligerex = scaleNotCenter(ligerex)
+
 ```
 
-## Performing the Factorization
+### Loading 10X Data
+`liger` also has functions for reading data generated by 10X's `cellranger count` pipeline (including
+from 10X V3). We can merge data by data type (most commonly Gene Expression) across multiple samples and 
+then use this as a single dataset in a new object for integration. 
+```r
+# 10X data to be combined for sample 1 
+sample1_data1 = "/path/to/10X/output/dir1"
+sample1_data2 = "/path/to/10X/output/dir2"
+sample1_dge = read10X(sample.dirs = list(sample1_data1, sample1_data2), 
+                      sample.names = c("s1_data1", "s1_data2"), min.umis = 500)
+# 10X data for sample 2
+sample2_data = "/path/to/10X/output/dir3"
+sample2_dge = read10X(sample.dirs = list(sample2_data), 
+                      sample.names = c("s2_data"), min.umis = 500)
+liger10X = createLiger(list(sample1 = sample1_dge, sample2 = sample2_dge))
+# continue with other preprocessing steps
+
+```
+
+## Performing the factorization
 Next we perform the factorization using an alternating least squares algorithm. After performing the factorization,
 we identify cells that load on corresponding cell factors and quantile normalize their factor loadings across datasets.
 The key parameters here are the number of factors (k), the penalty parameter (lambda), and the clustering resolution.
@@ -41,37 +62,84 @@ ligerex = quantileAlignSNF(ligerex) #SNF clustering and quantile alignment
 ```
 
 ## Visualizing the results
+We can visualize the results by using dimensionality reduction techniques like t-SNE or UMAP (recommended 
+for larger datasets). Visualizations can be colored by dataset of origin or cluster assignment. 
+`plotWordClouds` is a useful way to visualize the most highly loading genes (both shared and dataset
+specific) for each factor, in conjunction with the factor loadings across cells.  
 ```r
 ligerex = runTSNE(ligerex)
+# for larger datasets, may want to use UMAP instead
+ligerex = runUMAP(ligerex)
 plotByDatasetAndCluster(ligerex) #Can also pass in different set of cluster labels to plot
+
 pdf("word_clouds.pdf")
 plotWordClouds(ligerex)
 dev.off()
+
+```
+
+## Exploring factors and clusters
+Another way to examine factor loadings across cells, and to help visualize the alignment process is 
+through `plotFactors`; this can be helpful for seeing significant scale differences between the 
+datasets. We can also visualize the correspondence between clusters and factors in the data with 
+`plotClusterProportions` and `plotClusterFactors`. These can be especially useful for identifying 
+clusters associated with certain factors and corresponding marker genes. 
+```r
+plotFactors(ligerex)
+plotClusterProportions(ligerex)
+plotClusterFactors(ligerex, use.aligned = T)
 ```
 
-## Finding marker genes
-We can use the factorization to identify shared and dataset-specific markers. The function below returns a list,
-where the first element contains dataset-specific markers for dataset 1, the second element contains shared
-markers, the third element contains dataset-specific markers for dataset 2, and the last 2 elements indicate 
-the number of factors in which each marker is found. This information allows the identification
-of ubiquitous vs. cell-type-specific dataset differences.
+## Finding and visualizing marker genes
+We can use the factorization to more explicitly identify shared and dataset-specific markers. The function `getFactorMarkers`
+returns a list, where the first element contains dataset-specific markers for dataset 1, the second 
+element contains sharedmarkers, the third element contains dataset-specific markers for dataset 2, 
+and the last 2 elements indicate the number of factors in which each marker is found. This information 
+allows the identification of ubiquitous vs. cell-type-specific dataset differences.
 ```r
 markers = getFactorMarkers(ligerex, num.genes = 10)
 plotGene(ligerex, gene = "Malat1")
 plotGeneViolin(ligerex, gene = "Malat1")
 ```
 
+## Comparing different cluster assignments 
+We can compare and visualize `liger` cluster assignments with existing clusterings (if cluster assignments
+are available for the individual datasets). 
+```r
+# published cluster assignments for all cells in dataset 1 and 2
+clusters_published1 = readRDS("clusters1.RDS")
+clusters_published2 = readRDS("clusters2.RDS")
+# calculate adjusted Rand Index between liger cluster assignments and another assignment
+calcARI(ligerex, c(clusters_published1, clusters_published2)) 
+# calculate purity between liger cluster assignments and another assignment for just dataset 1 
+calcPurity(ligerex, clusters_published1)
+# visualize joint cluster assignment as related to individual dataset cluster assignments
+makeRiverplot(ligerex, clusters_published1, clusters_published2)
+```
+
+## Checking dataset alignment and individual dataset distortion
+`liger` includes methods for estimating the level of integration (alignment) between datasets and 
+the level of distortion of structure for each individual dataset after factorization and alignment
+(compared to before). In general, datasets which have been factorized and aligned in a meaningful way
+should show a high degree of integration (high alignment metric), while maintaining a low degree of distortion 
+(high agreement metric).
+```r
+calcAlignment(ligerex)
+calcAgreement(ligerex)
+# see if certain clusters are more integrated than others
+calcAlignmentPerCluster(ligerex)
+```
 
 ## Selecting k and lambda
-The suggestK and suggestLambda functions can aid in selecting k and lambda. We want to find the smallest
+The `suggestK` and `suggestLambda` functions can aid in selecting k and lambda. We want to find the smallest
 k for which the increase in entropy metric begins to level off (an "elbow" in the plot). Similarly, we want the
 smallest lambda for which the alignment metric stabilizes.
 ```r
 suggestK(ligerex) # plot entropy metric to find an elbow that can be used to select the number of factors
 suggestLambda(ligerex, k) # plot alignment metric to find an elbow that can be used to select the value of lambda
 ```
 
-## Updating the Factorization
+## Updating the factorization
 If we want to add new data, change k or lambda, or re-analyze a subset of the data, the functions
 below provide an efficient method of updating. This is much faster than the naive approach of 
 simply re-running the optimizeALS algorithm.
@@ -86,3 +154,18 @@ ligerex = optimizeNewData(ligerex, newdata = list(name3 = dge3, name4 = dge4),
 #cell.subset is a list of cells to retain from each dataset
 ligerex = optimizeSubset(ligerex, cell.subset) 
 ```
+
+## Integration between liger and Seurat
+We can easily create `liger` objects from Seurat v2 objects (and vice versa), while keeping calculated 
+features from the original objects. 
+```r
+# Create liger object from two separate Seurat objects, keeping union of top 2000 highly variable 
+# genes from each object
+ligerex = seuratToLiger(list(seurat1, seurat2), names = c('name1', 'name2'), num.hvg.info = 2000)
+# Create liger object from single integrated Seurat object, keeping CCA factorization, 
+# splitting datasets by Seurat meta.var column "original"
+ligerex = seuratToLiger(seurat_obj, meta.var = 'original', cca.to.H = T)
+# Create Seurat object from liger object, each dataset
+seurat_obj = ligerToSeurat(ligerex)
+
+```