Skip to content

Commit

Permalink
Update liger-vignette with new functionality
Browse files Browse the repository at this point in the history
  • Loading branch information
vkozareva committed Jan 22, 2019
1 parent d13a488 commit db8bdbc
Show file tree
Hide file tree
Showing 2 changed files with 170 additions and 26 deletions.
109 changes: 96 additions & 13 deletions vignettes/liger-vignette.Rmd
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
title: "Comparing and contrasting heterogeneous single cell profiles using liger"
author: "Joshua D. Welch"
author: "Joshua D. Welch and Velina Kozareva"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
Expand All @@ -16,21 +16,42 @@ knitr::opts_chunk$set(
)
```

## Data Preprocessing
## Data preprocessing
The algorithm takes a list of two or more digital gene expression (DGE) matrices as input. Genes should be in rows and cells in columns.
Before running the factorization, we need to normalize the data to account for different numbers of UMIs per cell,
select variable genes, and scale the data. Note that we do not center the data because nonnegative matrix factorization accepts only
positive values. The selectGenes function performs variable gene selection on each of the datasets separately, then takes the union. Note that coresponding genes in each dataset need to have the same names (though the genes do not need to be in the same order in each dataset). For cross-species analysis, it may be convenient to convert all gene names to uppercase; you can do this using the capitalize=T option of the selectGenes function.
positive values. The selectGenes function performs variable gene selection on each of the datasets separately, then takes the union. Note that corresponding genes in each dataset need to have the same names (though the genes do not need to be in the same order in each dataset). For cross-species analysis, it may be convenient to convert all gene names to uppercase; you can do this using the capitalize=T option of the selectGenes function.
```r
dge1 = readRDS("dge1.RDS") #genes in rows, cells in columns, rownames and colnames included. Sparse matrix format is accepted.
dge1 = readRDS("dge1.RDS") #genes in rows, cells in columns, rownames and colnames included. Sparse matrix format is
recommended.
dge2 = readRDS("dge2.RDS")
ligerex = createLiger(list(name1 = dge1, name2 = dge2)) #Can also pass in more than 2 datasets
ligerex = normalize(ligerex)
ligerex = selectGenes(ligerex, var.thresh = 0.1)
ligerex = scaleNotCenter(ligerex)

```

## Performing the Factorization
### Loading 10X Data
`liger` also has functions for reading data generated by 10X's `cellranger count` pipeline (including
from 10X V3). We can merge data by data type (most commonly Gene Expression) across multiple samples and
then use this as a single dataset in a new object for integration.
```r
# 10X data to be combined for sample 1
sample1_data1 = "/path/to/10X/output/dir1"
sample1_data2 = "/path/to/10X/output/dir2"
sample1_dge = read10X(sample.dirs = list(sample1_data1, sample1_data2),
sample.names = c("s1_data1", "s1_data2"), min.umis = 500)
# 10X data for sample 2
sample2_data = "/path/to/10X/output/dir3"
sample2_dge = read10X(sample.dirs = list(sample2_data),
sample.names = c("s2_data"), min.umis = 500)
liger10X = createLiger(list(sample1 = sample1_dge, sample2 = sample2_dge))
# continue with other preprocessing steps

```

## Performing the factorization
Next we perform the factorization using an alternating least squares algorithm. After performing the factorization,
we identify cells that load on corresponding cell factors and quantile normalize their factor loadings across datasets.
The key parameters here are the number of factors (k), the penalty parameter (lambda), and the clustering resolution.
Expand All @@ -41,37 +62,84 @@ ligerex = quantileAlignSNF(ligerex) #SNF clustering and quantile alignment
```

## Visualizing the results
We can visualize the results by using dimensionality reduction techniques like t-SNE or UMAP (recommended
for larger datasets). Visualizations can be colored by dataset of origin or cluster assignment.
`plotWordClouds` is a useful way to visualize the most highly loading genes (both shared and dataset
specific) for each factor, in conjunction with the factor loadings across cells.
```r
ligerex = runTSNE(ligerex)
# for larger datasets, may want to use UMAP instead
ligerex = runUMAP(ligerex)
plotByDatasetAndCluster(ligerex) #Can also pass in different set of cluster labels to plot

pdf("word_clouds.pdf")
plotWordClouds(ligerex)
dev.off()

```

## Exploring factors and clusters
Another way to examine factor loadings across cells, and to help visualize the alignment process is
through `plotFactors`; this can be helpful for seeing significant scale differences between the
datasets. We can also visualize the correspondence between clusters and factors in the data with
`plotClusterProportions` and `plotClusterFactors`. These can be especially useful for identifying
clusters associated with certain factors and corresponding marker genes.
```r
plotFactors(ligerex)
plotClusterProportions(ligerex)
plotClusterFactors(ligerex, use.aligned = T)
```

## Finding marker genes
We can use the factorization to identify shared and dataset-specific markers. The function below returns a list,
where the first element contains dataset-specific markers for dataset 1, the second element contains shared
markers, the third element contains dataset-specific markers for dataset 2, and the last 2 elements indicate
the number of factors in which each marker is found. This information allows the identification
of ubiquitous vs. cell-type-specific dataset differences.
## Finding and visualizing marker genes
We can use the factorization to more explicitly identify shared and dataset-specific markers. The function `getFactorMarkers`
returns a list, where the first element contains dataset-specific markers for dataset 1, the second
element contains sharedmarkers, the third element contains dataset-specific markers for dataset 2,
and the last 2 elements indicate the number of factors in which each marker is found. This information
allows the identification of ubiquitous vs. cell-type-specific dataset differences.
```r
markers = getFactorMarkers(ligerex, num.genes = 10)
plotGene(ligerex, gene = "Malat1")
plotGeneViolin(ligerex, gene = "Malat1")
```

## Comparing different cluster assignments
We can compare and visualize `liger` cluster assignments with existing clusterings (if cluster assignments
are available for the individual datasets).
```r
# published cluster assignments for all cells in dataset 1 and 2
clusters_published1 = readRDS("clusters1.RDS")
clusters_published2 = readRDS("clusters2.RDS")
# calculate adjusted Rand Index between liger cluster assignments and another assignment
calcARI(ligerex, c(clusters_published1, clusters_published2))
# calculate purity between liger cluster assignments and another assignment for just dataset 1
calcPurity(ligerex, clusters_published1)
# visualize joint cluster assignment as related to individual dataset cluster assignments
makeRiverplot(ligerex, clusters_published1, clusters_published2)
```

## Checking dataset alignment and individual dataset distortion
`liger` includes methods for estimating the level of integration (alignment) between datasets and
the level of distortion of structure for each individual dataset after factorization and alignment
(compared to before). In general, datasets which have been factorized and aligned in a meaningful way
should show a high degree of integration (high alignment metric), while maintaining a low degree of distortion
(high agreement metric).
```r
calcAlignment(ligerex)
calcAgreement(ligerex)
# see if certain clusters are more integrated than others
calcAlignmentPerCluster(ligerex)
```

## Selecting k and lambda
The suggestK and suggestLambda functions can aid in selecting k and lambda. We want to find the smallest
The `suggestK` and `suggestLambda` functions can aid in selecting k and lambda. We want to find the smallest
k for which the increase in entropy metric begins to level off (an "elbow" in the plot). Similarly, we want the
smallest lambda for which the alignment metric stabilizes.
```r
suggestK(ligerex) # plot entropy metric to find an elbow that can be used to select the number of factors
suggestLambda(ligerex, k) # plot alignment metric to find an elbow that can be used to select the value of lambda
```

## Updating the Factorization
## Updating the factorization
If we want to add new data, change k or lambda, or re-analyze a subset of the data, the functions
below provide an efficient method of updating. This is much faster than the naive approach of
simply re-running the optimizeALS algorithm.
Expand All @@ -86,3 +154,18 @@ ligerex = optimizeNewData(ligerex, newdata = list(name3 = dge3, name4 = dge4),
#cell.subset is a list of cells to retain from each dataset
ligerex = optimizeSubset(ligerex, cell.subset)
```

## Integration between liger and Seurat
We can easily create `liger` objects from Seurat v2 objects (and vice versa), while keeping calculated
features from the original objects.
```r
# Create liger object from two separate Seurat objects, keeping union of top 2000 highly variable
# genes from each object
ligerex = seuratToLiger(list(seurat1, seurat2), names = c('name1', 'name2'), num.hvg.info = 2000)
# Create liger object from single integrated Seurat object, keeping CCA factorization,
# splitting datasets by Seurat meta.var column "original"
ligerex = seuratToLiger(seurat_obj, meta.var = 'original', cca.to.H = T)
# Create Seurat object from liger object, each dataset
seurat_obj = ligerToSeurat(ligerex)

```
Loading

0 comments on commit db8bdbc

Please sign in to comment.