This repository has been archived by the owner on Aug 17, 2024. It is now read-only.

Collection of tools for the analysis of CpG data

RhettRautsaw/Walter

NOTICE

🚨🚧 This page was created to improve pb-CpG-tools run times and to work around issues with concurrent.futures. These issues have been addressed in the latest versions of pb-CpG-tools; therefore, this page is now archived and users should refer to pb-CpG-tools for methylation analysis. 🚧🚨

Walter

Today, we cook some High Fidelity Meth-ylation scores. This is a collection of tools for the analysis of CpG/5mC data from PacBio HiFi reads aligned to a reference genome (e.g., an aligned BAM).

This is a modified version of PacBio's pb-CpG-tools, in which I broke apart their aligned_bam_to_cpg_scores.py script to use GNU Parallel rather than Python's concurrent.futures. I also took the opportunity to rename the scripts with a Breaking Bad theme, because why not...

The Scripts

  • Walter.py: Walter is the brains of the operation...gathering data, making decisions.
  • Heisenberg.py: Heisenberg is doing the behind the scenes cooking of some (Hi)gh (Fi)delity (Meth)ylation scores.

Together these two scripts obtain a list of CpG/5mC sites and modification probabilities from a BAM file of HiFi reads aligned to a reference genome.

Requirements

To use these tools, the HiFi reads should already contain 5mC base modification tags, generated on-instrument or by using primrose. The aligned BAM should also be sorted and indexed.

Environment

There are several Python packages required to run the Walter.py script. These can easily be installed using conda and the conda_env_cpg.yaml file provided:

# create conda environment
conda env create -f conda_env_cpg.yaml

# activate environment
conda activate cpg

# Install GNU-Parallel
conda install parallel

These packages could also be installed manually. However, the specific package versions listed in the yaml file must be used to ensure a stable environment.

Input BAM

The input BAM should be sorted and have an associated index file (.bai or .csi). The HiFi reads in the aligned input BAM file must contain the SAM tags MM and ML which encode base modifications. The tags are described in the SAM tag specification document. These are the default tags generated by primrose for HiFi CpG datasets.

Tag   Type   Description
MM    Z      Base modifications / methylation
ML    B,C    Base modification probabilities

Notes for ML: The continuous probability range of 0.0 to 1.0 is remapped to the discrete integers 0 to 255 inclusive. The probability range corresponding to an integer N is N/256 to (N + 1)/256.

These tools are explicitly designed for 5mC tags (e.g., MM:Z:C+m) and do not support other modification types or tags in multiple-modification format.
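For illustration, the ML remapping described above can be expressed as a small helper. This is a sketch; `ml_to_prob_range` is not part of these tools or of pb-CpG-tools:

```python
def ml_to_prob_range(n: int) -> tuple[float, float]:
    """Map a discrete ML value N (0-255) to its probability interval [N/256, (N+1)/256)."""
    if not 0 <= n <= 255:
        raise ValueError("ML values are stored as unsigned bytes (0-255)")
    return n / 256, (n + 1) / 256

# A read tagged ML:B:C,230 encodes a 5mC probability between ~0.898 and ~0.902.
low, high = ml_to_prob_range(230)
```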

Usage

Usage:
  python Walter.py -b input.bam -f ref.fasta -o label [options]

Mandatory arguments:
  -b, --bam             FILE    The aligned BAM file.
  -f, --fasta           FILE    The reference fasta file.
  -o, --output_label    STR     Label for output files, which results in [label].bed/bw.
  
Optional arguments:
  -p, --pileup_mode     count/model    Use a model-based approach to score modifications across sites 
                                       (model) or a simple count-based approach (count). [count]
  -d, --model_dir       STR     Full path to the directory containing the model (*.pb files) to 
                                load. [None]
  -m, --modsites        denovo/reference    Only output CG sites with a mod probability > 0 (denovo), 
                                            or output all CG sites based on the supplied reference 
                                            fasta (reference). [denovo]
  -c, --min_coverage    INT     Minimum coverage required for filtered outputs. [4]
  -q, --min_mapq        INT     Ignore alignments with MAPQ < N. [0]
  -a, --hap_tag         STR     The SAM tag containing haplotype information. [HP]
  -s, --chunksize       INT     Break reference regions into chunks of this size for parallel 
                                processing. [500000]
  -t, --threads         INT     Number of threads for parallel processing. [1]
  -h, --help                    Show this help message and exit.
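The -s, --chunksize option implies splitting each reference sequence into fixed-size windows that are then processed in parallel. A minimal sketch of that windowing, using a hypothetical helper rather than the scripts' actual code:

```python
def chunk_regions(ref_lengths, chunksize=500_000):
    """Yield (name, start, end) half-open windows covering each reference sequence."""
    for name, length in ref_lengths.items():
        for start in range(0, length, chunksize):
            yield name, start, min(start + chunksize, length)

# A 1.2 Mb contig splits into three chunks at the default chunk size;
# each chunk can then be dispatched to a separate GNU Parallel job.
chunks = list(chunk_regions({"contig1": 1_200_000}))
```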

Details

The -p, --pileup_mode selects the method used to calculate modification probabilities.

  • model: This approach is preferred. It uses distributions of modification scores and a machine-learning model to calculate the modification probabilities across CG sites. Using this option requires providing the path to the directory containing the model files with the -d, --model_dir flag. The required model is available in the pileup_calling_model/ directory.
  • count: This method is simpler. For a given site, the number of reads with a modification score >0.5 and the number with a score <0.5 are counted, and the modification probability is reported as the percentage of reads scored as modified.
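The count-based scoring described above can be sketched as follows. This is an illustrative helper, not the tools' actual implementation:

```python
def count_mod_probability(read_probs, threshold=0.5):
    """Count-based site score: percentage of informative reads calling the site modified.

    Reads with a score exactly at the threshold are uninformative and ignored.
    """
    mod = sum(1 for p in read_probs if p > threshold)
    unmod = sum(1 for p in read_probs if p < threshold)
    informative = mod + unmod
    return 100.0 * mod / informative if informative else 0.0

# Three of four informative reads call the site modified -> 75.0
score = count_mod_probability([0.9, 0.8, 0.6, 0.1])
```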

The -m, --modsites flag determines which sites will be reported.

  • denovo: This option will identify and output all CG sites found in the consensus sequence from the reads in the pileup. This allows reporting of CG sites with zero modification probability. This mode does not ensure that the reference also displays a CG site (e.g., there could be sequence mismatches between the reads and reference).
  • reference: This option will identify and output all CG sites found in the reference sequences. This allows reporting of CG sites with zero modification probability. This mode does not ensure that aligned reads also display a CG site (e.g., there could be sequence mismatches between the reads and reference).

Using the -a, --hap_tag flag allows an arbitrary SAM tag to be used to identify haplotypes, rather than the default HP tag. The haplotype values must be 0, 1, or 2, where 0 means unassigned/ambiguous.

Outputs

There are bed format (.bed) and bigwig format (.bw) files generated for the complete read set and for each separate haplotype (when available). Per-read information is also written as a tab-delimited file (note that these are the raw calls from primrose and have not been modeled).

In addition, coverage-filtered versions of these files are produced based on the minimum coverage value set with -c, --min_coverage.

Two log files are also produced ([label].Walter.log & [label].Heisenberg.log), which contain useful information.

If haplotype information is absent, the following 5 files are expected:

[label].Heisenberg.reads.gz
[label].Walter.total.[pileup_mode].bed
[label].Walter.total.[pileup_mode].bw
[label].Walter.total.[pileup_mode].mincov[X].bed
[label].Walter.total.[pileup_mode].mincov[X].bw

If haplotype information is present, the following 13 files are expected:

[label].Heisenberg.reads.gz
[label].Walter.total.[pileup_mode].bed
[label].Walter.hap1.[pileup_mode].bed 
[label].Walter.hap2.[pileup_mode].bed
[label].Walter.total.[pileup_mode].bw
[label].Walter.hap1.[pileup_mode].bw 
[label].Walter.hap2.[pileup_mode].bw
[label].Walter.total.[pileup_mode].mincov[X].bed
[label].Walter.hap1.[pileup_mode].mincov[X].bed 
[label].Walter.hap2.[pileup_mode].mincov[X].bed
[label].Walter.total.[pileup_mode].mincov[X].bw
[label].Walter.hap1.[pileup_mode].mincov[X].bw 
[label].Walter.hap2.[pileup_mode].mincov[X].bw

The bed file columns will differ between the -p model and -p count methods, but both share the first six columns:

  1. reference name
  2. start coordinate
  3. end coordinate
  4. modification probability
  5. haplotype
  6. coverage

For -p count, four additional columns are present:

  7. modified site count
  8. unmodified site count
  9. avg mod score
  10. avg unmod score

For -p model, three additional columns are present:

  7. estimated modified site count (extrapolated from model modification probability)
  8. estimated unmodified site count (extrapolated from model modification probability)
  9. discretized modification probability (calculated from estimated mod/unmod site counts)

The bigwig files are in an indexed binary format and contain columns 1-4 listed above, and are preferred for loading CpG/5mC tracks in IGV.

Example Data

An aligned BAM file containing HiFi reads with 5mC tags (HG002.GRCh38.haplotagged.bam) is freely available for download: https://downloads.pacbcloud.com/public/dataset/HG002-CpG-methylation-202202/

The sample is HG002/NA24385 from the Human Pangenome Reference Consortium HG002 Data Freeze v1.0, and is aligned to GRCh38 (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz). There are also four unaligned bam files containing the HiFi reads.

Performance

Walter & Heisenberg are also much faster than the original pb-CpG-tools aligned_bam_to_cpg_scores.py script. The table below compares the original and updated runtimes.

The current -s, --chunksize default of 500,000 requires 1-3 Gb memory per thread with a 30X coverage aligned BAM. A higher coverage dataset would use more memory per thread.

Benchmarks were run using the HG002.GRCh38.haplotagged.bam dataset described in the above section. Peak memory was estimated using 3Gb per thread.

Threads (-t)   Chunk Size (-s)      Original Wall Time (h:m:s)   Walter Wall Time (h:m:s)   Estimated Peak Memory
48             500000 (default)     3:15:56                      2:23:00                    144 Gb
36             500000 (default)     3:58:06                      2:40:00                    108 Gb
24             500000 (default)     5:19:03                      3:24:03                    72 Gb
12             500000 (default)     8:58:01                      6:34:54                    36 Gb
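The peak-memory column is simple arithmetic under the stated assumption of roughly 3 Gb per thread:

```python
def estimated_peak_memory_gb(threads, gb_per_thread=3):
    """Rough peak-memory estimate: threads x assumed per-thread usage (in Gb)."""
    return threads * gb_per_thread

# 48 threads at ~3 Gb each -> 144 Gb; a higher-coverage BAM would raise gb_per_thread.
peak = estimated_peak_memory_gb(48)
```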

Changelog

DISCLAIMER

THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.
