PRESM stands for Personalized Reference Editor for Somatic Mutation discovery. Other reference genome editors generate a single diploid reference genome, which may distribute reads across two sites and impair the soundness of the downstream statistical framework; PRESM instead provides two haploid reference genomes. The PRESM pipeline involves three steps. First, germline mutations are discovered by another tool, e.g., GATK, and are used to make personalized references for calling somatic mutations. Second, a reference genome composed of all personal variants (both heterozygous and homozygous sites) is used as a “decoy” to capture the heterozygous variants in reads. Third, PRESM modifies the reads by replacing all heterozygous alleles with the corresponding reference alleles and maps the modified reads back to another personalized reference genome that contains only the homozygous changes. The output of this step is a BAM file ready for use by any somatic mutation caller. We intend to offer long-term maintenance for PRESM and to keep adding new functions to it.
PRESM is a batteries-included JAR executable; therefore no installation is needed aside from Java 8.
Please download the executable PRESM.jar from the latest release and run it using the standard command for Java packages:
java [-Xmx] -jar PRESM.jar
Building the project from source will require Apache Maven 3.6.1. First, clone the repository to a local folder.
git clone git@github.com:theLongLab/PRESM.git
The dependency JARs must then be downloaded separately and installed to the local repository. Refer to pom.xml for the groupId, artifactId, and version of each dependency; alternatively, use your own naming and change pom.xml to match.
mvn install:install-file \
-Dfile=path-to-jar \
-DgroupId=group-id \
-DartifactId=artifact-id \
-Dversion=version \
-Dpackaging=jar
Afterwards, compile the project; the JAR will be in the target/ folder.
mvn package
- Processing variant files generated by GATK, Pindel, or other variant-calling software: combining two variant files (one for SNPs and one for indels), selecting homozygous or heterozygous variants, and removing variants with duplicated coordinates.
- Generating the personalized reference genome according to the germline mutations provided by the users.
- Generating modified background database files according to the personalized reference genomes; for example, personalized versions of dbSNP, db.Indel, and cosmic.vcf can be generated. (Several downstream somatic mutation callers require these files.)
- Mapping the coordinates of somatic variants called against the personalized reference genome to the coordinates of the universal reference genome.
- Replacing the alternative alleles with reference bases according to the heterozygous variants provided by the users.
All the functions are used as:
java [-Xmx] -jar /path/to/presm.jar <options>
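When PRESM is driven from a script, the invocation pattern above can be assembled programmatically. The following is a minimal Python sketch and not part of PRESM: the helper name `presm_command`, the jar path, and the heap size are illustrative assumptions. It uses only the `-F` function selector and option names that appear in this document.

```python
import subprocess

def presm_command(jar, function, options, heap=None):
    """Build the argument list for one PRESM call (hypothetical helper).

    `options` maps documented option names (e.g. "-R") to their values;
    a value of None marks a bare switch such as -removeduplicates.
    """
    cmd = ["java"]
    if heap:  # e.g. "4g" becomes -Xmx4g
        cmd.append(f"-Xmx{heap}")
    cmd += ["-jar", jar, "-F", function]
    for name, value in options.items():
        cmd.append(name)
        if value is not None:
            cmd.append(str(value))
    return cmd

cmd = presm_command(
    "/path/to/presm.jar",  # assumed location of the jar
    "CombineVariants",
    {"-R": "ref.fasta", "-variant1": "input1.vcf",
     "-variant2": "input2.vcf", "-O": "output.vcf"},
    heap="4g",
)
print(" ".join(cmd))
# To actually run the command (requires Java and the jar):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.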
CombineVariants: Combine two variant call files according to the reference genome.
> -F CombineVariants -R ref.fasta -variant1 input1.vcf -variant2 input2.vcf -O output.vcf
Parameters:
- -R: input the reference genome file
- -variant1: input variant file 1 (in vcf format)
- -variant2: input variant file 2 (in vcf format)
- -O: output the combined variant call file in vcf format
SelectGenotype: Select homozygous or heterozygous variants in the variant call file provided by the users.
> -F SelectGenotype -genotype homo[heter] -variants input.vcf -O output.vcf
Parameters:
- -genotype: specify the genotype of the variants to select (homozygous or heterozygous)
- -variants: input the variants in vcf format
- -O: output the specified genotype variants in vcf format
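The idea behind genotype selection can be illustrated with a short Python sketch. This is a conceptual example only, not PRESM's implementation: it assumes the genotype is read from the `GT` field of the first sample column, and the function names are hypothetical. The criteria PRESM actually applies are not described in this document.

```python
def genotype_class(gt):
    """Classify a VCF GT string such as '0/1' or '1|1'.

    Returns 'heter', 'homo' (homozygous for a non-reference allele),
    or None for homozygous-reference, missing, or non-diploid calls.
    """
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    if alleles[0] != alleles[1]:
        return "heter"
    return "homo" if alleles[0] != "0" else None

def select_genotype(vcf_lines, wanted):
    """Keep header lines and records whose first sample matches `wanted`."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        format_keys = fields[8].split(":")   # FORMAT column
        sample = fields[9].split(":")        # first sample column
        gt = sample[format_keys.index("GT")]
        if genotype_class(gt) == wanted:
            kept.append(line)
    return kept

records = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:DP\t1/1:28",
]
heterozygous = select_genotype(records, "heter")
homozygous = select_genotype(records, "homo")
```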
RemoveOverlaps: Remove overlapping variants in a variant call file.
> -F RemoveOverlaps -R ref.fasta -variants input.vcf -O output.vcf
Parameters:
- -R: input the reference genome file
- -variants: input the variant in vcf format
- -O: output the duplicated variant in vcf format
SortVariants: Sort variants according to the reference genome coordinates.
> -F SortVariants -R ref.fasta -variants input.vcf -O output.vcf
Parameters:
- -R: input the reference genome file
- -variants: input the variant in vcf format
- -O: output the sorted variant in vcf format
MakePersonalizedReference: Generate personalized reference genome according to the germline mutations provided by the users.
> -F MakePersonalizedReference -I ref.fasta -germlinemutations input.vcf -O output.fa [-intervals input.intervals] [-genotype homo/heter]
Parameters:
- -I: input the reference genome file
- -germlinemutations: input the germline mutations in vcf format
- -O: output the personalized reference genome in fasta format
Options:
- -intervals: specify the region of variants
- -genotype: specify the genotype of variants
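To make the notion of a personalized reference concrete, here is a minimal Python sketch that applies a set of variants to a reference sequence. It is an illustration under stated assumptions, not PRESM's code: the function name `apply_variants` is hypothetical, variants are given as `(pos, ref, alt)` tuples with 1-based VCF positions on one chromosome, and they are assumed not to overlap.

```python
def apply_variants(reference, variants):
    """Return `reference` with non-overlapping variants applied.

    `variants` is a list of (pos, ref, alt) tuples using 1-based
    positions, following the VCF POS/REF/ALT convention.
    """
    seq = reference
    # Apply the right-most variant first so that an edit never shifts
    # the coordinates of variants that are still to be applied.
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1
        if seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq

germline = [
    (2, "C", "T"),     # SNP
    (4, "T", "TGG"),   # insertion of GG after position 4
    (7, "GT", "G"),    # deletion of the T at position 8
]
personalized = apply_variants("ACGTACGTAC", germline)
print(personalized)  # ATGTGGACGAC
```

Checking the REF allele against the sequence before editing catches variant files that were called against a different reference.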
MakePersonalizedVariantsDB: Generate personalized variants database files according to the germline mutations provided by the users.
> -F MakePersonalizedVariants -I input.vcf -O output.vcf -variants variant.vcf [-intervals input.intervals] [-genotype homo/heter] [-removeduplicates]
Parameters:
- -I: input the variants database in vcf format
- -O: output the personalized variants database in vcf format
- -variants: input the mutations in vcf format
Options:
- -intervals: specify the region of variants
- -genotype: specify the genotype of variants
- -removeduplicates: remove duplicated variants
MapVariants: Map the personalized reference genome-based coordinates of the variants to their corresponding coordinates in the universal reference genome.
> -F MapVariants -I input.vcf -O output.vcf -germlinemutations variant.vcf [-intervals input.intervals] [-genotype homo/heter] [-removeduplicates]
Parameters:
- -I: input the somatic mutations in vcf format
- -O: output the somatic mutations being mapped to the universal reference genome in vcf format
- -germlinemutations: input the germline mutations in vcf format
Options:
- -intervals: specify the region of variants
- -genotype: specify the genotype of variants
- -removeduplicates: remove duplicated variants
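The coordinate conversion described for MapVariants can be sketched in Python as follows. This is a conceptual illustration, not PRESM's algorithm: the function name `to_universal` is hypothetical, germline variants are assumed to be non-overlapping `(pos, ref, alt)` tuples in universal 1-based coordinates with VCF-style anchored indels, and all of them are assumed to have been applied to the personalized reference. Bases inside a germline insertion have no universal coordinate, so the sketch reports them as an offset from the insertion's anchor base.

```python
def to_universal(pos, germline):
    """Map a 1-based personalized position to universal coordinates.

    Returns (universal_pos, insertion_offset). The offset is 0 when the
    base exists in the universal reference, and k > 0 when the base is
    the k-th inserted base after `universal_pos`.
    """
    shift = 0  # cumulative length change from variants upstream of pos
    for u, ref, alt in sorted(germline):
        start = u + shift              # where this variant begins in personalized coordinates
        if pos < start:
            break
        k = pos - start                # 0-based offset into the ALT allele
        if k < len(alt):
            if k < len(ref):
                return (u + k, 0)      # base aligned to a reference base
            return (u + len(ref) - 1, k - len(ref) + 1)  # inside inserted bases
        shift += len(alt) - len(ref)
    return (pos - shift, 0)

germline = [(2, "C", "T"), (4, "T", "TGG"), (7, "GT", "G")]
print(to_universal(7, germline))   # (5, 0)
print(to_universal(6, germline))   # (4, 2)
print(to_universal(10, germline))  # (9, 0)
```

With the universal sequence `ACGTACGTAC` and these three variants, the personalized sequence is `ATGTGGACGAC`, so personalized position 7 (`A`) corresponds to universal position 5, and personalized position 6 is the second inserted `G` after universal position 4.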
ReplaceGenotype: Replacing the alternative alleles in the sequencing reads with reference bases according to the heterozygous variants provided by the users.
> -F ReplaceGenotype -I input.sam -germlinemutations germlinemutations.vcf -O output.sam -readlength len [-intervals input.intervals] [-genotype homo/heter]
Parameters:
- -I: input the sequence alignment map file in sam format
- -variant: input the germline mutations in vcf format
- -O: output the replaced sequence alignment map file in sam format
- -readlength: the sequencing read length
Options:
- -intervals: specify the region of variants
- -genotype: specify the genotype of variants
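The allele replacement step can be illustrated with a deliberately simplified Python sketch. It is not PRESM's implementation: it handles heterozygous SNPs only, assumes an ungapped read alignment, and assumes the read start and the variant positions share one 1-based coordinate system. The function name `replace_het_alleles` is hypothetical.

```python
def replace_het_alleles(read_seq, read_start, het_snps):
    """Replace alternative alleles in a read with the reference bases.

    `read_start` is the 1-based position of the first read base and
    `het_snps` maps a 1-based position to a (ref, alt) pair of single bases.
    """
    bases = list(read_seq)
    for pos, (ref, alt) in het_snps.items():
        i = pos - read_start           # index of the site within the read
        if 0 <= i < len(bases) and bases[i] == alt:
            bases[i] = ref             # only touch reads carrying the alt allele
    return "".join(bases)

restored = replace_het_alleles("ATGTA", 1, {2: ("C", "T")})
print(restored)  # ACGTA
```

Reads that cover the site but carry neither allele are left unchanged, so sequencing errors and candidate somatic mutations at that position are preserved.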
ViewFasta: View specified region of sequence in reference genome.
> Usage: -F ViewFasta -R ref.fasta [-L input.list] [-region specified region]
Parameters:
- -R: input the reference genome file
- -L: input a region list file; use this to view multiple regions of the chromosome
- -region: input one specified region; use this to view a single region of the chromosome
Example of region specifications format:
chr1: Output whole sequence of chromosome 1 in the reference genome.
chr2: 5000 Output the chromosome 2 sequence which begins at base position 5000 and ends at the end of chromosome 2.
chr3: 500-600 Output the chromosome 3 sequence which begins at base position 500 and ends at base position 600 of chromosome 3.
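The three region forms above can be parsed with a few lines of Python. This is a sketch for illustration, not PRESM's parser: the function names are hypothetical, whitespace after the colon is tolerated because the exact spacing PRESM expects is not stated here, and positions are assumed to be 1-based and inclusive.

```python
def parse_region(spec):
    """Split a region string into (chromosome, start, end).

    `start` and `end` are None when the region string omits them.
    """
    chrom, _, rest = spec.partition(":")
    chrom, rest = chrom.strip(), rest.strip()
    if not rest:
        return (chrom, None, None)          # whole chromosome
    start, _, end = rest.partition("-")
    return (chrom, int(start), int(end) if end.strip() else None)

def extract(sequence, start, end):
    """Return the 1-based, inclusive slice [start, end] of `sequence`."""
    begin = 0 if start is None else start - 1
    return sequence[begin:end]              # end=None runs to the last base

whole = parse_region("chr1")
open_ended = parse_region("chr2: 5000")
bounded = parse_region("chr3: 500-600")
```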
SomaticMutationsOnGermlineInsertion: Output the relative coordinate of somatic mutations located on germline insertions.
> -F SomaticMutationsOnGermlineInsertion -germlinemutations germlinemutation.vcf -I input.vcf -O output.txt [-intervals input.intervals] [-genotype homo/heter]
Parameters:
- -germlinemutations: input the germline mutations in vcf format
- -I: input the somatic mutations (using the personalized coordinate system) in vcf format
- -O: output the locations of somatic mutations on germline insertions
Options:
- -intervals: specify the region of variants
- -genotype: specify the genotype of variants
If you find PRESM useful towards your project, please cite the publication as located here.
- Chen Cao, [email protected]
- Quan Long, [email protected]