[] []

VIRUSBreakend - Viral Integration Recognition Using Single Breakends

VIRUSBreakend is a high-speed viral integration detection tool designed to be incorporated in the whole genome sequence piplines with minimal additional cost. VIRUSBreakend takes on average 1 hour to run on a 100x coverage human sample. Recommended job/VM size is 4 core / 64Gb memory.

This tool is part of GRIDSS - the Genomic Rearrangement IDentification Software Suite.

Citation

Please cite:

VIRUSBreakend: https://doi.org/10.1093/bioinformatics/btab343
GRIDSS2: harnessing the power of phasing and single breakends in somatic structural variant detection https://www.biorxiv.org/content/10.1101/2020.07.09.196527v1

Pre-requisites

VIRUSBreakend is part of the GRIDSS software suite. It can be downloaded from https://github.com/PapenfussLab/gridss/releases

All tools used by VIRUSBreakend must be on PATH including:

java
GRIDSS
Kraken2
RepeatMasker
samtools
bcftools
bwa

Set the GRIDSS_JAR environment variable to the location of the GRIDSS jar file

Reference data

The VIRUSBreakend reference data can be downloaded from AWS: https://virusbreakend.s3.us-east-2.amazonaws.com/virusbreakenddb_20210401.tar.gz

The VIRUSBreakend reference data can also be downloaded from https://melbourne.figshare.com/articles/dataset/virusbreakenddb_20210401_tar_gz/14782299

Reference data contents

VIRUSBreakend uses a standard Kraken2 database augmented with NCBI viral neighbour genomes and associated metadata files.

The reference file database contains the following files:

file	description
library/viral/library.fna	NCBI RefSeq viral genomes
library/added/*.fna	NCBI viral neighbour genomes
*.fai	samtools index of associated .fna file
*.fna.dict	Picard tools sequence dictionary of associated .fna file
hash.k2d	Kraken2 database file
taxo.k2d	Kraken2 database file
opts.k2d	Kraken2 database file
taxonomy/nodes.dmp	Snapshot of NCBI taxonomy nodes at time of database creation
taxonomy/names.dmp	Snapshot of NCBI taxonomy names at time of database creation. Not technically required by VIRUSBreakend but useful for looking up the name of the assigned taxid
seqid2taxid.map	Kraken2 mapping of sequences to NCBI taxonomy IDs
taxid10239.nbr	Table of viral neighbour genomes and their associated RefSeq genome

The exact sources for these files can be determined from the paths and queries in virusbreakend-build.

Regenerating the VIRUSBreakend database

Run virusbreakend-build --db virusbreakenddb to download and generate the reference data. This download the NCBI taxonomic information, sequences, virushostdb, the kraken2-build build process, and generates indexes. The index is around 54GB in size. Be aware that the kraken2 build process requires around 150GB of intermediate disk space to download from NCBI and build the index.

WARNING: downloading from NCBI fails regularly. It is likely you will have to manually rerun each failing step in virusbreakend-build to get a functioning database.

virusbreakend-build requires:

An internet connection
GRIDSS
samtools
Kraken2
- dustmasker (part of the NCBI BLAST+ package)
EDirect utilities

These can be installed manually, or the BioConda samtools kraken2 and entrez-direct packages can be used.

The the generated directory (virusbreakenddb) contains the files used in the build process. The file virusbreakend.db.virusbreakenddb.tar.gz contains the minimal subset of the virusbreakenddb data required by VIRUSBreakend.

Running VIRUSBreakend

To run VIRUSbreakend, ensure that the database has been build and run the following command:

virusbreakend \
	--kraken2db virusbreakenddb \
	--output sample.virusbreakend.vcf \
	--reference host_reference.fa \
	sample.bam

Reference genomes with viral sequences

Some reference genomes include viral sequence. hs37d5, hs38DH, and GRCh38_full_analysis_set_plus_decoy_hla all include EBV. For VIRUSBreakend to report EBV infection, reads aligned to the EBV sequence ( or chrEBV)

By default, VIRUSBreakend considered any read mapped if it maps to chrEBV or any viral reference sequence in the VIRUSBreakend database (library/viral/*.fna.fai or library/added/*.fna.fai). When checking for matches, VIRUSBreakend uses the exact name, the kraken stripped name, and also the version stripped named. For example, the kraken2 database contains kraken:taxid|10376|NC_007605.1 so VIRUSBreakend will exclude kraken:taxid|10376|NC_007605.1, NC_007605.1, and NC_007605 (thus picking up the NC_007605 EBV reference in hs37d5.fa)

A custom list can be supplied with the --viralreferences command line parameter.

Output

VIRUSBreakend outputs:

A VCF containing the integration breakpoints
The kraken2 report of the virus(es) for which viral integration was run upon
Coverage statistics of the virus(es) for which viral integration was run upon

summary.tsv files

field	meaning
taxid_genus	NCBI taxonomy ID of genus of viral reference
name_genus	NCBI taxonomy genus name
reads_genus_tree	Number of reads assigned to any virus in that genus by kraken2
taxid_species	NCBI taxonomy ID of species of viral reference
reads_species_tree	Number of reads assigned to that species (and any sub-species) by kraken2
name_species	NCBI taxonomy species name
taxid_assigned	NCBI taxonomy ID of viral reference
name_assigned	NCBI taxonomy name of viral reference
reads_assigned_tree	Number of reads assigned to the viral reference or taxonomic descendents by kraken2
reads_assigned_direct	Number of reads assigned to the viral reference taxid by kraken2
reference	name of viral reference used.
reference_taxid	NCBI taxonomy ID of viral reference used. This can be a child taxid of taxid_assigned.
reference_kmer_count	count of read kmers matching reference used
alternate_kmer_count	count of read kmers matching next best reference candidate
#rname	name of adjusted viral reference
startpos	start position of adjusted viral reference. Always 1
endpos	end position of adjusted viral reference. Always equal to the viral contig length
numreads	Number of reads mapped to adjusted viral reference
covbases	Number of bases with at least 1 read mapped
coverage	Percentage of viral sequence with at least one read mapped
meandepth	Mean alignment depth
meanbaseq	Mean base quality of bases mapped to adjusted viral reference
meanmapq	Mean mapping quality of reads mapped to adjusted viral reference
integrations	Number of integration breakpoints found.
QCStatus	QC status of viral integration

Each viral integration should have 2 integration breakpoints (one for the start, one for the end) although there may be more if the integration site is rearranged or less if one side of the integration site was missed.

QCStatus

QCStatus can be one of the following:

"" The empty string indicates no warning or errors associated with this virus
LOW_VIRAL_COVERAGE

Less than 10% of the virus has any coverage. This typically occurs when the viral sequence has a short sequence homology with non-viral sequence. This record is likely a false positive and should be ignored.

EXCESSIVE_VIRAL_COVERAGE

Regions of the virus have depth of coverage so high, that GRIDSS was unable to call integrations in these regions. The exact bounds of excluded regions can be found in the *.coverage.blacklist.bed files in the virusbreakend.working directory

ASSEMBLY_DOWNSAMPLED

Viral coverage was sufficiently high that some regions were downsampled in the GRIDSS assembly process. With the assembly downsampling changes in GRIDSS 2.11.1, viral integration have in these region have probably been correctly called but the QUAL score is lower than it should be. The exact bounds of excluded regions can be found in the *.bed.excluded_*.bed files in the virusbreakend.working directory

CHILD_TAXID_REFERENCE

The viral genome chosen by VIRUSBreakend does not match the NCBI taxonomic classification assigned by Kraken2. This typically occurs in scenarios such as HPV-45 where Kraken2 assigns reads to the parent taxon due to sequence commonality between HPV-45 and HPV-18. This is expected behavour for these viral taxa. The taxon assigned reference_taxid

UNCLEAR_TAXID_ASSIGNMENT

Less than 60% of reads assigned to nominal kraken2 assigned taxon were directly assigned (the rest were assigned to child taxa). The taxonomic assignment may be unclear - possibly due to co-infection by multiple viruses within the genus.

Ambiguous insertions

The key differentiator of VIRUSBreakend is the ability to detect and classify integration sites in repetative sequences such as centromeres. Due to the repetative nature of these region, such integration sites cannot be unambigously placed in the host genome. In such cases, the mapq encoded in the BEALN field will be 0 and the field may contain multiple candidicate integration sites. Integration sites in which the reported position is ambiguous have a LOW_MAPQ FILTER applied.

The INSRM field contains the repeat sequences identifed in the integration site host sequences. These annotations can be used to classify ambigous integration sites.

By default, VIRUSBreakend filters out potential integrations with signficant overlap with simple or low complexity repeats since such integrations are typically false positives. If a more sensitive call set is required, this filter can be removed by rerunning the final VirusBreakendFilter steps with the MINIMUM_REPEAT_OVERLAP parameter set to any value greater than one (ie require more than 100% repeat overlap).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VIRUSBreakend_Readme.md

VIRUSBreakend_Readme.md

VIRUSBreakend - Viral Integration Recognition Using Single Breakends

Citation

Pre-requisites

Reference data

Reference data contents

Regenerating the VIRUSBreakend database

Running VIRUSBreakend

Reference genomes with viral sequences

Output

summary.tsv files

QCStatus

Ambiguous insertions

Files

VIRUSBreakend_Readme.md

Latest commit

History

VIRUSBreakend_Readme.md

File metadata and controls

VIRUSBreakend - Viral Integration Recognition Using Single Breakends

Citation

Pre-requisites

Reference data

Reference data contents

Regenerating the VIRUSBreakend database

Running VIRUSBreakend

Reference genomes with viral sequences

Output

summary.tsv files

QCStatus

Ambiguous insertions