Skip to content

RNAVirHost: a machine learning-based method for predicting hosts of RNA viruses through viral genomes

License

Notifications You must be signed in to change notification settings

GreyGuoweiChen/VirHost

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

55 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RNAVirHost: a machine learning-based method for predicting hosts of RNA viruses through viral genomes

Viruses are obligate intracellular parasites that depend on living organisms for their replication and survival. While the application of metagenomic high-throughput sequencing technologies have facilitated the discovery of the viral dark matter, how to determine the hosts of the metagenome-originated viruses remains challenging owing to the complex composition of the metagenomic sequencing samples. The high genetic diversity of RNA viruses poses a great challenge to the alignment-based methods.

Here, we introduce RNAVirHost, a machine learning-based tool that predicts the hosts of RNA viruses solely based on viral genomes. It takes complete RNA viral genomes as input and predicts the natural host groups from kingdom level to order level.

RNAVirHost is designed as a two-layer classification framework to hierarcically predict the host groups of query viruses. Layer 1 contains 5 branches (Chordata, Invertebrate, Viridiplantae, Fungi, Bacteria), which are categorized into kingdom and phylum level. Layer 2, designed for Chordata subtree, has 10 leaves, which are at the class and order level. RNAVirHost utilizes a hierarchical approach to predict the host lineage along the tree. It combines various factors, including virustaxonomic information, viral genomic traits, and sequence homology, to achieve accurate host prediction. To provide different levels of confidence in the predictions, RNAVirHost incorporates prediction score cutoffs. This cutoff allows users to classify the predictions into different confidence level.

Dependency:

  • python 3.x
  • BLAST 2.12.0
  • Prodigal 2.6.3
  • xgboost 1.7.4
  • pandas 2.0.3
  • scikit-learn 1.1.3
  • biopython 1.83
  • numpy 1.23.5

Quick Installation

  1. We highly recommend using conda/mamba to install all dependencies.

    conda install -c bioconda rnavirhost
    
  2. Or you could build the environment from GitHub:

    git clone https://github.com/GreyGuoweiChen/VirHost.git
    cd VirHost
    
    # Create the environment and install the dependencies using conda or mamba
    mamba env create -f environment.yml
    
    # Activate the environment
    conda activate rnavirhost
    
    # Distribute RNAVirHost to your conda environment
    pip install .
    
  3. (Alternative) You could install it using pip to avoid all the dependencies, but may fail.

    pip install rnavirhost
    
    # You can check the installation by calling:
    conda list rnavirhost
    

Usage:

To enable RNAVirHost's prediction, the viral taxonomic information is required and its format (.csv) is as below.

y|virus order
NC_000858.1 Ortervirales
NC_019922.1 Norzivirales
... ...

The file comprises (r+1) rows and two columns, where (r+1) represents the r sequences in the query fasta file and an additional header row. The first column denotes the accession numbers of sequences (sequences' ID), and the second column represents the corresponding order labels. If a label falls outside our designated range, it is acceptable to input them to the host prediction stage. We will output the corresponding sequence ID in a separate file.

  1. For user convenience, we provide a simple alignment-based method for classifying virus sequences at the order level using BLASTN. The code for this method is shown below. Generally, the order-level predictions provided by BLASTN are sufficiently accurate. However, if users desire to refine the classification using other programs, they can follow the file format outlined in the table above.
rnavirhost classify_order [-i INPUT_CONTIG] [-o TAXONOMIC_RESULT]
  1. The input files include the fasta file and the corresponding taxonomic information table.
rnavirhost predict [-i INPUT_CONTIG] [--taxa TAXONOMIC_INFORMATION_OF_INPUT] [-o OUTPUT_DIRECTORY]

Programs' Option

classify_order:

-i: The input contig file in fasta format.
-o: The taxonoic information of query viruses at order level (default: RVH_taxa.csv).

predict:

-i, --input: The input contig file in fasta format.
--taxa: The input csv file corresponding to the virus order labels of the queries (default: RVH_taxa.csv).
-o, --output: The output directory (default: RVH_result).

Output

The output format is as:

y|virus order pred|L1 pred|L2 evidence
NC_000858.1 Ortervirales Chordata Primates pred_high_confidence
NC_001426.1 Norzivirales Bacteria assign
... ... ... ... ...

The evidence list has 5 labels, including pred_high_confidence, pred_low_confidence, assign, BLASTn, and unclassified. The "pred" prefix represent that the prediction is made by the learning model. If the prediction score is higher than the built-in score cutoff, we regard it as a high-confidence prediction and thus give "pred_high_confidence"; otherwise, we will give "pred_low_confidence". The "assign" evidence means that the reference viruses in the same order with the query virus infect a specific host group. So we assign the host group to the virus order without prediction.

Besides, RNAVirHost encodes the query sequences at the protein level. To obtain the protein level representation, we translate the sequences into proteins by prodigal first. For those sequences do not encode proteins, it is challenging to generate a reliable result, and RNAVirHost does not accept these non-coding sequences for higher confidence. However, for user's convenience, we try to predict hosts of these sequences adopting the best alignment strategy by BLASTn. We aligned the query virus against our reference database, assign the host as the label of the best hit reference, and give "BLASTn" evidence. Finally, we give the "unclassified" evidence to query viruses which do not have order information.

Example

# create the taxonomic file first
rnavirhost classify_order -i test/test.fasta 

# predict the host of query viruses
rnavirhost predict -i test/test.fasta -o RVH_result

About

RNAVirHost: a machine learning-based method for predicting hosts of RNA viruses through viral genomes

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages