Skip to content

Running GenEra

Josué Barrera Redondo edited this page Aug 23, 2024 · 5 revisions

Running GenEra

Here we provide the instructions for running GenEra and the information regarding the parameter options. Users should make sure that the required database(s) are set up.

Quick start for the impatient

The basic usage of GenEra is:

genEra -q [query_sequences.fasta] -t [query_taxid] -b [path/to/nr] -d [path/to/taxdump]

However, several optional arguments can be included to get the best results possible:

genEra -q [query_sequences.fasta] -t [query_taxid] -b [path/to/nr] -d [path/to/taxdump] \ 
-a [protein_list.tsv] -f [nucleotide_list.tsv] -s [evolutionary_distances.tsv] -i [true] -n [many threads]

If you prefer to use Foldseek instead of DIAMOND, the basic usage is:

genEra -Q [query_structure_directory] -t [query_taxid] -B [path/to/alphafoldDB] -d [path/to/taxdump]

And of course, it can also be optimized to get better results:

genEra -Q [query_structure_directory] -t [query_taxid] -B [path/to/alphafoldDB] -d [path/to/taxdump] \ 
-A [structural_prediction_list.tsv] -s [evolutionary_distances.tsv] -i [true] -n [many threads]

Arguments and input files

GenEra always requires this input:

Depending on which alignment method you want to use, you'll have to choose between these two input files:

  • -q     A standard FASTA file containing all the protein sequences of your species of interest. Make sure all your sequence headers are unique and make sure to avoid regular expressions like \t or \n in the sequence headers, as these will interfere with the pipeline.

  • -Q     A directory containing all the protein structural predictions of your species of interest in PDB format. Make sure that all your PDB files have unique names and that they are uncompressed. Also make sure to avoid regular expressions like \t or \n in the PDB file names, as these will interfere with the pipeline.

Additionally, GenEra requires one of the following three database files:

  • -b     A locally installed NR database for DIAMOND (or any other database whose sequences can be automatically traced back to NCBI taxids). This database option is only compatible with FASTA input files (-q).

  • -B     A locally installed AlphaFold database for Foldseek (you can choose between several options that can be automatically downloaded using Foldseek). This database option is only compatible with PDB input files (-Q).

  • -p     A pre-generated DIAMOND/Foldseek table. This file is automatically generated by GenEra in the first step of the pipeline, either by choosing DIAMOND or Foldseek, and can be given to GenEra in case the user decides to re-run the pipeline (e.g., fine-tuning the parameters). The first column contains the query sequence IDs (qseqid), the second coulmn contains the target sequence IDs (sseqid), the third column contains the e-value of the matching alignment (evalue), the fourth column contains the bitscore of the alignment (bitscore) and the fifth column contains the taxonomy ID of the target sequence (staxids).

And one of the following three files:

  • -d     The location of the NCBI taxonomy files, colloquially known as the taxdump. Make sure the taxdump directory contains the files nodes.dmp, names.dmp and fullnamelineage.dmp. These files are used by NCBItax2lin to create a ncbi_lineages file and for GenEra to organize this file in the correct hierarchical order.

  • -r     The location of the uncompressed ncbi_lineages file generated by NCBItax2lin. Once the user runs GenEra for the first time (or runs NCBItax2lin independently), this file can be used to save some time during step 2 of the pipeline. This file is a comma-delimited table with each row representing a specific lineage, and each column representing the taxonomic hierarchies to which that lineage belongs. This file will be automatically modified by genEra to rearrange the columns in a hierarchical order using the fullnamelineage.dmp found in the taxdump. Alternatively, GenEra will try to download the lineage information from the NCBI Taxonomy webpage. IMPORTANT: Some users have experienced memory errors while using the -d argument. If your analysis stops after this is written in the STDOUT Preparings all lineages into a dataframe to be written to disk ..., you are likely experiencing the same issue, which is related to the dependency ncbitax2lin. We uploaded a compressed "ncbi_lineages" file that can be used with -r after uncompressing it. This should give you the exact same result as if you ran the pipeline from scratch with -d.

  • -c     Custom ncbi_lineages file that is already tailored for the query species. GenEra modifies the raw ncbi_lineages file so that the taxid of the query species appears in the first column, and the taxonomic hierarchies of the query species are rearranged from the species level all the way back to the last universal common ancestor (termed “cellular organisms” by the NCBI taxonomy). The file is also modified by GenEra to collapse the phylostrata (i.e., the taxonomic levels) that lack the necessary genomic data to be useful in the analysis. Once this file is generated, the user can re-use this file if they want to run GenEra on the same species with different parameters. By default, GenEra will search for the correct hierarchical order of the query organism’s phylostrata directly from the NCBI webpage. If the user machine is unable to access the NCBI webpage, GenEra will attempt to infer the correct order of the phylostrata directly from the ncbi_lineages file. However, there are some taxonomic hierarchies that the NCBI labels as “clades” (e.g., Embryophyta in the plant lineage), which GenEra cannot automatically assign to their correct taxonomic level when working offline. This option is also useful if the user wants to modify the taxonomic hierarchies that were originally assigned by the NCBI (e.g., an outdated taxonomy that does not reflect the phylogenetic relationships of the query species). This custom table can be easily generated by printing the desired columns of the ncbi_lineages file with awk:

awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) } 1' ncbi_lineages_[date].csv | awk -F "," '{ print $1","$8","$7","$6"," ... }' > custom_table.csv

The user can also incorporate five other inputs, which are optional but can potentially improve the final age assignment:

  • -v     List with protein FASTA files to perform an infraspecies-level analysis against the query proteome (e.g., different strains or subspecies that do not have a NCBI Taxonomy ID). This parameter is particularly useful for detecting candidate de novo genes that emerged recently. The user can give GenEra a precomputed phylogeny with -z or let GenEra automatically estimate the evolutionary relationship between the listed proteomes using OrthoFinder. Make sure the file names are unique and recognizable. Also avoid including inner dots in the name of the FASTA files, as this may result in problems throughout the pipeline.

  • -z     Precomputed phylogeny in NEWICK format for the infraspecies-level analysis. Use this option if you already know the evolutionary relationship of the genomes that you want to incorporate into the analysis. Make sure that the names in the phylogeny match the prefixes of the FASTA files that were listed with -v.

  • -a     Tab-delimited table with additional proteins to be included in the analysis (only compatible when using FASTA sequences as input). This option is particularly useful if the user wants to include proteins from species that are absent from the nr database, such as newly annotated genomes or transcriptomes. It can also help fill the gaps of phylostrata that would otherwise be collapsed due to lack of available genomes in the database. Importantly, each protein in this custom dataset should have unique identifiers, otherwise GenEra will not work properly during the taxonomic assignment of the Diamond hits. The table format consists of the location of one protein FASTA file for each species in the first column and the NCBI taxonomy ID of that species in the second column:

     /path/to/species_1.fasta	taxid_1
     /path/to/species_2.fasta	taxid_2
     /path/to/species_3.fasta	taxid_3
    
  • -f     Tab-delimited table with additional nucleotide sequences (e.g., non-annotated genome assemblies) to be searched against your query proteins with MMseqs2 in a "tblastn" fashion (only compatible when using FASTA sequences as input). This option is extremely useful to improve the detection of orphan genes, since errors in the genome annotations can oftentimes lead to the spurious detection of taxonomically-restricted genes (Basile et al., 2019; Weisman et al., 2022). Importantly, each contig/scaffold/chromosome in this dataset should have a unique identifier (e.g., avoid having multiple >chr1 sequences throughout your genome assemblies). The table format consists of the location of one nucleotide FASTA file for each genome/transcriptome assembly in the first column and the NCBI taxonomy ID of that species in the second column:

     /path/to/assembly_1.fasta	taxid_1
     /path/to/assembly_2.fasta	taxid_2
     /path/to/assembly_3.fasta	taxid_3
    
  • -A     Tab-delimited table with additional structural predictions to be included in the analysis (only compatible when using PDB files as input). This option is useful if the user wants to include protein folding predictions from species that are absent from the AlphaFold database, such as newly annotated genomes. It can also help fill the gaps of phylostrata that would otherwise be collapsed due to lack of available genomes in the database. Importantly, each PDB file in this custom dataset should have a unique identifier, both within and between species, otherwise GenEra will not work properly during the taxonomic assignment of the Foldseek hits. The table format consists of the location of one directory containing all the uncompressed PDB files for each species in the first column and the NCBI taxonomy ID of that species in the second column:

     /path/to/species_directory_1	taxid_1
     /path/to/species_directory_2	taxid_2
     /path/to/species_directory_3	taxid_3
    
  • -s     Tab-delimited table with pairwise evolutionary distances, calculated as substitutions/site, between several species in a phylogeny and the query species. This file is necessary to calculate the detection failure probabilities of the query genes in step 4 using abSENSE (Weisman et al., 2020), which allows GenEra to find gene age assignments and gene-family founder events that cannot be explained by homology detection failure (HDF). Please note that HDF-tested gene age assignments can only be calculated whenever the user gives GenEra evolutionary distances of species that are found in the taxonomic level that immediately precedes the one that is being studied (i.e., the next older taxonomic level), meaning that genEra will not be able to evaluate DHF-tested assignments for the oldest taxonomic level that contains species representatives in the file with evolutionary distances, nor for taxonomic levels that lack an outgroup species in the file with evolutionary distances. Also, note that HDF-tested gene age assignments cannot be calculated for species-specific genes, reason being that detection failure probabilities cannot be calculated for genes that lack any traceable homologs (please read Weisman et al., 2020 for more details). Finally, please make sure that the species from which you have evolutionary distances are also represented in the databases that are being used for the homology search using either DIAMOND or Foldseek. In order to calculate reliable detection failure probabilities, it is important to have as many closely related species as possible in the evolutionary distance file. Extracting the pairwise distances between any pair of species can be easily achieved with the ape package in R (Paradis et al., 2019), if you have a phylogenetic tree in NEWICK format:

library(ape)
tree<-read.tree(file = "phylogenetic_tree.nwk")
distances<-cophenetic.phylo(x = tree)
write.table(distances,"pairwise_distances.tsv", row.names = TRUE, sep = "\t", quote = FALSE)

     Afterwards, the distance table should contain the species taxid in the first column and the distance in substitutions per site in the second column (NOTE that the query species is also included in the table, with an evolutionary distance of 0):

	query_sp_taxid	0
	species_1_taxid	distance_1
	species_2_taxid	distance_2
  • -i     Boolean argument that, when true, generates an additional output file with the best sequence/structure hit responsible for the oldest gene age assignment for each of the query genes. This file is mainly used to verify whether the oldest age assignment of genes are not driven by false positive matches to the database. Established by default as false to save computing time.

Fine-tunning and other useful arguments

These arguments can be modified to suit the user's needs, but keeping the default parameters is usually fine:

  • -n     Number of threads to run GenEra. We suggest to modify this parameter in accordance with the user's needs and resources. Running GenEra is computationally expensive, so using a small amount of threads will result in long running times. By default, GenEra uses all the available threads in the system.

  • -o     Path to the directory where you want to store the final output files (GenEra will automatically create this directory if it doesn't exist). Many users like to keep things tidy and separate the output files to a specific location on their computer. By default, GenEra stores the output files within the working directory.

  • -l     Taxonomic representativeness threshold below which a gene will be flagged as putative genome contamination or the product of a horizontal gene transfer event. The threshold is established as 30% by default, but it can be freely modified by the user. Please refer to GenEra's preprint for a detailed explanation of taxonomic representativeness.

  • -e     E-value threshold for DIAMOND or Foldseek, in order to consider a sequence match as a homolog. This value is established as 1e-5 by default, but the user can modify it to a more relaxed or stringent threshold.

  • -u     Additional options to feed DIAMOND or Foldseek, based on the user's preferences. This can be useful if, for example, the user wants to filter the homology results based on identity thresholds or query coverage thresholds, instead of the e-value. Users should input the additional commands in quotes, using the original arguments from DIAMOND (for example: -u "--id 30") or from Foldseek (for example: -u "-c 0.8 --cov-mode 0").

  • -m     Minimum percentage of matches between your query sequences and another species to consider it useful for the gene age assignment. This parameter only determines whether a taxonomic level will be collapsed or not during step 2 of the pipeline. The parameter is implemented so that GenEra collapses the taxonomic levels where all the representative species lack WGS data in the databases. By default, this threshold is empirically established as 10% of sequence matches with respect to the number of genes in the query species. For example, if the query species has 26,000 genes, and all the representative species for a given phylostratum contain less than 2,600 matches to the query proteins, then the taxonomic level is "collapsed" or removed from the analysis.

  • -x     Alternative path where the user would like to store the temporary files, as well as the DIAMOND/Foldseek results. GenEra generates HUGE temporary files (usually within the range of dozens to hundreds of Gigabytes in size), so users might have a hard time storing all these files within the working directory. Thus, the users can redirect the temporary files to a location where there is enough storage space for GenEra to run correctly. By default, the files are stored in a temporary directory within the working directory (tmp_[TAXID]_[RANDOMNUM]/), which is automatically created by GenEra.

  • -y     Modify the sensitivity parameter in DIAMOND. By default, GenEra runs DIAMOND in sensitive mode to retrieve the highest amount of homologs in a reasonable amount of time. Users can decide to run DIAMOND in ultra-sensitive mode to achieve slightly higher sensitivity at the cost of speed, or the users might prefer to sacrifice sensitivity in exchange for faster results by using fast mode (please refer to Figure 3 of GenEra's paper to make an informed decision while modifying this parameter).

  • -M     Adjust the amount of prefilter made by Foldseek (i.e., the maximum number of hits that are reported in the output file). Unlike DIAMOND, Foldseek lacks the option to print an unlimited amount of sequence hits, which may hamper the ability of the user to trace back very deep evolutionary homologs for the query proteins. We prefer to keep this parameter as high as possible, although this decision is usually limited by the amount of RAM and storage that Foldseek will use. By default, GenEra limits the maximum number of hits to 10000, but the user can modify this parameter in accordance with their needs and hardware limitations.

  • -F     Enable (true) or disable (false) the fast version of GenEra at step 3 (fast mode is enabled by default).As of version 1.4.0, GenEra runs faster at the expense of an intensive use of computational resources, particularly RAM usage. Users with limited computational resources may prefer to disable fast mode.

  • -j     When true, GenEra runs an additional search step using JackHMMER against the online Uniprot database for all the genes that could not be traced back to "cellular organisms" using DIAMOND. For this, GenEra uses the R package Bio3D to submit individual jobs to the EMBL-EBI server. We decided to use this method because the local version of HMMER does not output taxids (which are crucial for GenEra), while the online version does. This severely limits the speed of this search step, since it cannot be multithreaded. Additionally, it requires constant access to the internet to work, which might complicate things. Nonetheless, we implemented this step for users who prefer to search for protein homologs using HMM-profile methods over pairwise alignment methods.

  • -k     The user can modify this parameter to decide the starting taxonomic level of the genes that will be re-analyzed when running JackHMMER with -j. By default, GenEra runs JackHMMER over every gene that could be traced back from the second oldest taxonomic level onwards, namely the taxonomic rank 2. However, genes that were assigned within the second oldest taxonomic rank (e.g., Eukaryota, Bacteria, Archaea) will usually have a great amount of sequence hits using JackHMMER, which slows down the process a lot. Therefore, the user can decide to limit the JackHMMER searches to younger taxonomic levels to speed up the process and make the computation times feasible.