A phylogeny subsampling tool combining geographic proportions and phylogenetic diversity

evolbioinfo/geo_subsampler

Geo subsampler

Geo_subsampler subsamples a given phylogenetic tree to rebalance the samples at different locations according to user-specified proportions. Moreover, for each location the kept samples are chosen in a balanced way over the sampling intervals (e.g. months). Within these constraints, the script uses phylogenetic diversity [Faith 1992] to pick the samples to be removed. Additional options make it possible to keep all the samples before a certain date, and to ensure that a minimal number of samples is picked per location, regardless of the other criteria.
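To illustrate the phylogenetic diversity criterion, here is a minimal Python sketch. It is not the geo_subsampler implementation: the tree encoding (`parent` and `length` dictionaries), the function names `faith_pd` and `drop_least_diverse`, and the greedy one-tip-at-a-time strategy are all assumptions made for this example. It computes rooted phylogenetic diversity as the total length of the branches connecting a set of tips to the root, then drops the tip whose removal loses the least diversity.

```python
def faith_pd(parent, length, tips):
    """Sum the lengths of all branches on the paths from `tips` to the root."""
    seen = set()
    total = 0.0
    for tip in tips:
        node = tip
        # Walk up until the root (no parent) or an already counted branch
        while node in parent and node not in seen:
            seen.add(node)
            total += length[node]
            node = parent[node]
    return total


def drop_least_diverse(parent, length, tips):
    """Hypothetical greedy step: remove the tip whose loss reduces PD the least."""
    # sorted() makes tie-breaking deterministic (alphabetical order)
    redundant = max(sorted(tips), key=lambda t: faith_pd(parent, length, tips - {t}))
    return tips - {redundant}


# Toy tree ((A:1,B:1)X:2,C:4)root; each node maps to its parent and branch length
parent = {"A": "X", "B": "X", "X": "root", "C": "root"}
length = {"A": 1.0, "B": 1.0, "X": 2.0, "C": 4.0}

all_tips = {"A", "B", "C"}
kept = drop_least_diverse(parent, length, all_tips)
print(faith_pd(parent, length, all_tips), sorted(kept))
```

In this toy tree A and B are close relatives, so dropping either of them costs only one unit of branch length, whereas dropping C would cost four.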

Article

If you find geo_subsampler useful, please cite:

A Zhukova, L Blassel, F Lemoine, M Morel, J Voznica, O Gascuel (2021) Origin, evolution and global spread of SARS-CoV-2 CRAS 344(1): 57-75 doi:10.5802/crbiol.29.

Installation

To install geo_subsampler, first install Python 3, then run:

pip3 install geo_subsampler

Input data

As input, one needs to provide a NON-dated phylogenetic tree in newick format and a metadata table containing tip names, locations and sampling dates, in tab-delimited (default) or csv format (to be specified with the '--sep ,' option). To subsample according to user-specified proportions, one should also provide a location case count table: a tab- (or comma-, see Detailed options below) separated table whose first column contains locations and whose second column contains case counts.

Example

The folder example_data contains an example of an input tree (covid.nwk) representing an early SARS-CoV-2 epidemic, the corresponding metadata table (metadata.tab), and a case count table (cases.tab).

The input tree contains 11 167 sampled tips.

The metadata table is a tab-separated file, containing tip ids in the first column, their countries of sampling in the second column, and the sampling dates in the third column:

id              country   sampling date
EPI_ISL_402119  China     30/12/2019
EPI_ISL_402123  China     24/12/2019
EPI_ISL_403962  Thailand  08/01/2020
...             ...       ...
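As a rough illustration of how such a metadata table can be read, the Python sketch below parses the three rows shown above and counts samples per country and month. The helper name `count_by_country_and_month` is hypothetical, and the day/month/year date format is an assumption inferred from the example rows; geo_subsampler itself may parse dates differently.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Inline stand-in for metadata.tab, using the three rows shown above
METADATA = (
    "id\tcountry\tsampling date\n"
    "EPI_ISL_402119\tChina\t30/12/2019\n"
    "EPI_ISL_402123\tChina\t24/12/2019\n"
    "EPI_ISL_403962\tThailand\t08/01/2020\n"
)


def count_by_country_and_month(handle, sep="\t"):
    """Count samples per (country, year-month) in a delimited metadata table."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter=sep):
        # Assumed date format: day/month/year, as in the example rows
        sampled = datetime.strptime(row["sampling date"], "%d/%m/%Y")
        counts[(row["country"], sampled.strftime("%Y-%m"))] += 1
    return counts


counts = count_by_country_and_month(io.StringIO(METADATA))
for (country, month), n in sorted(counts.items()):
    print(country, month, n)
```

For a real file, `io.StringIO(METADATA)` would be replaced by an open file handle, and `sep=","` would be passed for csv input.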

The case count table contains numbers of declared cases for each country:

country   cases
China     84024
Thailand  3017
...       ...
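The Python sketch below shows one way case counts like these could be turned into per-country sample targets. It is an illustration only, not the tool's algorithm: the function `allocate_targets`, the rounding rule, and the numbers of available samples are invented for the example. Each location receives a share of the target size proportional to its case count, raised to a minimum and capped by the number of samples actually available.

```python
def allocate_targets(case_counts, available, size, min_cases=0):
    """Hypothetical helper: split `size` over locations in proportion to cases."""
    total_cases = sum(case_counts[loc] for loc in available)
    targets = {}
    for loc, n_available in available.items():
        quota = round(size * case_counts[loc] / total_cases)
        quota = max(quota, min_cases)          # enforce the per-location minimum
        targets[loc] = min(quota, n_available) # cannot keep more than we have
    return targets


# Case counts from the table above; the available sample counts are made up
case_counts = {"China": 84024, "Thailand": 3017}
available = {"China": 500, "Thailand": 40}

targets = allocate_targets(case_counts, available, size=100, min_cases=5)
print(targets)
```

Because of rounding, the minimum and the availability cap, the targets need not sum exactly to the requested size. A real implementation would have to redistribute the difference.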

The following geo_subsampler command subsamples the input tree according to the case proportions and (as far as possible) evenly across months, in order to keep 1000 tips:

geo_subsample --tree example_data/covid.nwk --metadata example_data/metadata.tab \
--location_column country --date_column "sampling date" --cases example_data/cases.tab \
--output_dir example_data/results --size 1000

The resulting tree is saved in the example_data/results folder as covid.subsampled.0.nwk. This folder also contains the ids of the tips retained in the subsampled tree (covid.subsampled.0.ids) and two tables with subsampling statistics: case_counts.tab and case_counts_per_time.tab.

Detailed options

  • --tree TREE Path to the input phylogeny (NOT time-scaled) in newick format.
  • --metadata METADATA Path to the metadata table containing location and date annotations, in a tab-delimited format.
  • --index_column INDEX_COLUMN Number (starting from zero) of the index column (containing tree tip names) in the metadata table. By default, the first column (number 0) is used.
  • --location_column LOCATION_COLUMN name of the column containing location annotations in the metadata table.
  • --date_column DATE_COLUMN name of the column containing date annotations in the metadata table.
  • --cases CASES Path to the case count table, in a tab-separated format, with two columns. The first column lists the locations, while the second column contains the numbers of declared cases or the proportions for the corresponding locations.
  • --sep SEP Separator used in the metadata and case tables. By default, tab-separated tables are assumed.
  • --start_date START_DATE If specified, all the samples dated before this date will be included in every sub-sampled data set.
  • --size SIZE Target size of the sub-sampled data set (in number of samples). By default, it is set to half of the data set represented by the input tree.
  • --repetitions REPETITIONS Number of sub-sampled trees to produce. By default 1.
  • --output_dir OUTPUT_DIR Path to the directory where the sub-sampled results should be saved.
  • --min_cases MIN_CASES Minimum number of samples to retain for each location.
  • --date_precision {year,month,day} Precision for homogeneous subsampling over time within each location. By default (month), will aim at distributing selected location samples equally over months.
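To make the effect of the date precision more concrete, the following Python sketch groups dates into year, month or day buckets and spreads a location's target as evenly as possible over those buckets. The names `time_bucket` and `split_evenly` and the round-robin strategy are assumptions for illustration; they are not taken from the geo_subsampler code.

```python
from datetime import date

# strftime patterns for the three precisions accepted by --date_precision
FORMATS = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}


def time_bucket(sampled, precision="month"):
    """Map a sampling date to its time bucket at the given precision."""
    return sampled.strftime(FORMATS[precision])


def split_evenly(bucket_sizes, target):
    """Hypothetical round-robin: spread `target` picks evenly over buckets."""
    picks = {bucket: 0 for bucket in bucket_sizes}
    remaining = target
    while remaining > 0:
        # Buckets that still have unpicked samples, in chronological order
        open_buckets = [b for b in sorted(bucket_sizes) if picks[b] < bucket_sizes[b]]
        if not open_buckets:
            break  # fewer samples available than requested
        for bucket in open_buckets:
            if remaining == 0:
                break
            picks[bucket] += 1
            remaining -= 1
    return picks


print(time_bucket(date(2020, 1, 8)))

# Made-up numbers of samples per month for one location
picks = split_evenly({"2020-01": 2, "2020-02": 10, "2020-03": 10}, target=9)
print(picks)
```

Months with few samples are exhausted first, and the remainder of the target is shared between the months that still have samples left.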
