pfeiferd/genestrip-db


Genestrip-DB - a selection of databases for Genestrip

This project contains configuration files and two scripts for generating databases and indexes for metagenomic analysis via Genestrip.

License

Genestrip-DB is free for any kind of use. However, the associated software, Genestrip, has a more restrictive license.

Building and installing

Genestrip-DB requires Maven 2 or 3 and the JRE 1.8 or higher.

To build the databases and indexes, cd to the installation directory genestrip-db. Given a matching Maven and JDK installation, running sh bin/makedbs.sh generates 8 databases (and indexes) of different sizes. The generation process is resource-intensive and may take several days for all databases. Generating the bacterial databases is particularly time-consuming.
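The build steps above can be sketched as a small shell helper. This is only an illustration: the function names `require_tool` and `build_databases` are hypothetical and not part of Genestrip-DB, and the snippet assumes it is run from inside the genestrip-db directory.

```shell
#!/bin/sh
# Sketch: confirm that Maven and Java are on PATH, then start the build.
# require_tool and build_databases are illustrative names, not project scripts.

require_tool() {
    # Returns 0 if the named command is available, 1 otherwise.
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing required tool: $1" >&2
    return 1
}

build_databases() {
    require_tool mvn || return 1
    require_tool java || return 1
    mvn -version          # prints the Maven and Java versions in use
    sh bin/makedbs.sh     # the project's build script; may run for several days
}
```

After sourcing the snippet, calling `build_databases` from the genestrip-db directory starts the build. Checking for the tools first avoids discovering a missing installation only after the script has begun.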

Your machine should have:

  • 650 GB of free disk space - mainly for downloading genomes from NCBI,
  • at least 8 cores - the more the better (some phases of the database generation keep 32 cores 100% busy),
  • 32 GB of main memory,
  • a high bandwidth Internet connection.
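A rough pre-flight check for the disk, core and memory requirements listed above might look like the following sketch. It assumes a Linux machine (memory is read from /proc/meminfo), measures free space on the filesystem of the current directory, and reports sizes in GiB. All function names are hypothetical. Note that the kernel usually reports slightly less than the installed memory, so a machine with nominally 32 GB may be flagged as low.

```shell
#!/bin/sh
# Sketch: compare this machine against the minimums listed above.
# Assumes Linux; all function names are illustrative.

check_min() {
    # check_min LABEL ACTUAL MINIMUM
    # Prints a status line and returns 1 if ACTUAL is below MINIMUM.
    if [ "$2" -ge "$3" ]; then
        echo "ok: $1 = $2 (minimum $3)"
    else
        echo "low: $1 = $2 (minimum $3)"
        return 1
    fi
}

free_disk_gib() {
    # Available space for the current directory; df -Pk reports 1024-byte blocks.
    df -Pk . | awk 'NR == 2 { print int($4 / 1048576) }'
}

cpu_cores() {
    getconf _NPROCESSORS_ONLN
}

total_mem_gib() {
    # MemTotal is given in kB; Linux only.
    awk '/^MemTotal:/ { print int($2 / 1048576) }' /proc/meminfo
}

preflight() {
    status=0
    check_min "free disk (GiB)" "$(free_disk_gib)" 650 || status=1
    check_min "CPU cores" "$(cpu_cores)" 8 || status=1
    check_min "memory (GiB)" "$(total_mem_gib)" 32 || status=1
    return "$status"
}
```

After sourcing the snippet, run `preflight` from the genestrip-db directory. Network bandwidth is not checked.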

The databases are based on and compatible with Genestrip v1.1.

The databases

All databases are purely genomic.

| Name | Category | Description | Database disk size | Sources and references |
|---|---|---|---|---|
| babesia | protozoa | Babesia species from the RefSeq and Genbank which are potentially pathogenic for humans | 936 MB | General knowledge |
| borrelia | bacteria | Borrelia species from the RefSeq which are potentially pathogenic for humans | 844 MB | General knowledge |
| borrelia_plasmid | plasmid | Borrelia species from the RefSeq which are potentially pathogenic for humans | 205 MB | General knowledge |
| chronicb | bacteria | Potentially tick-borne infections which are potentially pathogenic for humans and may lead to chronic diseases | 4.34 GB | Collected from Armin Labs |
| human_virus2 | viral | Viruses from the RefSeq and Genbank which are potentially pathogenic for humans | 89 MB | Extracted from the Viral Zone |
| parasites | invertebrate | Parasitic invertebrate animals from the RefSeq which are potentially pathogenic for humans | 20.26 GB | Collected from the book "Die Parasiten des Menschen" by Heinz Mehlhorn |
| protozoa | protozoa | Protozoan parasites from the RefSeq which are potentially pathogenic for humans | 14.46 GB | Collected from the German book "Die Parasiten des Menschen" by Heinz Mehlhorn |
| vineyard | fungi | Fungal infections of grapevine taken from the RefSeq | 4.08 GB | Collected from the German book "Rebschutz" by Walter Hildebrand, Dieter Lorenz and Friedrich Louis |
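For planning disk space, the database sizes listed above can be summed. The sketch below assumes 1 GB = 1000 MB and covers only the eight listed database sizes, not any temporary files or downloaded genomes. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: total of the "Database disk size" column, in GB.
# Assumes 1 GB = 1000 MB; total_db_size_gb is an illustrative name.

total_db_size_gb() {
    # Sizes in MB, in table order: babesia, borrelia, borrelia_plasmid,
    # chronicb, human_virus2, parasites, protozoa, vineyard.
    printf '%s\n' 936 844 205 4340 89 20260 14460 4080 |
        awk '{ sum += $1 } END { printf "%.1f\n", sum / 1000 }'
}

total_db_size_gb   # prints 45.2
```

The finished databases therefore occupy roughly 45 GB, a small fraction of the 650 GB recommended for the build, which is needed mainly for the genome downloads.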

Note that Genestrip's updateddb-phase accounts for unspecific k-mers and largely avoids false positive counts during matches. To further reduce false positives, all databases (except for vineyard) are built such that k-mers also occurring in the human genome are pushed to the least common ancestor.

Testing the databases borrelia, borrelia_plasmid and chronicb

The script bin/matchticks.sh runs the Genestrip goal matchlr on 11 fastq files taken from this publication. The fastq files are streamed from the corresponding NCBI server. As expected, Genestrip finds DNA from borrelia and other tick-borne infections in these files.

Downloading the ready-made databases

If you don't want to generate them yourself, the databases and indexes can also be downloaded from Google Drive. The Drive folder reflects the state of this project's projects folder after the scripts bin/makedbs.sh and bin/matchticks.sh have run successfully.