Databases

Some of the scripts in this pipeline need several databases, and these databases must be formatted in a specific way. The preformatted databases must be placed in a subdirectory named /mg_pipeline/databases/ (the Download steps at the end of this document use /opt/mg_pipeline/databases).

Available databases in figshare are:

16S or 18S

SILVA ver. 132 (see original source)
RDP ver. 11.5 (see original source)

16S

RDP V3-V4 ver. 11.5 Trimmed for the 16S V3 and V4 regions.

18S

Protist Ribosomal Reference database (PR2) Protista ver. 4.7.2.

We recommend getting the EzBioCloud curated database, but since it is not publicly available (although it is free for academia), we cannot distribute it. If you get it, you will have to format it accordingly; you can use our script db_reformatter.sh for this.

The databases have to have the following format:

>accession:domain;phylum;class;order;family;genus;species
agtcgggcttaggtaaaaa
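
Before using a database you reformatted yourself, it can help to check its headers against the layout above. The following is a minimal sketch, not part of the pipeline: the function name check_db_headers is hypothetical, and it assumes that the accession contains no ':' or ';' and that all seven taxonomic ranks are present and non-empty.

```shell
# check_db_headers FILE   (hypothetical helper, not shipped with the pipeline)
# Prints every header line that does NOT look like
#   >accession:domain;phylum;class;order;family;genus;species
# and returns non-zero if any such line is found.
check_db_headers() {
    local fasta="$1"
    # Assumed layout: '>' + accession (no ':' or ';'), a ':', then exactly
    # seven non-empty fields separated by ';'.
    local pattern='^>[^:;]+:[^;]+(;[^;]+){6}$'
    # Select the header lines, then keep only those that fail the pattern.
    if grep '^>' "$fasta" | grep -Ev "$pattern"; then
        return 1   # at least one malformed header was printed
    fi
    return 0
}
```

Calling check_db_headers my_db.fasta prints nothing and succeeds when every header matches; otherwise it lists the offending headers so they can be fixed before running the classifier.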

Since the RDP database is too big and consumes a lot of time and memory, we provide a version containing only the V3-V4 regions cut out of the original database. This is much faster, but it only works if your sequences come from the V3 and/or V4 regions of the 16S rRNA gene. The RDP database also had many duplicated sequences (362,293); it is now dereplicated, so only one sequence of each group of identical ones was kept.
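
To illustrate what dereplication means here, the sketch below keeps the first record of every group of identical sequences. It is an assumption-laden example written with awk, not the procedure used to build the distributed databases (which is not described in this document); the function name derep_fasta is hypothetical. vsearch also offers a --derep_fulllength command for the same purpose.

```shell
# derep_fasta FILE   (hypothetical helper, shown for illustration only)
# Writes to stdout a copy of FILE in which only the first record of each
# group of identical sequences is kept. Comparison ignores letter case, and
# wrapped sequences are written back on a single line.
derep_fasta() {
    awk '
        # Emit the buffered record if its sequence has not been seen yet.
        function flush_record() {
            if (header != "" && !(toupper(seq) in seen)) {
                seen[toupper(seq)] = 1
                print header
                print seq
            }
        }
        /^>/ { flush_record(); header = $0; seq = ""; next }
             { seq = seq $0 }    # join wrapped sequence lines
        END  { flush_record() }
    ' "$1"
}
```

Usage would be derep_fasta input.fasta > dereplicated.fasta. All unique sequences are held in memory, so for very large databases a dedicated tool is the safer choice.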

The analysis is much quicker if the databases are converted to UDB format (a UDB file is a database file that contains the sequences and a k-mer index for those sequences). These databases are considerably bigger than the fasta file used to generate them (56 MB vs. 471 MB for the SILVA-128 db); therefore, it is best to download the fasta file and then convert it. UDB databases can be used only with mg_classifier; chimera_detector only accepts fasta files.

To convert the fasta file with vsearch 2.5.0 just type: vsearch --makeudb_usearch file.fasta --output file.udb

Keep both the fasta and the udb files, since mg_classifier can use both, but chimera_detector can use only the fasta file. This inconvenience has to do with the algorithm that vsearch uses to identify chimeric sequences.
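
If you convert several databases, or update a fasta file later, a small wrapper can save time by rebuilding the UDB only when needed. This is a sketch under assumptions: the function name ensure_udb is hypothetical, it expects the database file to end in .fasta, and it relies on the vsearch command shown above being in your PATH.

```shell
# ensure_udb FASTA   (hypothetical helper, not shipped with the pipeline)
# Builds the .udb companion of FASTA with vsearch unless an up-to-date one
# already exists. The fasta file is left in place, since chimera_detector
# still needs it.
ensure_udb() {
    local fasta="$1"
    local udb="${fasta%.fasta}.udb"
    if [ ! -f "$fasta" ]; then
        echo "missing fasta file: $fasta" >&2
        return 1
    fi
    # Rebuild only when the udb is absent or older than the fasta.
    if [ ! -f "$udb" ] || [ "$fasta" -nt "$udb" ]; then
        vsearch --makeudb_usearch "$fasta" --output "$udb"
    fi
}
```

For example, ensure_udb file.fasta would create file.udb on the first call and do nothing on later calls until file.fasta changes.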

Download

Download the databases from figshare: wget https://ndownloader.figshare.com/files/9924862 The four databases are compressed into one file; since it is quite big (584 MB), it might take a while to download.
Rename the file: mv 9924862 mg_pipeline_dbs.tar.gz Figshare names the downloaded file with just a number, so it is best to give it a meaningful name.
Uncompress it: tar xzf mg_pipeline_dbs.tar.gz
If the needed directory does not exist yet, create it to house the mg_pipeline scripts and databases: sudo mkdir /opt/mg_pipeline/ /opt/mg_pipeline/databases For this you need superuser privileges through sudo. You can use a different directory, but the scripts point to this one, so you would also have to edit the scripts to point to the other directory; it is better to stick with this one.
Move the databases to their appropriate directory: sudo mv *.fasta /opt/mg_pipeline/databases
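
The unpack, create and move steps above can also be wrapped in one function. This is only a sketch: the name install_dbs is hypothetical, and because the layout inside the archive is not described here, it searches for .fasta files at any depth. The destination defaults to the directory the scripts expect.

```shell
# install_dbs ARCHIVE [DEST]   (hypothetical helper, not shipped with the pipeline)
# Unpacks ARCHIVE into a scratch directory and moves the *.fasta files it
# contains into DEST (default: /opt/mg_pipeline/databases).
install_dbs() {
    local archive="$1"
    local dest="${2:-/opt/mg_pipeline/databases}"
    local scratch
    scratch=$(mktemp -d) || return 1
    tar xzf "$archive" -C "$scratch" || return 1
    mkdir -p "$dest" || return 1
    # The archive layout is an unknown here, so look for fasta files anywhere.
    find "$scratch" -type f -name '*.fasta' -exec mv {} "$dest"/ \; || return 1
    rm -rf "$scratch"
}

# Typical use. Writing to the default destination needs root privileges, and
# sudo cannot call a shell function directly, so run the whole script as root
# or pass a destination you own:
#   wget -O mg_pipeline_dbs.tar.gz https://ndownloader.figshare.com/files/9924862
#   install_dbs mg_pipeline_dbs.tar.gz
```

Using wget -O saves the download under a meaningful name in one step, which replaces the separate rename.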
