A quick user guide for de novo transposable element (TE) library generation and TE screening, utilising the Extensive de novo TE Annotator (EDTA), DeepTE, RepeatMasker and RM_TRIPS.

ellenbell/FasTE

FasTE

FasTE is designed to be used as a quick guide for de novo transposable element (TE) library generation and subsequent TE screening (Bell et al., 2021).
Part 1: TE library generation, uses the Extensive de novo TE Annotator (EDTA, Ou et al., 2019) and DeepTE (Yan et al., 2020), which may be used in tandem for de novo TE annotation and classification.
Part 2: TE screening, demonstrates how newly made libraries can be used in conjunction with RepeatMasker (Smit et al., 2013-2015) for repeat detection and outputs parsed with RM_TRIPS prior to downstream analysis.

Dependencies

Extensive de novo TE Annotator (EDTA)
DeepTE
RepeatMasker
RM_TRIPS

Recommended installation for EDTA

Download the latest EDTA.

git clone https://github.com/oushujun/EDTA.git

Find the EDTA.yml file in the cloned folder and create the conda environment from it.

conda env create -f EDTA.yml

Recommended installation for DeepTE

Download the latest DeepTE scripts.

Install conda: https://www.anaconda.com/products/individual
conda create -n py36 python=3.6
conda activate py36
conda install tensorflow-gpu=1.14.0
conda install biopython
conda install keras=2.2.4
conda install numpy=1.16.0

If this installation has been completed the following commands will apply.

Part 1: TE Library Generation

TE Annotation with EDTA

conda activate EDTA 
perl [path to EDTA script]/EDTA.pl --genome [path to fasta file genome assembly] --species others --sensitive 1 --threads 42 
exit

 --genome [file]                  Path to genome FASTA file 
 --species [Rice|Maize|others]    In this instance we were working on a teleost fish species, so we used "others" 
 --sensitive [0|1]                Use RepeatModeler to identify remaining TEs (1) or not (0, default); we ran it with RepeatModeler 
 --threads                        Number of threads to run this script (default 4); we ran it with 42 

Other settings are available, see https://github.com/oushujun/EDTA 
Also note that EDTA doesn't like sequence headers with more than 15 characters, so some header editing may be required.
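
As a minimal sketch of the header editing mentioned above, the two shell functions below first list sequence IDs that exceed a length limit and then, if needed, rename every sequence to a short ID while keeping a lookup table. The function names (long_headers, shorten_headers), the seqN naming scheme and the file names are hypothetical, and the default limit of 15 is simply the figure quoted in this guide, so check the EDTA documentation for the limit that applies to your version.

```shell
# List FASTA sequence IDs longer than a given limit (default 15, the
# figure quoted in this guide; adjust it if your EDTA version differs).
# Usage: long_headers genome.fasta [max_length]
long_headers() {
    awk -v max="${2:-15}" '
        /^>/ {
            id = substr($1, 2)              # drop ">" and keep the first word
            if (length(id) > max) print id
        }' "$1"
}

# Rename every sequence to a short, unique ID (seq1, seq2, ...) and write
# an old-to-new lookup table so the original names can be restored later.
# Usage: shorten_headers genome.fasta renamed.fasta id_map.tsv
shorten_headers() {
    awk -v map="$3" '
        /^>/ {
            n++
            printf "%s\tseq%d\n", substr($0, 2), n > map
            print ">seq" n
            next
        }
        { print }' "$1" > "$2"
}
```

Running long_headers on the assembly first shows whether any renaming is needed at all; if it prints nothing, the genome can be passed to EDTA unchanged.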

This was tested with Linux Ubuntu (v18.04.5), 32 cores, 64 threads, 128GB RAM on a genome (size c.700MB).
On this system with this genome, EDTA ran in c.60 hours.

TE Classification with DeepTE

conda activate py36
python [path to DeepTE]/DeepTE.py -d [path to working directory] -o [path to output directory] -i [path to EDTA library FASTA] -sp M -m M
exit

-d               Pathway to a working directory where intermediate files for each step are stored
-o               Pathway to an output directory where output files are stored
-i               Input sequences that are unknown TE or DNA sequences, in this case your EDTA made TE library
-sp [P|M|F|O]    P: Plants, M: Metazoans, F: Fungi and O: Others. This was a teleost fish species so M was used
-m [P|M|F|O|U]   Directly downloads the chosen model directory; if -m_dir is used instead, users will need to download the model directory themselves

Other settings are available, see https://github.com/LiLabAtVT/DeepTE

This was tested with Linux Ubuntu (v18.04.5), 32 cores, 64 threads, 128GB RAM on an EDTA made library (size 8.6MB). On this system with this EDTA library, DeepTE ran in under 12 hours.

Header Clean-Up

The headers in the output from DeepTE still contain the attempted classifications from EDTA, which are now surplus to requirements.
For example:

Example headers from the raw EDTA library output:
>TE_00000000#Unknown
>TE_00000001#Unknown
>TE_00000002#Unknown
>TE_00000003#Unknown
>TE_00000004#Unknown

Example headers from the EDTA/DeepTE library output:
>TE_00000000#Unknown__ClassI_LTR_Gypsy
>TE_00000001#Unknown__ClassI_LTR_Copia
>TE_00000002#Unknown__ClassI_LTR_Gypsy
>TE_00000003#Unknown__ClassI_LTR_Gypsy
>TE_00000004#Unknown__ClassI_LTR_Gypsy

The following bash command simplifies the library headers, which makes downstream screening and analysis easier.

sed -e 's/\(#\).*\(__\)/\1\2/'  [path to DeepTE.fasta] > [path to cleaned up library]

Example headers in cleaned library:
>TE_00000000#__ClassI_LTR_Gypsy
>TE_00000001#__ClassI_LTR_Copia
>TE_00000002#__ClassI_LTR_Gypsy
>TE_00000003#__ClassI_LTR_Gypsy
>TE_00000004#__ClassI_LTR_Gypsy
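
A quick way to check the cleaned library is to tally how many sequences fall into each DeepTE classification. The sketch below wraps the same sed expression in a function and adds a small tally helper; both function names are hypothetical, and the tally assumes every header follows the `#__` pattern shown in the examples above.

```shell
# Strip the leftover EDTA label between "#" and "__" (the same sed
# expression as above). Reads FASTA on stdin, writes FASTA to stdout.
clean_headers() {
    sed -e 's/\(#\).*\(__\)/\1\2/'
}

# Count how many library sequences carry each DeepTE classification.
# Reads a cleaned library on stdin and prints "count classification"
# lines, most frequent first.
tally_classes() {
    grep '^>' | sed -e 's/^.*#__//' | sort | uniq -c | sort -k1,1nr
}
```

For example, `clean_headers < [path to DeepTE.fasta] | tally_classes` prints one line per classification, which makes unexpected or malformed headers easy to spot before screening.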

Part 2: Screening for TEs

TE Screening with RepeatMasker

Now that a de novo TE library has been produced it can be used in conjunction with RepeatMasker to screen for TEs.

[pathway to RepeatMasker]/RepeatMasker [pathway to the FASTA genome/transcriptome to be screened] -pa 48 -s -a -lib [pathway to the final EDTA/DeepTE FASTA library] -dir .

-pa           Number of processes to run in parallel, in this case 48
-s | -q | -qq RepeatMasker can run at different sensitivities/speeds: -q and -qq give quicker, less sensitive screens, while -s gives a slower, more sensitive screen. We used -s
-a            Writes alignments to a .align output file
-lib          Specifies the de novo repeat library to use
-dir          Directory to write output files to, in this case the current directory (.)

Other settings are available, see https://www.repeatmasker.org

RepeatMasker Output Clean-Up

RepeatMasker uses asterisks in its .out file to label repeats that overlap with one or more other hits that have a higher score. To create a list of distinct repeat hits the following bash command can be used to remove lines with an asterisk in them.

awk '!/\*/' [repeatmasker.out] > [noasterisk_repeatmasker.out]

When using de novo libraries, RepeatMasker sometimes also adds a superfluous -int suffix to the TE name, which can interfere with downstream parsing. These suffixes can be removed with the following bash command.

sed 's/-int//' [noasterisk_repeatmasker.out] > [tidy_noasterisk_repeatmasker.out]
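
The two clean-up steps can also be chained so that no intermediate file is written. The following is a sketch only: the function names (tidy_rm_out, rm_out_summary) are hypothetical, and the commands inside them are the awk and sed expressions shown above.

```shell
# Remove asterisk-flagged hits (overlapped by a higher-scoring hit) and
# strip the "-int" suffix in one pass.
# Usage: tidy_rm_out repeatmasker.out > tidy_noasterisk_repeatmasker.out
tidy_rm_out() {
    awk '!/\*/' "$1" | sed 's/-int//'
}

# Report how many lines each step would touch, as a sanity check
# before parsing.
# Usage: rm_out_summary repeatmasker.out
rm_out_summary() {
    printf 'lines with asterisk: %s\n' "$(grep -c '\*' "$1")"
    printf 'lines with -int:     %s\n' "$(grep -c -- '-int' "$1")"
}
```

Note that the sed expression removes the first -int on each line wherever it occurs, so query or scaffold names that themselves contain -int would also be altered. Running rm_out_summary first helps to confirm that the counts look plausible.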

Parsing RepeatMasker Output with RM_TRIPS

Cleaned RepeatMasker output files need further parsing before any downstream analysis of TE content. We recommend RM_TRIPS, an R-based parsing script that will: (i) remove repetitive elements not classed as TEs, (ii) merge closely positioned TE fragments of matching identity, (iii) remove duplicated isoforms (from transcriptomic data) and (iv) remove fragments shorter than 80 base pairs. It then outputs a .csv file that can be used as input for downstream applications.

To run RM_TRIPS, first download and open the RM_TRIPS script (ideally in RStudio).

Lines 10 to 13 of the R script should then be modified, as shown:

### set up inputs
i <- '[Directory for output files]' #directory where .out file is located
j <- '[tidy_noasterisk_repeatmasker.out]' #set name of file
k <- '[Directory of the final cleaned TE library]' #directory where the repeatmasker library is found (.lib/fasta file)
l <- '[cleaned_denovo_TE_lib.fasta]'  #set name of .lib file

The RM_TRIPS script may now be run through sequentially and a .csv file of parsed RepeatMasker outputs will be produced in the specified output directory.

Parsed RM_TRIPS Output

The .csv file produced by RM_TRIPS has 13 column headers with the following descriptors.

Column Header      Description
repeat_id          Name of TE with the significant hit
try_id             Name of scaffold or transcript with a TE hit
matching_repeat    Is match complement (C) of the TE sequence?
matching_class     The transposon class to which the TE belongs
reference_length   Sequence length of the TE as found in the reference library
merged_qrystart    Start of TE hit found on the transcript
merged_qryend      End of TE hit found on the transcript
mergedfraglength   Sequence length of TE hit (bp)
perc_div           % of substitutions in matching region compared to the consensus
perc_del           % of bases opposite a gap in the query sequence
perc_insert        % of bases opposite a gap in the repeat sequence
Gene               Gene name
Isoform            Isoform number

This output is now ready for use in downstream TE analysis!
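
As a first look at the parsed output, the total TE hit length per class can be summed straight from the .csv file. This is a rough sketch rather than part of RM_TRIPS: the function name bp_per_class is hypothetical, the column names matching_class and mergedfraglength are taken from the column descriptions in this section, and the file is assumed to have a header row, comma separators and no commas inside fields.

```shell
# Sum mergedfraglength per matching_class from an RM_TRIPS .csv file.
# Assumes a header row, comma separators and no commas inside fields;
# double quotes around fields (as R usually writes them) are stripped.
# Usage: bp_per_class parsed_output.csv
bp_per_class() {
    awk -F',' '
        { gsub(/"/, "") }                   # drop quotes; fields are re-split
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("matching_class" in col) || !("mergedfraglength" in col)) {
                print "expected columns not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        { bp[$(col["matching_class"])] += $(col["mergedfraglength"]) }
        END { for (c in bp) print c "\t" bp[c] }
    ' "$1" | sort
}
```

The columns are looked up by name rather than by position, so the function still works if the file carries an extra leading row-name column.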

Contact

For questions or queries please contact:

Ellen A Bell - [email protected]
Christopher L Butler - [email protected]
Martin I Taylor - [email protected]

Citations

Bell, E., Butler, C., Oliveira, C., Marburger, S., Yant, L. & Taylor, M. (2021). Transposable element annotation in non‐model species ‐ the benefits of species‐specific repeat libraries using semi‐automated EDTA and DeepTE de novo pipelines. Molecular Ecology Resources, 22(2), 823–833. doi: 10.1111/1755-0998.13489.

Ou, S., Su, W., Liao, Y., Chougule, K., Agda, J. R. A., Hellinga, A. J., … Hufford, M. B. (2019). Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biology, 20(1), 1–18. doi: 10.1186/s13059-019-1905-y.

Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-4.0. 2013-2015 http://www.repeatmasker.org.

Smit, A. F. & Hubley, R. RepeatModeler Open-1.0. Available from: http://www.repeatmasker.org.

Yan, H., Bombarely, A. & Li, S. (2020). DeepTE: a computational method for de novo classification of transposons with convolutional neural network. Bioinformatics, 36(15), 4269–4275. doi: 10.1093/bioinformatics/btaa519.
