TEGym

Build your own deep learning transposable elements classifier with ease.

What is TEGym?

TEGym is a program written in python to help people without deep learning expertise create a data driven transposable elements classifier, i.e., a model to classify transposable elements of species lacking enough data to train a classifier, using the data from a more closely related species. It automatizes preprocessing, hyperparameters testing and model training, resulting in a classifier suited for the needs of the user. Although TEGym was developed with transposable elements in mind, it can probably be used for other sequence classification tasks.

For a better explanation on how to use the program, check the manual in PDF.

TEGym is a work in progress to the date of this writing. We are trying to add more options and improvements as soon as possible.

Installation

TEGym uses python version 3.11. Preferentially use python version >= 3.10 in a python virtual enviroment or a conda enviroment. Install the required packages using:

pip install -r requirements.txt

Basic usage

The most basic usage is simple. You only need a FASTA file or a CSV file contaning the sequences and the labels. The CSV table must contain the columns named label and sequences. The FASTA header/id must be in the RepeatMasker format (sequenceID#label).

Example:

python gym.py -f my_file.fasta

or

python gym.py -c my_file.csv

The initial phase involves searching for the optimal hyperparameters to train the model based on the input dataset. Then, the model will be trained using the best combination of hyperparameters, determined by the lowest validation loss.

Running steps independently

Instead of running hyperparameter search and model training all at once, you can run the steps independently. Just call the script hyperparameters.py to generate the CSV to be used for model training later. Then, when calling gym.py use the flag -p to indicate the path to the hyperparameter’s CSV.

Example:

python hyperparameter.py -f my_file.fasta
python gym.py -f my_file.fasta -p TEGym_hyperparameters.csv

Customizing hyperparameters search

If you want to set-up other values for hyperparameter searching different than the default values used by TEGym, you just need to modify the values in the TOML file my_config_hyperparameters.toml.

Do NOT change the name of values before the = sign, just the values inside square brackets, which must be comma separated.

Other flags

You can view other flags and their usage by running:

python gym.py --help

or

python hyperparameters.py --help

usage: gym.py [-h] (-f FASTA | -c CSV) [-p HYPER] [-m METRIC] [-t TITLE] [-r RUNS] [-s SPLIT]

Train your own classifier model.

options:
  -h, --help            show this help message and exit
  -f FASTA, --fasta FASTA
                        Input fasta file with id and labels formatted as: ">seqId#Label".
  -c CSV, --csv CSV     Input CSV file containing columns "label" and "sequence".
  -p HYPER, --hyper HYPER
                        CSV file containing the hyperparametere metrics.
  -m METRIC, --metric METRIC
                        choose hyperparameters based on metric. Values are "val_loss" (default) or "val_accuracy".
  -t TITLE, --title TITLE
                        Model identifier (optional).
  -r RUNS, --runs RUNS  number of runs (tests) to find the hyperparameters.
  -s SPLIT, --split SPLIT
                        Portion of the dataset to use as validation set. The major portion is used for model training. Default=0.1.

FASTA to CSV

When using a FASTA file as input, the program will convert it to a CSV file. Depending on the size of your FASTA, it may be time-consuming. You can convert you FASTA to CSV prior to running the program using the script fasta_to_csv.py as follows:

python fasta_to_csv.py my_file.fasta

Predict

After your model is trained using gym.py, you can use it as a classifier by running the script predict.py. It has three mandatory arguments: an FASTA file with sequences to be classified, the path to the trained model and the path to the TOML file with model info.

python predict.py -f file.fasta -m my_model.keras -i my_model_info.toml

The output is a CSV file containing the classification prediction for each sequence and the classication score ranging from 0 to 1.

id	prediction	TE_score	NonTE_score
Seq01	TE	0.98	0.02
Seq02	TE	0.72	0.28
Seq03	NonTE	0.0	1.0
Seq04	NonTE	0.37	0.63
Seq05	TE	0.85	0.15

Create a negative class

If your dataset has only one class, for instance, only sequences labeled as TE, you can use the script create_negative_class.py to create another class to train your model. Use the values random or shuffled with the flag -c to create random sequences or shuffle your sequences, respectively.

Example:

python create_negative_class.py -f my_file.fasta -c shuffled.

The output is a CSV file with the prefix TDS containing your sequences and the newly created ones. Then you can use it with the main program.

Please, cite TEGym

If you have used it and have found it helpful and useful, please cite:

Minuzzi Freire da Fontoura Gomes, T. (2024). TEGym: Build your own deep learning transposable elements classifier with ease. Zenodo. https://doi.org/10.5281/zenodo.10891456

TO-DO

Generate random sequences if only one is class available.
Option to generate reverse complement.
Add example files.
Option to use k-mers.
Add GPU support.

Name		Name	Last commit message	Last commit date
Latest commit History 76 Commits
.gitignore		.gitignore
LICENSE		LICENSE
Manual_TEGym.pdf		Manual_TEGym.pdf
README.md		README.md
create_negative_class.py		create_negative_class.py
fasta_to_csv.py		fasta_to_csv.py
gym.py		gym.py
hyperparameters.py		hyperparameters.py
my_config_hyperparameters.toml		my_config_hyperparameters.toml
neural_net.py		neural_net.py
predict.py		predict.py
preprocessing.py		preprocessing.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TEGym

Table of contents

What is TEGym?

Installation

Basic usage

Running steps independently

Customizing hyperparameters search

Other flags

FASTA to CSV

Predict

Create a negative class

Please, cite TEGym

TO-DO

About

Releases 5

Packages

Languages

License

Tiago-Minuzzi/TEGym

Folders and files

Latest commit

History

Repository files navigation

TEGym

Table of contents

What is TEGym?

Installation

Basic usage

Running steps independently

Customizing hyperparameters search

Other flags

FASTA to CSV

Predict

Create a negative class

Please, cite TEGym

TO-DO

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 5

Packages 0

Languages

Packages