GFFPal

A quick and flexible toolset/python library for working with GFF files.

I'm often needing to convert formats into or from GFF3.

Genometools and Aegean are useful tools for some aspects of GFF manipulation (especially tidying and conversion to GTF). But they often aren't great at handling non-specifcation compliant GFF3s (i.e. CDS attached directly to genes, "match" style GFFs, or augustus hints GFFs). Also I wanted something that I could use to select by ID (instead of tricky awk or grep -f commands).

This is intended for my own use and I'm motivated by ease of use rather than computational effeciency (hence Python). But if you have thoughts, contributions, or whatever, please raise an issue to get in touch.

Installing

At the moment things tend to just be written directly to the master github branch and I don't do much versioning.

Until that changes, i'd suggest installing from here :) If you want versioned point releases let me know.

If you have git and python3 installed you can do:

python3 -m pip install git+https://github.com/darcyabjones/gffpal.git

If you aren't using a virtualenv or conda env, i'd suggest using the --user option to avoid messing with your root packages.

python3 -m pip install --user git+https://github.com/darcyabjones/gffpal.git

There are ways to ask it to install specific commits, or to use the master zip if you don't have git installed. Create an issue if you feel it would help you.

Available scripts

All scripts are subcommands under the gffpal command. To view a list of subcommands type gffpal -h.

`gffpal hints`

Converts gffs in conventional gene formats into "hints" for the augustus gene prediction toolset. This is similar in spirit to the align2hints.pl script in BRAKER but we make no effort to add new features (e.g. start, stop), or parse the various non-standard formats. It's a simple conversion of the type columns, grouping genes using the group attribute, and adding priorities etc. To add additional features (e.g. start, stop, UTR etc), I recommend canon-gff3 from AEGeAn.

Basic usage:

gffpal hints \
  --outfile "my_hints.gff3" \
  --source "M" \
  --priority 4 \
  --group-level "mRNA" \
  --cds "CDSpart" \
  --cds-trim 12 \
  genes_to_create_hints_from.gff3

`gffpal expandcds`

A utility to fix wonky CDS features in GFF3 files. The GFF3 standard says that start and stop codons should be included in the CDS ranges but many tools exclude the stop codon.

This script simply expands the CDS by 3 bases to include the start and/or codons. Optionally if you provide the GFFs corresponding genome fasta we will print out warnings for all of the non-standard start/stop codons according to some translation table. So you can evaluate if you're actually doing the right thing :)

NOTE: at the moment the start/stop extension and extraction is really stupid. We simply bump the number from the relevant CDS section by three. This could give incorrect results for cases where there is an intron within the start/stop codon. Future versions might use exon information to handle this, but for now it's an acceptable limitation for my use case (Fixing spaln output prior to augustus hint generation).

Basic usage:

gffpal expandcds \
  --outfile "expanded.gff3" \
  --infasta "my_genome.fasta" \
  --gencode 1 \
  --warnings "non_standard.gff3" \
  "my_genome.gff3"

`gffpal rnammer2gff`

Converts RNAmmer gff2 output to gff3 with the correct "type" column set according to the Sequence Ontology.

Basic usage:

rnammer -S euk -m lsu,ssu,tsu -gff "rnammer.gff2" my.fasta
gffpal rnammer2gff -o rnammer.gff3 -k euk rnammer.gff2

`gffpal trnascan2gff`

Converts tRNAscan output to gff3 with the correct "types" column set according to the Sequence Ontology and including codon/secondary structure information in the attributes..

Basic usage:

tRNAscan-SE -E -o "trnascan.txt" -f "trnascan_ss.txt" "my.fasta"

gffpal trnascan2gff -o "trnascan.gff3" trnascan.txt trnascan_ss.txt

`gffpal exonerate2gff`

Converts exonerates weird gff2 format to a GFF3 with unique ids for each block. This is a lossy process, some of the ids that exonerate generates are lost (but the query protein is retained). It's only known to work on the protein to genome alignments, so others might have issues.

The source is always set to "exonerate". "exon" features are used as "CDSs" since protein alignments should only give that and they have the match statistics.

Basic usage:

gffpal exonerate2gff -o "exonerate.gff3" exonerate.gff2

`gffpal coord2contig`

Coord to contig attempts to find the best non-overlapping set of contigs given a nucmer coords alignment to the scaffolds of the same assembly.

Essentially it just iteratively selects alignments based on their length, % identity, contig coverage, and how much they overlap with other contig alignments. The remaining overlapping alignments are split at the intersection of the two based on the following:

If there is a stretch of N-s in the intersection, split the alignment intersection at the one closest to one of the alignment ends.
If one (and only one) of the contig alignments goes to the end of the contig, then the whole alignment intersection gets assigned to that one and the two contigs will "butt" against each other without an N-stretch.
If there is an n-stretch immediately next to the intersection, the one that is unbroken gets the region.
Otherwise, extract the sequences corresponding to the intersection for each contig, locally align it to the scaffold section, and take the highest scoring match. In the case of a tie, the left-most contig will win.

Optionally, with the --extend parameter, the program will try to extend the "contigs" to fill space between contigs or the ends of sequences, trim any contig ends at N's, and add contigs to fill in any stretches of unassigned genome that don't overlap an N-stretch longer than specified by the --nstretch parameter. Effectively this becomes similar to the old strategy of splitting the genome at N-stretches, but you won't necessarily split contigs that have good alignments crossing them and you'll catch cases of butting contigs with no Ns separating them.

Note that the SPAdes assembler doesn't give a nice 1-1 contig tiling path, some of the contigs overlap, and sometimes it will add some extra sequence between the contigs. The extend option can be useful for finding something approximating a tiling path.

To use, align your scaffolds and contigs using Nucmer and get the coords file. I used MUMmer version 4.0.0beta2.

nucmer --breaklen 400 --threads 4 \
  --delta genome.delta \
  scaffolds.fasta \
  contigs.fasta

show-coords -T -c -l -o genome.delta > genome.coords

Then you can run coord2contig like so:

gffpal coord2contig -o contig.gff3 genome.coords scaffolds.fasta contigs.fasta

You can also get the good matches from the coords file as a gff (before trying to resolve overlaps) using the --matches flag.

Mummer alignments will be given the type nucleotide_match (SO:0000347). Contigs will be given the contig type (SO:0000149), and n-stretches will have the gap type (SO:0000730).

Use gffpal coord2contig --help for full options.

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
gffpal		gffpal
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GFFPal

Installing

Available scripts

`gffpal hints`

`gffpal expandcds`

`gffpal rnammer2gff`

`gffpal trnascan2gff`

`gffpal exonerate2gff`

`gffpal coord2contig`

About

Releases

Packages

Languages

License

darcyabjones/gffpal

Folders and files

Latest commit

History

Repository files navigation

GFFPal

Installing

Available scripts

gffpal hints

gffpal expandcds

gffpal rnammer2gff

gffpal trnascan2gff

gffpal exonerate2gff

gffpal coord2contig

About

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

Packages 0

Languages

`gffpal hints`

`gffpal expandcds`

`gffpal rnammer2gff`

`gffpal trnascan2gff`

`gffpal exonerate2gff`

`gffpal coord2contig`

Packages