From 4bd8c87b408d11bcf6394ec09af46832be20dadf Mon Sep 17 00:00:00 2001
From: peterjc
Date: Fri, 29 Apr 2016 16:11:54 +0100
Subject: [PATCH] Print functions and back-tick markup for AlignIO page etc

See #47.
---
 wiki/AlignIO.md | 157 ++++++++++++++++++++++++------------------------
 1 file changed, 80 insertions(+), 77 deletions(-)

diff --git a/wiki/AlignIO.md b/wiki/AlignIO.md
index 009fd9c96..e510e28b9 100644
--- a/wiki/AlignIO.md
+++ b/wiki/AlignIO.md
@@ -6,15 +6,15 @@ tags:
  - Wiki Documentation
 ---
 
-This page describes Bio.AlignIO, a new multiple sequence Alignment
+This page describes `Bio.AlignIO`, a new multiple sequence Alignment
 Input/Output interface for BioPython 1.46 and later.
 
 In addition to the built in API documentation, there is a whole chapter
 in the [Tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html)
 on Bio.AlignIO, and although there is some overlap it is well worth
-reading in addition to this WIKI page. There is also the [API
+reading in addition to this page. There is also the [API
 documentation](http://biopython.org/DIST/docs/api/Bio.AlignIO-module.html)
-(which you can read online, or from within Python with the help
+(which you can read online, or from within Python with the `help()`
 command).
 
 Aims
@@ -23,21 +23,21 @@ Aims
 You may already be familiar with the [Bio.SeqIO](SeqIO "wikilink")
 module which deals with files containing one or more sequences
 represented as [SeqRecord](SeqRecord "wikilink") objects. The purpose of
-the SeqIO module is to provide a simple uniform interface to assorted
+the `SeqIO` module is to provide a simple uniform interface to assorted
 sequence file formats.
 
-Similarly, Bio.AlignIO deals with files containing one or more sequence
-alignments represented as Alignment objects. Bio.AlignIO uses the same
-set of functions for input and output as in Bio.SeqIO, and the same
+Similarly, `Bio.AlignIO` deals with files containing one or more sequence
+alignments represented as Alignment objects. `Bio.AlignIO` uses the same
+set of functions for input and output as in `Bio.SeqIO`, and the same
 names for the file formats supported.
 
-Note that the inclusion of Bio.AlignIO does lead to some duplication or
-choice in how to deal with some file formats. For example, Bio.AlignIO
-and Bio.Nexus will both read alignments from NEXUS files - but Bio.NEXUS
-allows more control and the use of trees.
+Note that the inclusion of `Bio.AlignIO` does lead to some duplication or
+choice in how to deal with some file formats. For example, `Bio.AlignIO`
+and `Bio.Nexus` will both read alignments from NEXUS files - but
+`Bio.NEXUS` allows more control and the use of trees.
 
 My vision is that for reading or writing sequence alignments you should
-try Bio.AlignIO as your first choice. In some cases you may only care
+try `Bio.AlignIO` as your first choice. In some cases you may only care
 about the sequences themselves, in which case try using
 [Bio.SeqIO](SeqIO "wikilink") on the alignment file directly. Unless you
 have some very specific requirements, I hope this should suffice.
@@ -98,48 +98,50 @@ Fib\_gamma](http://pfam.sanger.ac.uk/family?acc=PF09395). At the time of
 writing, this contained 14 sequences with an alignment length of 77
 amino acids, and is shown below in the PFAM or Stockholm format:
 
-    # STOCKHOLM 1.0
-    #=GS Q7ZVG7_BRARE/37-110 AC Q7ZVG7.1
-    #=GS Q6X871_SCAAQ/1-77 AC Q6X871.1
-    #=GS O02676_CROCR/1-77 AC O02676.1
-    #=GS Q6X869_TENEC/1-77 AC Q6X869.1
-    #=GS FIBG_HUMAN/40-116 AC P02679.3
-    #=GS O02689_TAPIN/1-77 AC O02689.1
-    #=GS O02688_PIG/1-77 AC O02688.1
-    #=GS O02672_9CETA/1-77 AC O02672.1
-    #=GS O02682_EQUPR/1-77 AC O02682.1
-    #=GS Q6X870_CYNVO/1-77 AC Q6X870.1
-    #=GS FIBG_RAT/40-116 AC P02680.3
-    #=GS Q6X866_DROAU/1-76 AC Q6X866.1
-    #=GS O93568_CHICK/40-116 AC O93568.1
-    #=GS FIBG_XENLA/38-114 AC P17634.1
-    Q7ZVG7_BRARE/37-110 GFGTYCPTTCGVADYLQRYKPDMDKKLDDMEQDLEEIANLTRGAQDKVVYLK---DSEAQAQKQSPDTYIKKSSNML
-    Q6X871_SCAAQ/1-77 RFGSYCPTTCGIADFLSTYQATVDKDLQTLEDILSQAENKTMEAKELVKAIQVSYLPEDPARPNRVELATKDSKKMM
-    O02676_CROCR/1-77 RFGSYCPTTCGIADFLSTYQTGVXNDLRTLEDLLSGIENKTSEAKELIKSIQVSYNPNEPPKPNTIVSATKDSKKMM
-    Q6X869_TENEC/1-77 RFGSYCPTTCGIADFLSTYQGSIDKDLQTLEDILNQVENKTXEASELIKSIQVSYNPDEPPRPNMIEGATQKSKKML
-    FIBG_HUMAN/40-116 RFGSYCPTTCGIADFLSTYQTKVDKDLQSLEDILHQVENKTSEVKQLIKAIQLTYNPDESSKPNMIDAATLKSRKML
-    #=GS FIBG_HUMAN/40-116 DR PDB; 1qvh L;14-45
-    #=GS FIBG_HUMAN/40-116 DR PDB; 1fza C;88-90
-    #=GS FIBG_HUMAN/40-116 DR PDB; 1fzb C;88-90
-    #=GS FIBG_HUMAN/40-116 DR PDB; 1fzb F;88-90
-    #=GS FIBG_HUMAN/40-116 DR PDB; 1qvh I;14-45
-    #=GS FIBG_HUMAN/40-116 DR PDB; 1fza F;88-90
-    #=GR FIBG_HUMAN/40-116 SS CCXCXBXXHHHHHHHHHHHHHHHHHHHHHHHXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX-CC
-    O02689_TAPIN/1-77 RFGSYCPTTCGIADFLSTYQTXVDKDLQVLEDILNQAENKTSEAKELIKAIQVRYKPDEPTKPGGIDSATRESKKML
-    O02688_PIG/1-77 RFGSYCPTMCGIAGFLSTYQNTVEKDLQNLEGILHQVENKTSEARELIKAIQISYNPEDLSKPDRIQSATKESKKML
-    O02672_9CETA/1-77 RFGSYCPTTCGVADFLSNYQTSVDKDLQNLEGILYQVENKTSEARELVKAIQISYNPDEPSKPNNIESATKNSKRMM
-    O02682_EQUPR/1-77 RFGSYCPTTCGIADFLSNYQTSVDKDLQDFEDILHRAENQTSEAEQLIQAIRTSYNPDEPPKTGRIDAATRESKKMM
-    Q6X870_CYNVO/1-77 RFGSYCPTTCGIADFLSTYQTKVDEDLQNLEDILYRVENRTSEAKELIKAIQVDYNPGEPPKQSVTEGATQNAKKMV
-    FIBG_RAT/40-116 RFGSYCPTTCGISDFLNSYQTDVDTDLQTLENILQRAENRTTEAKELIKAIQVYYNPDQPPKPGMIEGATQKSKKMV
-    Q6X866_DROAU/1-76 RFGSYCPTTCGIADFLNKYQTTIDQDLRHMEETLRDIDNKTAESTLLIQKIQIGQTPDPRPQ-NVIGDVTQKSRKMI
-    O93568_CHICK/40-116 RFGSYCPTTCGIADFFNKYRLTTDGELLEIEGLLQQATNSTGSIEYLIQHIKTIYPSEKQTLPQSIEQLTQKSKKII
-    #=GS O93568_CHICK/40-116 DR PDB; 1m1j F;14-90
-    #=GS O93568_CHICK/40-116 DR PDB; 1m1j C;14-90
-    #=GR O93568_CHICK/40-116 SS CCEEEEE-CCCCCCCCCCCCCHHHCCCCCHHHHHHHHHHHHHHHCCCCCCHHHHS-SSTT--SS-HHHHHHHHHHHH
-    FIBG_XENLA/38-114 RFGEYCPTTCGISDFLNRYQENVDTDLQYLENLLTQISNSTSGTTIIVEHLIDSGKKPATSPQTAIDPMTQKSKTCW
-    #=GC SS_cons CCECEEE-CCCCCCCCCCCCCHHHCCCCCHHHHHHHHHHHHHHHCCCCCCHHHHS-SSTT--SS-HHHHHHHHHHCC
-    #=GC seq_cons RFGSYCPTTCGIADFLSsYQssVDcDLQsLEsILpplEN+ToEAc-LIKuIQlsYsP--ss+PstI-uATpcSKKMl
-    //
+```
+# STOCKHOLM 1.0
+#=GS Q7ZVG7_BRARE/37-110 AC Q7ZVG7.1
+#=GS Q6X871_SCAAQ/1-77 AC Q6X871.1
+#=GS O02676_CROCR/1-77 AC O02676.1
+#=GS Q6X869_TENEC/1-77 AC Q6X869.1
+#=GS FIBG_HUMAN/40-116 AC P02679.3
+#=GS O02689_TAPIN/1-77 AC O02689.1
+#=GS O02688_PIG/1-77 AC O02688.1
+#=GS O02672_9CETA/1-77 AC O02672.1
+#=GS O02682_EQUPR/1-77 AC O02682.1
+#=GS Q6X870_CYNVO/1-77 AC Q6X870.1
+#=GS FIBG_RAT/40-116 AC P02680.3
+#=GS Q6X866_DROAU/1-76 AC Q6X866.1
+#=GS O93568_CHICK/40-116 AC O93568.1
+#=GS FIBG_XENLA/38-114 AC P17634.1
+Q7ZVG7_BRARE/37-110 GFGTYCPTTCGVADYLQRYKPDMDKKLDDMEQDLEEIANLTRGAQDKVVYLK---DSEAQAQKQSPDTYIKKSSNML
+Q6X871_SCAAQ/1-77 RFGSYCPTTCGIADFLSTYQATVDKDLQTLEDILSQAENKTMEAKELVKAIQVSYLPEDPARPNRVELATKDSKKMM
+O02676_CROCR/1-77 RFGSYCPTTCGIADFLSTYQTGVXNDLRTLEDLLSGIENKTSEAKELIKSIQVSYNPNEPPKPNTIVSATKDSKKMM
+Q6X869_TENEC/1-77 RFGSYCPTTCGIADFLSTYQGSIDKDLQTLEDILNQVENKTXEASELIKSIQVSYNPDEPPRPNMIEGATQKSKKML
+FIBG_HUMAN/40-116 RFGSYCPTTCGIADFLSTYQTKVDKDLQSLEDILHQVENKTSEVKQLIKAIQLTYNPDESSKPNMIDAATLKSRKML
+#=GS FIBG_HUMAN/40-116 DR PDB; 1qvh L;14-45
+#=GS FIBG_HUMAN/40-116 DR PDB; 1fza C;88-90
+#=GS FIBG_HUMAN/40-116 DR PDB; 1fzb C;88-90
+#=GS FIBG_HUMAN/40-116 DR PDB; 1fzb F;88-90
+#=GS FIBG_HUMAN/40-116 DR PDB; 1qvh I;14-45
+#=GS FIBG_HUMAN/40-116 DR PDB; 1fza F;88-90
+#=GR FIBG_HUMAN/40-116 SS CCXCXBXXHHHHHHHHHHHHHHHHHHHHHHHXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX-CC
+O02689_TAPIN/1-77 RFGSYCPTTCGIADFLSTYQTXVDKDLQVLEDILNQAENKTSEAKELIKAIQVRYKPDEPTKPGGIDSATRESKKML
+O02688_PIG/1-77 RFGSYCPTMCGIAGFLSTYQNTVEKDLQNLEGILHQVENKTSEARELIKAIQISYNPEDLSKPDRIQSATKESKKML
+O02672_9CETA/1-77 RFGSYCPTTCGVADFLSNYQTSVDKDLQNLEGILYQVENKTSEARELVKAIQISYNPDEPSKPNNIESATKNSKRMM
+O02682_EQUPR/1-77 RFGSYCPTTCGIADFLSNYQTSVDKDLQDFEDILHRAENQTSEAEQLIQAIRTSYNPDEPPKTGRIDAATRESKKMM
+Q6X870_CYNVO/1-77 RFGSYCPTTCGIADFLSTYQTKVDEDLQNLEDILYRVENRTSEAKELIKAIQVDYNPGEPPKQSVTEGATQNAKKMV
+FIBG_RAT/40-116 RFGSYCPTTCGISDFLNSYQTDVDTDLQTLENILQRAENRTTEAKELIKAIQVYYNPDQPPKPGMIEGATQKSKKMV
+Q6X866_DROAU/1-76 RFGSYCPTTCGIADFLNKYQTTIDQDLRHMEETLRDIDNKTAESTLLIQKIQIGQTPDPRPQ-NVIGDVTQKSRKMI
+O93568_CHICK/40-116 RFGSYCPTTCGIADFFNKYRLTTDGELLEIEGLLQQATNSTGSIEYLIQHIKTIYPSEKQTLPQSIEQLTQKSKKII
+#=GS O93568_CHICK/40-116 DR PDB; 1m1j F;14-90
+#=GS O93568_CHICK/40-116 DR PDB; 1m1j C;14-90
+#=GR O93568_CHICK/40-116 SS CCEEEEE-CCCCCCCCCCCCCHHHCCCCCHHHHHHHHHHHHHHHCCCCCCHHHHS-SSTT--SS-HHHHHHHHHHHH
+FIBG_XENLA/38-114 RFGEYCPTTCGISDFLNRYQENVDTDLQYLENLLTQISNSTSGTTIIVEHLIDSGKKPATSPQTAIDPMTQKSKTCW
+#=GC SS_cons CCECEEE-CCCCCCCCCCCCCHHHCCCCCHHHHHHHHHHHHHHHCCCCCCHHHHS-SSTT--SS-HHHHHHHHHHCC
+#=GC seq_cons RFGSYCPTTCGIADFLSsYQssVDcDLQsLEsILpplEN+ToEAc-LIKuIQlsYsP--ss+PstI-uATpcSKKMl
+//
+```
 
 You will notice that there is plenty of annotation information here,
 including accession numbers for each sequence and also some PDB database
@@ -149,53 +151,54 @@ chick fibrinogen proteins.
 This file contains a single alignment, so we can use the
 `Bio.AlignIO.read()` function to load it in Biopython. Let's assume
 you have downloaded this alignment from Sanger, or have copy and pasted
-the text above, and saved this as a file called `PF09395\_seed.sth` on
+the text above, and saved this as a file called `PF09395_seed.sth` on
 your computer. Then in python:
 
 ``` python
 from Bio import AlignIO
 alignment = AlignIO.read(open("PF09395_seed.sth"), "stockholm")
-print "Alignment length %i" % alignment.get_alignment_length()
+print("Alignment length %i" % alignment.get_alignment_length())
 for record in alignment :
-    print record.seq, record.id
+    print(record.seq + " " + record.id)
 ```
 
 That should give:
 
-    Alignment length 77
-    GFGTYCPTTCGVADYLQRYKPDMDKKLDDMEQDLEEIANLTRGAQDKVVYLK---DSEAQAQKQSPDTYIKKSSNML Q7ZVG7_BRARE/37-110
-    RFGSYCPTTCGIADFLSTYQATVDKDLQTLEDILSQAENKTMEAKELVKAIQVSYLPEDPARPNRVELATKDSKKMM Q6X871_SCAAQ/1-77
-    RFGSYCPTTCGIADFLSTYQTGVXNDLRTLEDLLSGIENKTSEAKELIKSIQVSYNPNEPPKPNTIVSATKDSKKMM O02676_CROCR/1-77
-    RFGSYCPTTCGIADFLSTYQGSIDKDLQTLEDILNQVENKTXEASELIKSIQVSYNPDEPPRPNMIEGATQKSKKML Q6X869_TENEC/1-77
-    RFGSYCPTTCGIADFLSTYQTKVDKDLQSLEDILHQVENKTSEVKQLIKAIQLTYNPDESSKPNMIDAATLKSRKML FIBG_HUMAN/40-116
-    RFGSYCPTTCGIADFLSTYQTXVDKDLQVLEDILNQAENKTSEAKELIKAIQVRYKPDEPTKPGGIDSATRESKKML O02689_TAPIN/1-77
-    RFGSYCPTMCGIAGFLSTYQNTVEKDLQNLEGILHQVENKTSEARELIKAIQISYNPEDLSKPDRIQSATKESKKML O02688_PIG/1-77
-    RFGSYCPTTCGVADFLSNYQTSVDKDLQNLEGILYQVENKTSEARELVKAIQISYNPDEPSKPNNIESATKNSKRMM O02672_9CETA/1-77
-    RFGSYCPTTCGIADFLSNYQTSVDKDLQDFEDILHRAENQTSEAEQLIQAIRTSYNPDEPPKTGRIDAATRESKKMM O02682_EQUPR/1-77
-    RFGSYCPTTCGIADFLSTYQTKVDEDLQNLEDILYRVENRTSEAKELIKAIQVDYNPGEPPKQSVTEGATQNAKKMV Q6X870_CYNVO/1-77
-    RFGSYCPTTCGISDFLNSYQTDVDTDLQTLENILQRAENRTTEAKELIKAIQVYYNPDQPPKPGMIEGATQKSKKMV FIBG_RAT/40-116
-    RFGSYCPTTCGIADFLNKYQTTIDQDLRHMEETLRDIDNKTAESTLLIQKIQIGQTPDPRPQ-NVIGDVTQKSRKMI Q6X866_DROAU/1-76
-    RFGSYCPTTCGIADFFNKYRLTTDGELLEIEGLLQQATNSTGSIEYLIQHIKTIYPSEKQTLPQSIEQLTQKSKKII O93568_CHICK/40-116
-    RFGEYCPTTCGISDFLNRYQENVDTDLQYLENLLTQISNSTSGTTIIVEHLIDSGKKPATSPQTAIDPMTQKSKTCW FIBG_XENLA/38-114
+```
+Alignment length 77
+GFGTYCPTTCGVADYLQRYKPDMDKKLDDMEQDLEEIANLTRGAQDKVVYLK---DSEAQAQKQSPDTYIKKSSNML Q7ZVG7_BRARE/37-110
+RFGSYCPTTCGIADFLSTYQATVDKDLQTLEDILSQAENKTMEAKELVKAIQVSYLPEDPARPNRVELATKDSKKMM Q6X871_SCAAQ/1-77
+RFGSYCPTTCGIADFLSTYQTGVXNDLRTLEDLLSGIENKTSEAKELIKSIQVSYNPNEPPKPNTIVSATKDSKKMM O02676_CROCR/1-77
+RFGSYCPTTCGIADFLSTYQGSIDKDLQTLEDILNQVENKTXEASELIKSIQVSYNPDEPPRPNMIEGATQKSKKML Q6X869_TENEC/1-77
+RFGSYCPTTCGIADFLSTYQTKVDKDLQSLEDILHQVENKTSEVKQLIKAIQLTYNPDESSKPNMIDAATLKSRKML FIBG_HUMAN/40-116
+RFGSYCPTTCGIADFLSTYQTXVDKDLQVLEDILNQAENKTSEAKELIKAIQVRYKPDEPTKPGGIDSATRESKKML O02689_TAPIN/1-77
+RFGSYCPTMCGIAGFLSTYQNTVEKDLQNLEGILHQVENKTSEARELIKAIQISYNPEDLSKPDRIQSATKESKKML O02688_PIG/1-77
+RFGSYCPTTCGVADFLSNYQTSVDKDLQNLEGILYQVENKTSEARELVKAIQISYNPDEPSKPNNIESATKNSKRMM O02672_9CETA/1-77
+RFGSYCPTTCGIADFLSNYQTSVDKDLQDFEDILHRAENQTSEAEQLIQAIRTSYNPDEPPKTGRIDAATRESKKMM O02682_EQUPR/1-77
+RFGSYCPTTCGIADFLSTYQTKVDEDLQNLEDILYRVENRTSEAKELIKAIQVDYNPGEPPKQSVTEGATQNAKKMV Q6X870_CYNVO/1-77
+RFGSYCPTTCGISDFLNSYQTDVDTDLQTLENILQRAENRTTEAKELIKAIQVYYNPDQPPKPGMIEGATQKSKKMV FIBG_RAT/40-116
+RFGSYCPTTCGIADFLNKYQTTIDQDLRHMEETLRDIDNKTAESTLLIQKIQIGQTPDPRPQ-NVIGDVTQKSRKMI Q6X866_DROAU/1-76
+RFGSYCPTTCGIADFFNKYRLTTDGELLEIEGLLQQATNSTGSIEYLIQHIKTIYPSEKQTLPQSIEQLTQKSKKII O93568_CHICK/40-116
+RFGEYCPTTCGISDFLNRYQENVDTDLQYLENLLTQISNSTSGTTIIVEHLIDSGKKPATSPQTAIDPMTQKSKTCW FIBG_XENLA/38-114
+```
 
 Alignment Output
 ----------------
 
 As in [Bio.SeqIO](SeqIO "wikilink"), there is a single output function
-**Bio.AlignIO.write()**. This takes three arguments: some alignments, a
+`Bio.AlignIO.write()`. This takes three arguments: some alignments, a
 file handle to write to, and the format to use.
 
-As of Biopython 1.48, the alignment object acquired a **format()**
+As of Biopython 1.48, the alignment object acquired a `.format()`
 method to give a string containing the alignment in the specified file
 format, e.g.
 
 ``` python
 AlignIO.read(open("PF09395_seed.sth"), "stockholm")
-print alignment.format("fasta")
+print(alignment.format("fasta"))
 ```
 
-This wiki section needs to be filled out, so in the short term please
-refer to the Bio.AlignIO chapter in the Tutorial.
+Please refer to the Bio.AlignIO chapter in the Tutorial for more details.
 
 File Format Conversion
 ----------------------