"""Sequence input/output as SeqRecord objects.

The Bio.SeqIO module is also documented by a whole chapter in the Biopython
tutorial, and by the wiki http://biopython.org/wiki/SeqIO on the website.
The approach is designed to be similar to the BioPerl SeqIO design.

Input
=====
The main function is Bio.SeqIO.parse(...) which takes an input file handle
and a format string. This returns an iterator giving SeqRecord objects.

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    for record in SeqIO.parse(handle, "fasta") :
        print record
    handle.close()

Note that the parse() function will invoke the relevant parser for the
format with its default settings. You may want more control, in which case
you need to create a format-specific sequence iterator directly.

For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records,
using a sequence iterator can save you a lot of memory (RAM). There is
less benefit for interlaced file formats (e.g. most multiple alignment file
formats). However, an iterator only lets you access the records one by one.

If you want random access to the records by number, turn this into a list:

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    records = list(SeqIO.parse(handle, "fasta"))
    handle.close()
    print records[0]

If you want random access to the records by a key such as the record id,
turn the iterator into a dictionary:

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
    handle.close()
    print record_dict["gi:12345678"]
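
The dictionary step can be sketched without Biopython. This is only an
illustration and not the Bio.SeqIO implementation: SimpleRecord and
records_to_dict are hypothetical names standing in for SeqRecord and
to_dict. The sketch assumes the record id is the default key and that a
repeated key is treated as an error.

```python
from collections import namedtuple

# Hypothetical stand-in for SeqRecord, so no Biopython install is needed.
SimpleRecord = namedtuple("SimpleRecord", ["id", "description"])

def records_to_dict(records, key_function=None):
    # With no key_function, fall back to the record id as the key.
    if key_function is None:
        key_function = lambda rec: rec.id
    result = {}
    for record in records:
        key = key_function(record)
        if key in result:
            # A repeated key would silently hide an earlier record.
            raise ValueError("Duplicate key '%s'" % key)
        result[key] = record
    return result

records = [SimpleRecord("gi:1", "alpha first"),
           SimpleRecord("gi:2", "beta second")]
by_id = records_to_dict(iter(records))
by_word = records_to_dict(records, lambda rec: rec.description.split()[0])
```

An iterator and a list work equally well here, because the records are
only looped over once.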

If you expect your file to contain one-and-only-one record, then we provide
the following 'helper' function which will return a single SeqRecord, or
raise an exception if there are no records or more than one record:

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    record = SeqIO.read(handle, "fasta")
    handle.close()
    print record

This style is useful when you expect a single record only (and would
consider multiple records an error). For example, when dealing with GenBank
files for bacterial genomes or chromosomes, there is normally only a single
record. Alternatively, use this with a handle when downloading a single
record from the internet.

However, if you just want the first record from a file containing multiple
records, use the iterator's next() method:

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    record = SeqIO.parse(handle, "fasta").next()
    handle.close()
    print record

The above code will work as long as the file contains at least one record.
Note that if there is more than one record, the remaining records will be
silently ignored.
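
The difference between "first record" and "exactly one record" can be
shown on any iterator. The following is a rough sketch, not the Bio.SeqIO
code: first_record and single_record are hypothetical helper names, and
plain strings stand in for SeqRecord objects.

```python
def first_record(records):
    # Take the first record only; anything after it is never looked at.
    # An empty input raises StopIteration.
    return next(iter(records))

def single_record(records):
    # Insist on exactly one record, raising ValueError otherwise.
    iterator = iter(records)
    missing = object()
    first = next(iterator, missing)
    if first is missing:
        raise ValueError("No records found")
    if next(iterator, missing) is not missing:
        raise ValueError("More than one record found")
    return first

many = ["rec1", "rec2", "rec3"]
first = first_record(many)        # extra records are ignored
only = single_record(["rec1"])    # exactly one record, so this succeeds
```

Calling single_record(many) would raise ValueError, which is the
behaviour to prefer when extra records indicate a problem with the input.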

Input - Alignments
==================
You can read in alignment files as Alignment objects using Bio.AlignIO.
Alternatively, reading in an alignment file format via Bio.SeqIO will give
you a SeqRecord for each row of each alignment.

Output
======
Use the function Bio.SeqIO.write(...), which takes a complete set of
SeqRecord objects (either as a list, or an iterator), an output file handle
and of course the file format.

    from Bio import SeqIO
    records = ...
    handle = open("example.faa", "w")
    SeqIO.write(records, handle, "fasta")
    handle.close()

In general, you are expected to call this function once (with all your
records) and then close the file handle.

Output - Advanced
=================
The effect of calling write() multiple times on a single file will vary
depending on the file format, and is best avoided unless you have a strong
reason to do so.

Trying this for certain alignment formats (e.g. phylip, clustal, stockholm)
would have the effect of concatenating several multiple sequence alignments
together. Such files are created by the PHYLIP suite of programs for
bootstrap analysis.

For sequential file formats (e.g. fasta, genbank) each "record block" holds
a single sequence. For these files it would probably be safe to call
write() multiple times.

File Formats
============
When specifying formats, use lowercase strings.
"""

"""
FAO Biopython Developers
========================
The way I envision this SeqIO system working is that for any sequence file
format we have an iterator that returns SeqRecord objects.

This also applies to interlaced file formats (like clustal) where the file
cannot be read record by record. You should still return an iterator!

These file format specific sequence iterators may be implemented as:
* Classes which take a handle for __init__ and provide the __iter__ method
* Functions that take a handle, and return an iterator object
* Generator functions that take a handle, and yield SeqRecord objects
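
The third option, a generator function, might look roughly like this.
It is a simplified sketch and not one of the real Bio.SeqIO parsers:
SimpleRecord, _make_record and simple_fasta_iterator are hypothetical
names, SimpleRecord stands in for SeqRecord, and only a bare-bones
FASTA-like layout is assumed.

```python
from collections import namedtuple

# Hypothetical stand-in for SeqRecord, so no Biopython install is needed.
SimpleRecord = namedtuple("SimpleRecord", ["id", "description", "seq"])

def _make_record(title, chunks):
    # Use the first word of the title line as the id.
    words = title.split()
    record_id = words[0] if words else ""
    return SimpleRecord(record_id, title, "".join(chunks))

def simple_fasta_iterator(handle):
    # Generator function: yields one record at a time, so only the
    # current record is held in memory.
    title = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                yield _make_record(title, chunks)
            title = line[1:].strip()
            chunks = []
        elif title is not None:
            # Sequence line: drop any spaces used for column grouping.
            chunks.append(line.replace(" ", ""))
    if title is not None:
        yield _make_record(title, chunks)

# Any iterable of lines works as the handle, e.g. an open file or a list.
example_lines = [">seq1 first example", "ACGT ACGT", "",
                 ">seq2", "TTGA"]
example_records = list(simple_fasta_iterator(example_lines))
```

Because the function only loops over the handle, the caller decides
whether to consume the records one by one or collect them with list().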

It is then trivial to turn this iterator into a list of SeqRecord objects,
an in-memory dictionary, or a multiple sequence alignment object.

For building the dictionary by default the id property of each SeqRecord is
used as the key. You should always populate the id property, and it should
be unique. For some file formats the accession number is a good choice.

When adding a new file format, please use the same lower case format name
as BioPerl, or if they have not defined one, try the names used by EMBOSS.

See also http://biopython.org/wiki/SeqIO_dev

--Peter
"""

import os
from StringIO import StringIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align.Generic import Alignment

import AceIO
import FastaIO
import IgIO
import InsdcIO
import PhdIO
import SwissIO

_FormatToIterator = {"fasta" : FastaIO.FastaIterator,
                     "genbank" : InsdcIO.GenBankIterator,
                     "genbank-cds" : InsdcIO.GenBankCdsFeatureIterator,
                     "embl" : InsdcIO.EmblIterator,
                     "embl-cds" : InsdcIO.EmblCdsFeatureIterator,
                     "ig" : IgIO.IgIterator,
                     "swiss" : SwissIO.SwissIterator,
                     "phd" : PhdIO.PhdIterator,
                     "ace" : AceIO.AceIterator,
                     }

_FormatToWriter = {"fasta" : FastaIO.FastaWriter,
                   }

def write(sequences, handle, format) :
    """Write complete set of sequences to a file.

    sequences - A list (or iterator) of SeqRecord objects
    handle    - File handle object to write to
    format    - What format to use.

    You should close the handle after calling this function.

    There is no return value.
    """
    from Bio import AlignIO

    #Check the arguments so that mistakes give a helpful error message
    if isinstance(handle, basestring) :
        raise TypeError("Need a file handle, not a string (i.e. not a filename)")
    if not isinstance(format, basestring) :
        raise TypeError("Need a string for the file format (lower case)")
    if not format :
        raise ValueError("Format required (lower case string)")
    if format != format.lower() :
        raise ValueError("Format string '%s' should be lower case" % format)
    if isinstance(sequences,SeqRecord):
        raise ValueError("Use a SeqRecord list/iterator, not just a single SeqRecord")

    if format in _FormatToWriter :
        #Use the writer class registered for this format
        writer_class = _FormatToWriter[format]
        writer_class(handle).write_file(sequences)
    elif format in AlignIO._FormatToIterator :
        #Turn the records into a single alignment and let Bio.AlignIO write it
        AlignIO.write([to_alignment(sequences)], handle, format)
    else :
        raise ValueError("Unknown format '%s'" % format)

    return

def parse(handle, format) :
    """Turns a sequence file into an iterator returning SeqRecords.

    handle   - handle to the file.
    format   - string describing the file format.

    If you have the file name in a string 'filename', use:

        from Bio import SeqIO
        my_iterator = SeqIO.parse(open(filename,"rU"), format)

    If you have a string 'data' containing the file contents, use:

        from Bio import SeqIO
        from StringIO import StringIO
        my_iterator = SeqIO.parse(StringIO(data), format)

    Note that the file will be parsed with default settings,
    which may result in a generic alphabet or other non-ideal
    settings. For more control, you must use the format-specific
    iterator directly...

    Use the Bio.SeqIO.read(handle, format) function when you expect
    a single record only.
    """
    from Bio import AlignIO

    #Check the arguments so that mistakes give a helpful error message
    if isinstance(handle, basestring) :
        raise TypeError("Need a file handle, not a string (i.e. not a filename)")
    if not isinstance(format, basestring) :
        raise TypeError("Need a string for the file format (lower case)")
    if not format :
        raise ValueError("Format required (lower case string)")
    if format != format.lower() :
        raise ValueError("Format string '%s' should be lower case" % format)

    if format in _FormatToIterator :
        #Use the sequence iterator registered for this format
        iterator_generator = _FormatToIterator[format]
        return iterator_generator(handle)
    elif format in AlignIO._FormatToIterator :
        #Alignment format, so read it via Bio.AlignIO
        return _iterate_via_AlignIO(handle, format)
    else :
        raise ValueError("Unknown format '%s'" % format)

def read(handle, format) :
    """Turns a sequence file into a single SeqRecord.

    handle   - handle to the file.
    format   - string describing the file format.

    If the handle contains no records, or more than one record,
    an exception is raised. For example, using a GenBank file
    containing one record:

        from Bio import SeqIO
        record = SeqIO.read(open("example.gbk"), "genbank")

    If however you want the first record from a file containing
    multiple records, this function would raise an exception.
    Instead use:

        from Bio import SeqIO
        record = SeqIO.parse(open("example.gbk"), "genbank").next()

    Use the Bio.SeqIO.parse(handle, format) function if you want
    to read multiple records from the handle.
    """
    iterator = parse(handle, format)
    try :
        first = iterator.next()
    except StopIteration :
        first = None
    if first is None :
        raise ValueError("No records found in handle")
    try :
        second = iterator.next()
    except StopIteration :
        second = None
    if second is not None :
        raise ValueError("More than one record found in handle")
    return first

def to_dict(sequences, key_function=None) :
    """Turns a sequence iterator or list into a dictionary.

    sequences    - An iterator that returns SeqRecord objects,
                   or simply a list of SeqRecord objects.
    key_function - Optional function which when given a SeqRecord
                   returns a unique string for the dictionary key.

    e.g. key_function = lambda rec : rec.name
    or,  key_function = lambda rec : rec.description.split()[0]

    If key_function is omitted then record.id is used, on the
    assumption that the record objects returned are SeqRecords
    with a unique id field.

    If there are duplicate keys, an error is raised.

    Example usage:

        from Bio import SeqIO
        filename = "example.fasta"
        d = SeqIO.to_dict(SeqIO.parse(open(filename, "rU"), "fasta"),
                          key_function = lambda rec : rec.description.split()[0])
        print len(d)
        print d.keys()[0:10]
        key = d.keys()[0]
        print d[key]
    """
    if key_function is None :
        key_function = lambda rec : rec.id

    d = dict()
    for record in sequences :
        key = key_function(record)
        if key in d :
            raise ValueError("Duplicate key '%s'" % key)
        d[key] = record
    return d

def to_alignment(sequences, alphabet=None, strict=True) :
    """Returns a multiple sequence alignment (OBSOLETE).

    sequences - An iterator that returns SeqRecord objects,
                or simply a list of SeqRecord objects.
                All the record sequences must be the same length.
    alphabet  - Optional alphabet. Strongly recommended.
    strict    - Optional, defaults to True. Should error checking
                be done?

    Using this function is now discouraged. Rather than doing this:

        from Bio import SeqIO
        alignment = SeqIO.to_alignment(SeqIO.parse(handle, format))

    You are now encouraged to use Bio.AlignIO instead, e.g.

        from Bio import AlignIO
        alignment = AlignIO.read(handle, format)
    """
    from Bio.Alphabet import Alphabet, Gapped, generic_alphabet
    if alphabet is None :
        alphabet = Gapped(generic_alphabet)

    if not (isinstance(alphabet, Alphabet) or isinstance(alphabet, Gapped)) :
        raise ValueError("Invalid alignment alphabet")

    alignment_length = None
    alignment = Alignment(alphabet)
    for record in sequences :
        if strict :
            if alignment_length is None :
                alignment_length = len(record.seq)
            elif alignment_length != len(record.seq) :
                raise ValueError("Sequences must all be the same length")

            assert isinstance(record.seq.alphabet, Alphabet) \
            or isinstance(record.seq.alphabet, Gapped), \
                "Sequence does not have a valid alphabet"

            if isinstance(record.seq.alphabet, Alphabet) \
            and isinstance(alphabet, Alphabet) :
                #Neither the sequence nor the alignment alphabet is gapped
                if not isinstance(record.seq.alphabet, alphabet.__class__) :
                    raise ValueError("Incompatible sequence alphabet " \
                                     + "%s for %s alignment" \
                                     % (record.seq.alphabet, alphabet))
            elif isinstance(record.seq.alphabet, Gapped) \
            and isinstance(alphabet, Alphabet) :
                raise ValueError("Sequence has a gapped alphabet, alignment does not")
            elif isinstance(record.seq.alphabet, Alphabet) \
            and isinstance(alphabet, Gapped) :
                #The sequence is not gapped, but the alignment is
                if not isinstance(record.seq.alphabet, alphabet.alphabet.__class__) :
                    raise ValueError("Incompatible sequence alphabet " \
                                     + "%s for %s alignment" \
                                     % (record.seq.alphabet, alphabet))
            else :
                #Both the sequence and the alignment alphabets are gapped
                if not isinstance(record.seq.alphabet, alphabet.__class__) :
                    raise ValueError("Incompatible sequence alphabet " \
                                     + "%s for %s alignment" \
                                     % (record.seq.alphabet, alphabet))
                if record.seq.alphabet.gap_char != alphabet.gap_char :
                    raise ValueError("Sequence gap characters <> alignment gap char")

        alignment._records.append(record)
    return alignment

452 if __name__ == "__main__" :
453
454 from Bio.Alphabet import generic_nucleotide
455 from sets import Set
456
457
458
459 faa_example = \
460 """>V_Harveyi_PATH
461 mknwikvava aialsaatvq aatevkvgms gryfpftfvk qdklqgfevd mwdeigkrnd
462 ykieyvtanf sglfglletg ridtisnqit mtdarkakyl fadpyvvdga qitvrkgnds
463 iqgvedlagk tvavnlgsnf eqllrdydkd gkiniktydt giehdvalgr adafimdrls
464 alelikktgl plqlagepfe tiqnawpfvd nekgrklqae vnkalaemra dgtvekisvk
465 wfgaditk
466 >B_subtilis_YXEM
467 mkmkkwtvlv vaallavlsa cgngnssske ddnvlhvgat gqsypfayke ngkltgfdve
468 vmeavakkid mkldwkllef sglmgelqtg kldtisnqva vtderketyn ftkpyayagt
469 qivvkkdntd iksvddlkgk tvaavlgsnh aknleskdpd kkiniktyet qegtlkdvay
470 grvdayvnsr tvliaqikkt glplklagdp ivyeqvafpf akddahdklr kkvnkaldel
471 rkdgtlkkls ekyfneditv eqkh
472 >FLIY_ECOLI
473 mklahlgrqa lmgvmavalv agmsvksfad egllnkvker gtllvglegt yppfsfqgdd
474 gkltgfevef aqqlakhlgv easlkptkwd gmlasldskr idvvinqvti sderkkkydf
475 stpytisgiq alvkkgnegt iktaddlkgk kvgvglgtny eewlrqnvqg vdvrtydddp
476 tkyqdlrvgr idailvdrla aldlvkktnd tlavtgeafs rqesgvalrk gnedllkavn
477 daiaemqkdg tlqalsekwf gadvtk
478 >Deinococcus_radiodurans
479 mkksllslkl sgllvpsvla lslsacssps stlnqgtlki amegtyppft skneqgelvg
480 fdvdiakava qklnlkpefv ltewsgilag lqankydviv nqvgitperq nsigfsqpya
481 ysrpeiivak nntfnpqsla dlkgkrvgst lgsnyekqli dtgdikivty pgapeiladl
482 vagridaayn drlvvnyiin dqklpvrgag qigdaapvgi alkkgnsalk dqidkaltem
483 rsdgtfekis qkwfgqdvgq p
484 >B_subtilis_GlnH_homo_YCKK
485 mkkallalfm vvsiaalaac gagndnqskd nakdgdlwas ikkkgvltvg tegtyepfty
486 hdkdtdkltg ydveviteva krlglkvdfk etqwgsmfag lnskrfdvva nqvgktdred
487 kydfsdkytt sravvvtkkd nndikseadv kgktsaqslt snynklatna gakvegvegm
488 aqalqmiqqa rvdmtyndkl avlnylktsg nknvkiafet gepqstyftf rkgsgevvdq
489 vnkalkemke dgtlskiskk wfgedvsk
490 >YA80_HAEIN
491 mkkllfttal ltgaiafstf shageiadrv ektktllvgt egtyapftfh dksgkltgfd
492 vevirkvaek lglkvefket qwdamyagln akrfdvianq tnpsperlkk ysfttpynys
493 ggvivtkssd nsiksfedlk grksaqsats nwgkdakaag aqilvvdgla qslelikqgr
494 aeatindkla vldyfkqhpn sglkiaydrg dktptafafl qgedalitkf nqvlealrqd
495 gtlkqisiew fgyditq
496 >E_coli_GlnH
497 mksvlkvsla altlafavss haadkklvva tdtafvpfef kqgdkyvgfd vdlwaaiake
498 lkldyelkpm dfsgiipalq tknvdlalag ititderkka idfsdgyyks gllvmvkann
499 ndvksvkdld gkvvavksgt gsvdyakani ktkdlrqfpn idnaymelgt nradavlhdt
500 pnilyfikta gngqfkavgd sleaqqygia fpkgsdelrd kvngalktlr engtyneiyk
501 kwfgtepk
502 >HISJ_E_COLI
503 mkklvlslsl vlafssataa faaipqniri gtdptyapfe sknsqgelvg fdidlakelc
504 krintqctfv enpldalips lkakkidaim sslsitekrq qeiaftdkly aadsrlvvak
505 nsdiqptves lkgkrvgvlq gttqetfgne hwapkgieiv syqgqdniys dltagridaa
506 fqdevaaseg flkqpvgkdy kfggpsvkde klfgvgtgmg lrkednelre alnkafaemr
507 adgtyeklak kyfdfdvygg"""
508
509
510 aln_example = \
511 """CLUSTAL X (1.83) multiple sequence alignment
512
513
514 V_Harveyi_PATH --MKNWIKVAVAAIA--LSAA------------------TVQAATEVKVG
515 B_subtilis_YXEM MKMKKWTVLVVAALLAVLSACG------------NGNSSSKEDDNVLHVG
516 B_subtilis_GlnH_homo_YCKK MKKALLALFMVVSIAALAACGAGNDNQSKDNAKDGDLWASIKKKGVLTVG
517 YA80_HAEIN MKKLLFTTALLTGAIAFSTF-----------SHAGEIADRVEKTKTLLVG
518 FLIY_ECOLI MKLAHLGRQALMGVMAVALVAG---MSVKSFADEG-LLNKVKERGTLLVG
519 E_coli_GlnH --MKSVLKVSLAALTLAFAVS------------------SHAADKKLVVA
520 Deinococcus_radiodurans -MKKSLLSLKLSGLLVPSVLALS--------LSACSSPSSTLNQGTLKIA
521 HISJ_E_COLI MKKLVLSLSLVLAFSSATAAF-------------------AAIPQNIRIG
522 : . : :.
523
524 V_Harveyi_PATH MSGRYFPFTFVKQ--DKLQGFEVDMWDEIGKRNDYKIEYVTANFSGLFGL
525 B_subtilis_YXEM ATGQSYPFAYKEN--GKLTGFDVEVMEAVAKKIDMKLDWKLLEFSGLMGE
526 B_subtilis_GlnH_homo_YCKK TEGTYEPFTYHDKDTDKLTGYDVEVITEVAKRLGLKVDFKETQWGSMFAG
527 YA80_HAEIN TEGTYAPFTFHDK-SGKLTGFDVEVIRKVAEKLGLKVEFKETQWDAMYAG
528 FLIY_ECOLI LEGTYPPFSFQGD-DGKLTGFEVEFAQQLAKHLGVEASLKPTKWDGMLAS
529 E_coli_GlnH TDTAFVPFEFKQG--DKYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPA
530 Deinococcus_radiodurans MEGTYPPFTSKNE-QGELVGFDVDIAKAVAQKLNLKPEFVLTEWSGILAG
531 HISJ_E_COLI TDPTYAPFESKNS-QGELVGFDIDLAKELCKRINTQCTFVENPLDALIPS
532 ** .: *::::. : :. . ..:
533
534 V_Harveyi_PATH LETGRIDTISNQITMTDARKAKYLFADPYVVDG-AQITVRKGNDSIQGVE
535 B_subtilis_YXEM LQTGKLDTISNQVAVTDERKETYNFTKPYAYAG-TQIVVKKDNTDIKSVD
536 B_subtilis_GlnH_homo_YCKK LNSKRFDVVANQVG-KTDREDKYDFSDKYTTSR-AVVVTKKDNNDIKSEA
537 YA80_HAEIN LNAKRFDVIANQTNPSPERLKKYSFTTPYNYSG-GVIVTKSSDNSIKSFE
538 FLIY_ECOLI LDSKRIDVVINQVTISDERKKKYDFSTPYTISGIQALVKKGNEGTIKTAD
539 E_coli_GlnH LQTKNVDLALAGITITDERKKAIDFSDGYYKSG-LLVMVKANNNDVKSVK
540 Deinococcus_radiodurans LQANKYDVIVNQVGITPERQNSIGFSQPYAYSRPEIIVAKNNTFNPQSLA
541 HISJ_E_COLI LKAKKIDAIMSSLSITEKRQQEIAFTDKLYAADSRLVVAKNSDIQP-TVE
542 *.: . * . * *: : : .
543
544 V_Harveyi_PATH DLAGKTVAVNLGSNFEQLLRDYDKDGKINIKTYDT--GIEHDVALGRADA
545 B_subtilis_YXEM DLKGKTVAAVLGSNHAKNLESKDPDKKINIKTYETQEGTLKDVAYGRVDA
546 B_subtilis_GlnH_homo_YCKK DVKGKTSAQSLTSNYNKLATN----AGAKVEGVEGMAQALQMIQQARVDM
547 YA80_HAEIN DLKGRKSAQSATSNWGKDAKA----AGAQILVVDGLAQSLELIKQGRAEA
548 FLIY_ECOLI DLKGKKVGVGLGTNYEEWLRQNV--QGVDVRTYDDDPTKYQDLRVGRIDA
549 E_coli_GlnH DLDGKVVAVKSGTGSVDYAKAN--IKTKDLRQFPNIDNAYMELGTNRADA
550 Deinococcus_radiodurans DLKGKRVGSTLGSNYEKQLIDTG---DIKIVTYPGAPEILADLVAGRIDA
551 HISJ_E_COLI SLKGKRVGVLQGTTQETFGNEHWAPKGIEIVSYQGQDNIYSDLTAGRIDA
552 .: *: . : .: : * :
553
554 V_Harveyi_PATH FIMDRLSALE-LIKKT-GLPLQLAGEPFETI-----QNAWPFVDNEKGRK
555 B_subtilis_YXEM YVNSRTVLIA-QIKKT-GLPLKLAGDPIVYE-----QVAFPFAKDDAHDK
556 B_subtilis_GlnH_homo_YCKK TYNDKLAVLN-YLKTSGNKNVKIAFETGEPQ-----STYFTFRKGS--GE
557 YA80_HAEIN TINDKLAVLD-YFKQHPNSGLKIAYDRGDKT-----PTAFAFLQGE--DA
558 FLIY_ECOLI ILVDRLAALD-LVKKT-NDTLAVTGEAFSRQ-----ESGVALRKGN--ED
559 E_coli_GlnH VLHDTPNILY-FIKTAGNGQFKAVGDSLEAQ-----QYGIAFPKGS--DE
560 Deinococcus_radiodurans AYNDRLVVNY-IINDQ-KLPVRGAGQIGDAA-----PVGIALKKGN--SA
561 HISJ_E_COLI AFQDEVAASEGFLKQPVGKDYKFGGPSVKDEKLFGVGTGMGLRKED--NE
562 . .: : . .
563
564 V_Harveyi_PATH LQAEVNKALAEMRADGTVEKISVKWFGADITK----
565 B_subtilis_YXEM LRKKVNKALDELRKDGTLKKLSEKYFNEDITVEQKH
566 B_subtilis_GlnH_homo_YCKK VVDQVNKALKEMKEDGTLSKISKKWFGEDVSK----
567 YA80_HAEIN LITKFNQVLEALRQDGTLKQISIEWFGYDITQ----
568 FLIY_ECOLI LLKAVNDAIAEMQKDGTLQALSEKWFGADVTK----
569 E_coli_GlnH LRDKVNGALKTLRENGTYNEIYKKWFGTEPK-----
570 Deinococcus_radiodurans LKDQIDKALTEMRSDGTFEKISQKWFGQDVGQP---
571 HISJ_E_COLI LREALNKAFAEMRADGTYEKLAKKYFDFDVYGG---
572 : .: .: :: :** . : ::*. :
573 """
574
575
576
577
578
579 phy_example = \
580 """ 8 286
581 V_Harveyi_ --MKNWIKVA VAAIA--LSA A--------- ---------T VQAATEVKVG
582 B_subtilis MKMKKWTVLV VAALLAVLSA CG-------- ----NGNSSS KEDDNVLHVG
583 B_subtilis MKKALLALFM VVSIAALAAC GAGNDNQSKD NAKDGDLWAS IKKKGVLTVG
584 YA80_HAEIN MKKLLFTTAL LTGAIAFSTF ---------- -SHAGEIADR VEKTKTLLVG
585 FLIY_ECOLI MKLAHLGRQA LMGVMAVALV AG---MSVKS FADEG-LLNK VKERGTLLVG
586 E_coli_Gln --MKSVLKVS LAALTLAFAV S--------- ---------S HAADKKLVVA
587 Deinococcu -MKKSLLSLK LSGLLVPSVL ALS------- -LSACSSPSS TLNQGTLKIA
588 HISJ_E_COL MKKLVLSLSL VLAFSSATAA F--------- ---------- AAIPQNIRIG
589
590 MSGRYFPFTF VKQ--DKLQG FEVDMWDEIG KRNDYKIEYV TANFSGLFGL
591 ATGQSYPFAY KEN--GKLTG FDVEVMEAVA KKIDMKLDWK LLEFSGLMGE
592 TEGTYEPFTY HDKDTDKLTG YDVEVITEVA KRLGLKVDFK ETQWGSMFAG
593 TEGTYAPFTF HDK-SGKLTG FDVEVIRKVA EKLGLKVEFK ETQWDAMYAG
594 LEGTYPPFSF QGD-DGKLTG FEVEFAQQLA KHLGVEASLK PTKWDGMLAS
595 TDTAFVPFEF KQG--DKYVG FDVDLWAAIA KELKLDYELK PMDFSGIIPA
596 MEGTYPPFTS KNE-QGELVG FDVDIAKAVA QKLNLKPEFV LTEWSGILAG
597 TDPTYAPFES KNS-QGELVG FDIDLAKELC KRINTQCTFV ENPLDALIPS
598
599 LETGRIDTIS NQITMTDARK AKYLFADPYV VDG-AQITVR KGNDSIQGVE
600 LQTGKLDTIS NQVAVTDERK ETYNFTKPYA YAG-TQIVVK KDNTDIKSVD
601 LNSKRFDVVA NQVG-KTDRE DKYDFSDKYT TSR-AVVVTK KDNNDIKSEA
602 LNAKRFDVIA NQTNPSPERL KKYSFTTPYN YSG-GVIVTK SSDNSIKSFE
603 LDSKRIDVVI NQVTISDERK KKYDFSTPYT ISGIQALVKK GNEGTIKTAD
604 LQTKNVDLAL AGITITDERK KAIDFSDGYY KSG-LLVMVK ANNNDVKSVK
605 LQANKYDVIV NQVGITPERQ NSIGFSQPYA YSRPEIIVAK NNTFNPQSLA
606 LKAKKIDAIM SSLSITEKRQ QEIAFTDKLY AADSRLVVAK NSDIQP-TVE
607
608 DLAGKTVAVN LGSNFEQLLR DYDKDGKINI KTYDT--GIE HDVALGRADA
609 DLKGKTVAAV LGSNHAKNLE SKDPDKKINI KTYETQEGTL KDVAYGRVDA
610 DVKGKTSAQS LTSNYNKLAT N----AGAKV EGVEGMAQAL QMIQQARVDM
611 DLKGRKSAQS ATSNWGKDAK A----AGAQI LVVDGLAQSL ELIKQGRAEA
612 DLKGKKVGVG LGTNYEEWLR QNV--QGVDV RTYDDDPTKY QDLRVGRIDA
613 DLDGKVVAVK SGTGSVDYAK AN--IKTKDL RQFPNIDNAY MELGTNRADA
614 DLKGKRVGST LGSNYEKQLI DTG---DIKI VTYPGAPEIL ADLVAGRIDA
615 SLKGKRVGVL QGTTQETFGN EHWAPKGIEI VSYQGQDNIY SDLTAGRIDA
616
617 FIMDRLSALE -LIKKT-GLP LQLAGEPFET I-----QNAW PFVDNEKGRK
618 YVNSRTVLIA -QIKKT-GLP LKLAGDPIVY E-----QVAF PFAKDDAHDK
619 TYNDKLAVLN -YLKTSGNKN VKIAFETGEP Q-----STYF TFRKGS--GE
620 TINDKLAVLD -YFKQHPNSG LKIAYDRGDK T-----PTAF AFLQGE--DA
621 ILVDRLAALD -LVKKT-NDT LAVTGEAFSR Q-----ESGV ALRKGN--ED
622 VLHDTPNILY -FIKTAGNGQ FKAVGDSLEA Q-----QYGI AFPKGS--DE
623 AYNDRLVVNY -IINDQ-KLP VRGAGQIGDA A-----PVGI ALKKGN--SA
624 AFQDEVAASE GFLKQPVGKD YKFGGPSVKD EKLFGVGTGM GLRKED--NE
625
626 LQAEVNKALA EMRADGTVEK ISVKWFGADI TK----
627 LRKKVNKALD ELRKDGTLKK LSEKYFNEDI TVEQKH
628 VVDQVNKALK EMKEDGTLSK ISKKWFGEDV SK----
629 LITKFNQVLE ALRQDGTLKQ ISIEWFGYDI TQ----
630 LLKAVNDAIA EMQKDGTLQA LSEKWFGADV TK----
631 LRDKVNGALK TLRENGTYNE IYKKWFGTEP K-----
632 LKDQIDKALT EMRSDGTFEK ISQKWFGQDV GQP---
633 LREALNKAFA EMRADGTYEK LAKKYFDFDV YGG---
634 """
635
636 nxs_example = \
637 """#NEXUS
638 BEGIN DATA;
639 dimensions ntax=8 nchar=286;
640 format missing=?
641 symbols="ABCDEFGHIKLMNPQRSTUVWXYZ"
642 interleave datatype=PROTEIN gap= -;
643
644 matrix
645 V_Harveyi_PATH --MKNWIKVAVAAIA--LSAA------------------TVQAATEVKVG
646 B_subtilis_YXEM MKMKKWTVLVVAALLAVLSACG------------NGNSSSKEDDNVLHVG
647 B_subtilis_GlnH_homo_YCKK MKKALLALFMVVSIAALAACGAGNDNQSKDNAKDGDLWASIKKKGVLTVG
648 YA80_HAEIN MKKLLFTTALLTGAIAFSTF-----------SHAGEIADRVEKTKTLLVG
649 FLIY_ECOLI MKLAHLGRQALMGVMAVALVAG---MSVKSFADEG-LLNKVKERGTLLVG
650 E_coli_GlnH --MKSVLKVSLAALTLAFAVS------------------SHAADKKLVVA
651 Deinococcus_radiodurans -MKKSLLSLKLSGLLVPSVLALS--------LSACSSPSSTLNQGTLKIA
652 HISJ_E_COLI MKKLVLSLSLVLAFSSATAAF-------------------AAIPQNIRIG
653
654 V_Harveyi_PATH MSGRYFPFTFVKQ--DKLQGFEVDMWDEIGKRNDYKIEYVTANFSGLFGL
655 B_subtilis_YXEM ATGQSYPFAYKEN--GKLTGFDVEVMEAVAKKIDMKLDWKLLEFSGLMGE
656 B_subtilis_GlnH_homo_YCKK TEGTYEPFTYHDKDTDKLTGYDVEVITEVAKRLGLKVDFKETQWGSMFAG
657 YA80_HAEIN TEGTYAPFTFHDK-SGKLTGFDVEVIRKVAEKLGLKVEFKETQWDAMYAG
658 FLIY_ECOLI LEGTYPPFSFQGD-DGKLTGFEVEFAQQLAKHLGVEASLKPTKWDGMLAS
659 E_coli_GlnH TDTAFVPFEFKQG--DKYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPA
660 Deinococcus_radiodurans MEGTYPPFTSKNE-QGELVGFDVDIAKAVAQKLNLKPEFVLTEWSGILAG
661 HISJ_E_COLI TDPTYAPFESKNS-QGELVGFDIDLAKELCKRINTQCTFVENPLDALIPS
662
663 V_Harveyi_PATH LETGRIDTISNQITMTDARKAKYLFADPYVVDG-AQITVRKGNDSIQGVE
664 B_subtilis_YXEM LQTGKLDTISNQVAVTDERKETYNFTKPYAYAG-TQIVVKKDNTDIKSVD
665 B_subtilis_GlnH_homo_YCKK LNSKRFDVVANQVG-KTDREDKYDFSDKYTTSR-AVVVTKKDNNDIKSEA
666 YA80_HAEIN LNAKRFDVIANQTNPSPERLKKYSFTTPYNYSG-GVIVTKSSDNSIKSFE
667 FLIY_ECOLI LDSKRIDVVINQVTISDERKKKYDFSTPYTISGIQALVKKGNEGTIKTAD
668 E_coli_GlnH LQTKNVDLALAGITITDERKKAIDFSDGYYKSG-LLVMVKANNNDVKSVK
669 Deinococcus_radiodurans LQANKYDVIVNQVGITPERQNSIGFSQPYAYSRPEIIVAKNNTFNPQSLA
670 HISJ_E_COLI LKAKKIDAIMSSLSITEKRQQEIAFTDKLYAADSRLVVAKNSDIQP-TVE
671
672 V_Harveyi_PATH DLAGKTVAVNLGSNFEQLLRDYDKDGKINIKTYDT--GIEHDVALGRADA
673 B_subtilis_YXEM DLKGKTVAAVLGSNHAKNLESKDPDKKINIKTYETQEGTLKDVAYGRVDA
674 B_subtilis_GlnH_homo_YCKK DVKGKTSAQSLTSNYNKLATN----AGAKVEGVEGMAQALQMIQQARVDM
675 YA80_HAEIN DLKGRKSAQSATSNWGKDAKA----AGAQILVVDGLAQSLELIKQGRAEA
676 FLIY_ECOLI DLKGKKVGVGLGTNYEEWLRQNV--QGVDVRTYDDDPTKYQDLRVGRIDA
677 E_coli_GlnH DLDGKVVAVKSGTGSVDYAKAN--IKTKDLRQFPNIDNAYMELGTNRADA
678 Deinococcus_radiodurans DLKGKRVGSTLGSNYEKQLIDTG---DIKIVTYPGAPEILADLVAGRIDA
679 HISJ_E_COLI SLKGKRVGVLQGTTQETFGNEHWAPKGIEIVSYQGQDNIYSDLTAGRIDA
680
681 V_Harveyi_PATH FIMDRLSALE-LIKKT-GLPLQLAGEPFETI-----QNAWPFVDNEKGRK
682 B_subtilis_YXEM YVNSRTVLIA-QIKKT-GLPLKLAGDPIVYE-----QVAFPFAKDDAHDK
683 B_subtilis_GlnH_homo_YCKK TYNDKLAVLN-YLKTSGNKNVKIAFETGEPQ-----STYFTFRKGS--GE
684 YA80_HAEIN TINDKLAVLD-YFKQHPNSGLKIAYDRGDKT-----PTAFAFLQGE--DA
685 FLIY_ECOLI ILVDRLAALD-LVKKT-NDTLAVTGEAFSRQ-----ESGVALRKGN--ED
686 E_coli_GlnH VLHDTPNILY-FIKTAGNGQFKAVGDSLEAQ-----QYGIAFPKGS--DE
687 Deinococcus_radiodurans AYNDRLVVNY-IINDQ-KLPVRGAGQIGDAA-----PVGIALKKGN--SA
688 HISJ_E_COLI AFQDEVAASEGFLKQPVGKDYKFGGPSVKDEKLFGVGTGMGLRKED--NE
689
690 V_Harveyi_PATH LQAEVNKALAEMRADGTVEKISVKWFGADITK----
691 B_subtilis_YXEM LRKKVNKALDELRKDGTLKKLSEKYFNEDITVEQKH
692 B_subtilis_GlnH_homo_YCKK VVDQVNKALKEMKEDGTLSKISKKWFGEDVSK----
693 YA80_HAEIN LITKFNQVLEALRQDGTLKQISIEWFGYDITQ----
694 FLIY_ECOLI LLKAVNDAIAEMQKDGTLQALSEKWFGADVTK----
695 E_coli_GlnH LRDKVNGALKTLRENGTYNEIYKKWFGTEPK-----
696 Deinococcus_radiodurans LKDQIDKALTEMRSDGTFEKISQKWFGQDVGQP---
697 HISJ_E_COLI LREALNKAFAEMRADGTYEKLAKKYFDFDVYGG---
698 ;
699 end;
700 """
701
702
703
704 nxs_example2 = \
705 """#NEXUS
706
707 Begin data;
708 Dimensions ntax=10 nchar=705;
709 Format datatype=dna interleave=yes gap=- missing=?;
710 Matrix
711 Cow ATGGCATATCCCATACAACTAGGATTCCAAGATGCAACATCACCAATCATAGAAGAACTA
712 Carp ATGGCACACCCAACGCAACTAGGTTTCAAGGACGCGGCCATACCCGTTATAGAGGAACTT
713 Chicken ATGGCCAACCACTCCCAACTAGGCTTTCAAGACGCCTCATCCCCCATCATAGAAGAGCTC
714 Human ATGGCACATGCAGCGCAAGTAGGTCTACAAGACGCTACTTCCCCTATCATAGAAGAGCTT
715 Loach ATGGCACATCCCACACAATTAGGATTCCAAGACGCGGCCTCACCCGTAATAGAAGAACTT
716 Mouse ATGGCCTACCCATTCCAACTTGGTCTACAAGACGCCACATCCCCTATTATAGAAGAGCTA
717 Rat ATGGCTTACCCATTTCAACTTGGCTTACAAGACGCTACATCACCTATCATAGAAGAACTT
718 Seal ATGGCATACCCCCTACAAATAGGCCTACAAGATGCAACCTCTCCCATTATAGAGGAGTTA
719 Whale ATGGCATATCCATTCCAACTAGGTTTCCAAGATGCAGCATCACCCATCATAGAAGAGCTC
720 Frog ATGGCACACCCATCACAATTAGGTTTTCAAGACGCAGCCTCTCCAATTATAGAAGAATTA
721
722 Cow CTTCACTTTCATGACCACACGCTAATAATTGTCTTCTTAATTAGCTCATTAGTACTTTAC
723 Carp CTTCACTTCCACGACCACGCATTAATAATTGTGCTCCTAATTAGCACTTTAGTTTTATAT
724 Chicken GTTGAATTCCACGACCACGCCCTGATAGTCGCACTAGCAATTTGCAGCTTAGTACTCTAC
725 Human ATCACCTTTCATGATCACGCCCTCATAATCATTTTCCTTATCTGCTTCCTAGTCCTGTAT
726 Loach CTTCACTTCCATGACCATGCCCTAATAATTGTATTTTTGATTAGCGCCCTAGTACTTTAT
727 Mouse ATAAATTTCCATGATCACACACTAATAATTGTTTTCCTAATTAGCTCCTTAGTCCTCTAT
728 Rat ACAAACTTTCATGACCACACCCTAATAATTGTATTCCTCATCAGCTCCCTAGTACTTTAT
729 Seal CTACACTTCCATGACCACACATTAATAATTGTGTTCCTAATTAGCTCATTAGTACTCTAC
730 Whale CTACACTTTCACGATCATACACTAATAATCGTTTTTCTAATTAGCTCTTTAGTTCTCTAC
731 Frog CTTCACTTCCACGACCATACCCTCATAGCCGTTTTTCTTATTAGTACGCTAGTTCTTTAC
732
733 Cow ATTATTTCACTAATACTAACGACAAAGCTGACCCATACAAGCACGATAGATGCACAAGAA
734 Carp ATTATTACTGCAATGGTATCAACTAAACTTACTAATAAATATATTCTAGACTCCCAAGAA
735 Chicken CTTCTAACTCTTATACTTATAGAAAAACTATCA---TCAAACACCGTAGATGCCCAAGAA
736 Human GCCCTTTTCCTAACACTCACAACAAAACTAACTAATACTAACATCTCAGACGCTCAGGAA
737 Loach GTTATTATTACAACCGTCTCAACAAAACTCACTAACATATATATTTTGGACTCACAAGAA
738 Mouse ATCATCTCGCTAATATTAACAACAAAACTAACACATACAAGCACAATAGATGCACAAGAA
739 Rat ATTATTTCACTAATACTAACAACAAAACTAACACACACAAGCACAATAGACGCCCAAGAA
740 Seal ATTATCTCACTTATACTAACCACGAAACTCACCCACACAAGTACAATAGACGCACAAGAA
741 Whale ATTATTACCCTAATGCTTACAACCAAATTAACACATACTAGTACAATAGACGCCCAAGAA
742 Frog ATTATTACTATTATAATAACTACTAAACTAACTAATACAAACCTAATGGACGCACAAGAG
743
744 Cow GTAGAGACAATCTGAACCATTCTGCCCGCCATCATCTTAATTCTAATTGCTCTTCCTTCT
745 Carp ATCGAAATCGTATGAACCATTCTACCAGCCGTCATTTTAGTACTAATCGCCCTGCCCTCC
746 Chicken GTTGAACTAATCTGAACCATCCTACCCGCTATTGTCCTAGTCCTGCTTGCCCTCCCCTCC
747 Human ATAGAAACCGTCTGAACTATCCTGCCCGCCATCATCCTAGTCCTCATCGCCCTCCCATCC
748 Loach ATTGAAATCGTATGAACTGTGCTCCCTGCCCTAATCCTCATTTTAATCGCCCTCCCCTCA
749 Mouse GTTGAAACCATTTGAACTATTCTACCAGCTGTAATCCTTATCATAATTGCTCTCCCCTCT
750 Rat GTAGAAACAATTTGAACAATTCTCCCAGCTGTCATTCTTATTCTAATTGCCCTTCCCTCC
751 Seal GTGGAAACGGTGTGAACGATCCTACCCGCTATCATTTTAATTCTCATTGCCCTACCATCA
752 Whale GTAGAAACTGTCTGAACTATCCTCCCAGCCATTATCTTAATTTTAATTGCCTTGCCTTCA
753 Frog ATCGAAATAGTGTGAACTATTATACCAGCTATTAGCCTCATCATAATTGCCCTTCCATCC
754
755 Cow TTACGAATTCTATACATAATAGATGAAATCAATAACCCATCTCTTACAGTAAAAACCATA
756 Carp CTACGCATCCTGTACCTTATAGACGAAATTAACGACCCTCACCTGACAATTAAAGCAATA
757 Chicken CTCCAAATCCTCTACATAATAGACGAAATCGACGAACCTGATCTCACCCTAAAAGCCATC
758 Human CTACGCATCCTTTACATAACAGACGAGGTCAACGATCCCTCCCTTACCATCAAATCAATT
759 Loach CTACGAATTCTATATCTTATAGACGAGATTAATGACCCCCACCTAACAATTAAGGCCATG
760 Mouse CTACGCATTCTATATATAATAGACGAAATCAACAACCCCGTATTAACCGTTAAAACCATA
761 Rat CTACGAATTCTATACATAATAGACGAGATTAATAACCCAGTTCTAACAGTAAAAACTATA
762 Seal TTACGAATCCTCTACATAATGGACGAGATCAATAACCCTTCCTTGACCGTAAAAACTATA
763 Whale TTACGGATCCTTTACATAATAGACGAAGTCAATAACCCCTCCCTCACTGTAAAAACAATA
764 Frog CTTCGTATCCTATATTTAATAGATGAAGTTAATGATCCACACTTAACAATTAAAGCAATC
765
766 Cow GGACATCAGTGATACTGAAGCTATGAGTATACAGATTATGAGGACTTAAGCTTCGACTCC
767 Carp GGACACCAATGATACTGAAGTTACGAGTATACAGACTATGAAAATCTAGGATTCGACTCC
768 Chicken GGACACCAATGATACTGAACCTATGAATACACAGACTTCAAGGACCTCTCATTTGACTCC
769 Human GGCCACCAATGGTACTGAACCTACGAGTACACCGACTACGGCGGACTAATCTTCAACTCC
770 Loach GGGCACCAATGATACTGAAGCTACGAGTATACTGATTATGAAAACTTAAGTTTTGACTCC
771 Mouse GGGCACCAATGATACTGAAGCTACGAATATACTGACTATGAAGACCTATGCTTTGATTCA
772 Rat GGACACCAATGATACTGAAGCTATGAATATACTGACTATGAAGACCTATGCTTTGACTCC
773 Seal GGACATCAGTGATACTGAAGCTATGAGTACACAGACTACGAAGACCTGAACTTTGACTCA
774 Whale GGTCACCAATGATATTGAAGCTATGAGTATACCGACTACGAAGACCTAAGCTTCGACTCC
775 Frog GGCCACCAATGATACTGAAGCTACGAATATACTAACTATGAGGATCTCTCATTTGACTCT
776
777 Cow TACATAATTCCAACATCAGAATTAAAGCCAGGGGAGCTACGACTATTAGAAGTCGATAAT
778 Carp TATATAGTACCAACCCAAGACCTTGCCCCCGGACAATTCCGACTTCTGGAAACAGACCAC
779 Chicken TACATAACCCCAACAACAGACCTCCCCCTAGGCCACTTCCGCCTACTAGAAGTCGACCAT
780 Human TACATACTTCCCCCATTATTCCTAGAACCAGGCGACCTGCGACTCCTTGACGTTGACAAT
781 Loach TACATAATCCCCACCCAGGACCTAACCCCTGGACAATTCCGGCTACTAGAGACAGACCAC
782 Mouse TATATAATCCCAACAAACGACCTAAAACCTGGTGAACTACGACTGCTAGAAGTTGATAAC
783 Rat TACATAATCCCAACCAATGACCTAAAACCAGGTGAACTTCGTCTATTAGAAGTTGATAAT
784 Seal TATATGATCCCCACACAAGAACTAAAGCCCGGAGAACTACGACTGCTAGAAGTAGACAAT
785 Whale TATATAATCCCAACATCAGACCTAAAGCCAGGAGAACTACGATTATTAGAAGTAGATAAC
786 Frog TATATAATTCCAACTAATGACCTTACCCCTGGACAATTCCGGCTGCTAGAAGTTGATAAT
787
788 Cow CGAGTTGTACTACCAATAGAAATAACAATCCGAATGTTAGTCTCCTCTGAAGACGTATTA
789 Carp CGAATAGTTGTTCCAATAGAATCCCCAGTCCGTGTCCTAGTATCTGCTGAAGACGTGCTA
790 Chicken CGCATTGTAATCCCCATAGAATCCCCCATTCGAGTAATCATCACCGCTGATGACGTCCTC
791 Human CGAGTAGTACTCCCGATTGAAGCCCCCATTCGTATAATAATTACATCACAAGACGTCTTG
792 Loach CGAATGGTTGTTCCCATAGAATCCCCTATTCGCATTCTTGTTTCCGCCGAAGATGTACTA
793 Mouse CGAGTCGTTCTGCCAATAGAACTTCCAATCCGTATATTAATTTCATCTGAAGACGTCCTC
794 Rat CGGGTAGTCTTACCAATAGAACTTCCAATTCGTATACTAATCTCATCCGAAGACGTCCTG
795 Seal CGAGTAGTCCTCCCAATAGAAATAACAATCCGCATACTAATCTCATCAGAAGATGTACTC
796 Whale CGAGTTGTCTTACCTATAGAAATAACAATCCGAATATTAGTCTCATCAGAAGACGTACTC
797 Frog CGAATAGTAGTCCCAATAGAATCTCCAACCCGACTTTTAGTTACAGCCGAAGACGTCCTC
798
799 Cow CACTCATGAGCTGTGCCCTCTCTAGGACTAAAAACAGACGCAATCCCAGGCCGTCTAAAC
800 Carp CATTCTTGAGCTGTTCCATCCCTTGGCGTAAAAATGGACGCAGTCCCAGGACGACTAAAT
801 Chicken CACTCATGAGCCGTACCCGCCCTCGGGGTAAAAACAGACGCAATCCCTGGACGACTAAAT
802 Human CACTCATGAGCTGTCCCCACATTAGGCTTAAAAACAGATGCAATTCCCGGACGTCTAAAC
803 Loach CACTCCTGGGCCCTTCCAGCCATGGGGGTAAAGATAGACGCGGTCCCAGGACGCCTTAAC
804 Mouse CACTCATGAGCAGTCCCCTCCCTAGGACTTAAAACTGATGCCATCCCAGGCCGACTAAAT
805 Rat CACTCATGAGCCATCCCTTCACTAGGGTTAAAAACCGACGCAATCCCCGGCCGCCTAAAC
806 Seal CACTCATGAGCCGTACCGTCCCTAGGACTAAAAACTGATGCTATCCCAGGACGACTAAAC
807 Whale CACTCATGGGCCGTACCCTCCTTGGGCCTAAAAACAGATGCAATCCCAGGACGCCTAAAC
808 Frog CACTCGTGAGCTGTACCCTCCTTGGGTGTCAAAACAGATGCAATCCCAGGACGACTTCAT
809
810 Cow CAAACAACCCTTATATCGTCCCGTCCAGGCTTATATTACGGTCAATGCTCAGAAATTTGC
811 Carp CAAGCCGCCTTTATTGCCTCACGCCCAGGGGTCTTTTACGGACAATGCTCTGAAATTTGT
812 Chicken CAAACCTCCTTCATCACCACTCGACCAGGAGTGTTTTACGGACAATGCTCAGAAATCTGC
813 Human CAAACCACTTTCACCGCTACACGACCGGGGGTATACTACGGTCAATGCTCTGAAATCTGT
814 Loach CAAACCGCCTTTATTGCCTCCCGCCCCGGGGTATTCTATGGGCAATGCTCAGAAATCTGT
815 Mouse CAAGCAACAGTAACATCAAACCGACCAGGGTTATTCTATGGCCAATGCTCTGAAATTTGT
816 Rat CAAGCTACAGTCACATCAAACCGACCAGGTCTATTCTATGGCCAATGCTCTGAAATTTGC
817 Seal CAAACAACCCTAATAACCATACGACCAGGACTGTACTACGGTCAATGCTCAGAAATCTGT
818 Whale CAAACAACCTTAATATCAACACGACCAGGCCTATTTTATGGACAATGCTCAGAGATCTGC
819 Frog CAAACATCATTTATTGCTACTCGTCCGGGAGTATTTTACGGACAATGTTCAGAAATTTGC
820
821 Cow GGGTCAAACCACAGTTTCATACCCATTGTCCTTGAGTTAGTCCCACTAAAGTACTTTGAA
822 Carp GGAGCTAATCACAGCTTTATACCAATTGTAGTTGAAGCAGTACCTCTCGAACACTTCGAA
823 Chicken GGAGCTAACCACAGCTACATACCCATTGTAGTAGAGTCTACCCCCCTAAAACACTTTGAA
824 Human GGAGCAAACCACAGTTTCATGCCCATCGTCCTAGAATTAATTCCCCTAAAAATCTTTGAA
825 Loach GGAGCAAACCACAGCTTTATACCCATCGTAGTAGAAGCGGTCCCACTATCTCACTTCGAA
826 Mouse GGATCTAACCATAGCTTTATGCCCATTGTCCTAGAAATGGTTCCACTAAAATATTTCGAA
827 Rat GGCTCAAATCACAGCTTCATACCCATTGTACTAGAAATAGTGCCTCTAAAATATTTCGAA
828 Seal GGTTCAAACCACAGCTTCATACCTATTGTCCTCGAATTGGTCCCACTATCCCACTTCGAG
829 Whale GGCTCAAACCACAGTTTCATACCAATTGTCCTAGAACTAGTACCCCTAGAAGTCTTTGAA
830 Frog GGAGCAAACCACAGCTTTATACCAATTGTAGTTGAAGCAGTACCGCTAACCGACTTTGAA
831
832 Cow AAATGATCTGCGTCAATATTA---------------------TAA
833 Carp AACTGATCCTCATTAATACTAGAAGACGCCTCGCTAGGAAGCTAA
834 Chicken GCCTGATCCTCACTA------------------CTGTCATCTTAA
835 Human ATA---------------------GGGCCCGTATTTACCCTATAG
836 Loach AACTGGTCCACCCTTATACTAAAAGACGCCTCACTAGGAAGCTAA
837 Mouse AACTGATCTGCTTCAATAATT---------------------TAA
838 Rat AACTGATCAGCTTCTATAATT---------------------TAA
839 Seal AAATGATCTACCTCAATGCTT---------------------TAA
840 Whale AAATGATCTGTATCAATACTA---------------------TAA
841 Frog AACTGATCTTCATCAATACTA---GAAGCATCACTA------AGA
842 ;
843 End;
844 """
nxs_example3 = \
"""#NEXUS

Begin data;
Dimensions ntax=10 nchar=234;
Format datatype=protein gap=- interleave;
Matrix
Cow     MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE
Carp    MAHPTQLGFKDAAMPVMEELLHFHDHALMIVLLISTLVLYIITAMVSTKLTNKYILDSQE
Chicken MANHSQLGFQDASSPIMEELVEFHDHALMVALAICSLVLYLLTLMLMEKLS-SNTVDAQE
Human   MAHAAQVGLQDATSPIMEELITFHDHALMIIFLICFLVLYALFLTLTTKLTNTNISDAQE
Loach   MAHPTQLGFQDAASPVMEELLHFHDHALMIVFLISALVLYVIITTVSTKLTNMYILDSQE
Mouse   MAYPFQLGLQDATSPIMEELMNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE
Rat     MAYPFQLGLQDATSPIMEELTNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE
Seal    MAYPLQMGLQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE
Whale   MAYPFQLGFQDAASPIMEELLHFHDHTLMIVFLISSLVLYIITLMLTTKLTHTSTMDAQE
Frog    MAHPSQLGFQDAASPIMEELLHFHDHTLMAVFLISTLVLYIITIMMTTKLTNTNLMDAQE

Cow     VETIWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLSFDS
Carp    IEIVWTILPAVILVLIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLGFDS
Chicken VELIWTILPAIVLVLLALPSLQILYMMDEIDEPDLTLKAIGHQWYWTYEYTDFKDLSFDS
Human   METVWTILPAIILVLIALPSLRILYMTDEVNDPSLTIKSIGHQWYWTYEYTDYGGLIFNS
Loach   IEIVWTVLPALILILIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLSFDS
Mouse   VETIWTILPAVILIMIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDS
Rat     VETIWTILPAVILILIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDS
Seal    VETVWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLNFDS
Whale   VETVWTILPAIILILIALPSLRILYMMDEVNNPSLTVKTMGHQWYWSYEYTDYEDLSFDS
Frog    IEMVWTIMPAISLIMIALPSLRILYLMDEVNDPHLTIKAIGHQWYWSYEYTNYEDLSFDS

Cow     YMIPTSELKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLN
Carp    YMVPTQDLAPGQFRLLETDHRMVVPMESPVRVLVSAEDVLHSWAVPSLGVKMDAVPGRLN
Chicken YMTPTTDLPLGHFRLLEVDHRIVIPMESPIRVIITADDVLHSWAVPALGVKTDAIPGRLN
Human   YMLPPLFLEPGDLRLLDVDNRVVLPIEAPIRMMITSQDVLHSWAVPTLGLKTDAIPGRLN
Loach   YMIPTQDLTPGQFRLLETDHRMVVPMESPIRILVSAEDVLHSWALPAMGVKMDAVPGRLN
Mouse   YMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAVPSLGLKTDAIPGRLN
Rat     YMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAIPSLGLKTDAIPGRLN
Seal    YMIPTQELKPGELRLLEVDNRVVLPMEMTIRMLISSEDVLHSWAVPSLGLKTDAIPGRLN
Whale   YMIPTSDLKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLN
Frog    YMIPTNDLTPGQFRLLEVDNRMVVPMESPTRLLVTAEDVLHSWAVPSLGVKTDAIPGRLH

Cow     QTTLMSSRPGLYYGQCSEICGSNHSFMPIVLELVPLKYFEKWSASML-------
Carp    QAAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLEHFENWSSLMLEDASLGS
Chicken QTSFITTRPGVFYGQCSEICGANHSYMPIVVESTPLKHFEAWSSL------LSS
Human   QTTFTATRPGVYYGQCSEICGANHSFMPIVLELIPLKIFEM-------GPVFTL
Loach   QTAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLSHFENWSTLMLKDASLGS
Mouse   QATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI-------
Rat     QATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI-------
Seal    QTTLMTMRPGLYYGQCSEICGSNHSFMPIVLELVPLSHFEKWSTSML-------
Whale   QTTLMSTRPGLFYGQCSEICGSNHSFMPIVLELVPLEVFEKWSVSML-------
Frog    QTSFIATRPGVFYGQCSEICGANHSFMPIVVEAVPLTDFENWSSSML-EASL--
;
End;
"""

sth_example = \
"""# STOCKHOLM 1.0
#=GF ID CBS
#=GF AC PF00571
#=GF DE CBS domain
#=GF AU Bateman A
#=GF CC CBS domains are small intracellular modules mostly found
#=GF CC in 2 or four copies within a protein.
#=GF SQ 67
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071
#=GS O31698/88-139 AC O31698
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246          MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRVPVYERS
#=GR O83071/192-246 SA  999887756453524252..55152525....36463774777
O83071/259-312          MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVAIVLDEY
#=GR O83071/259-312 SS  CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEE
O31698/18-71            MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAIPVLDPS
#=GR O31698/18-71 SS    CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH
O31698/88-139           EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31698/88-139 SS   CCCCCCCHHHHHHHHHHH..HEEEEEEE....EEEEEEEEEEH
#=GC SS_cons            CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH
O31699/88-139           EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31699/88-139 AS   ________________*__________________________
#=GR_O31699/88-139_IN   ____________1______________2__________0____
//
"""

sth_example2 = \
"""# STOCKHOLM 1.0
#=GC SS_cons        .................<<<<<<<<...<<<<<<<........>>>>>>>..
AP001509.1          UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGU
#=GR AP001509.1 SS  -----------------<<<<<<<<---..<<-<<-------->>->>..--
AE007476.1          AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGU
#=GR AE007476.1 SS  -----------------<<<<<<<<-----<<.<<-------->>.>>----

#=GC SS_cons        ......<<<<<<<.......>>>>>>>..>>>>>>>>...............
AP001509.1          CUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU
#=GR AP001509.1 SS  -------<<<<<--------->>>>>--->>>>>>>>---------------
AE007476.1          UUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU
#=GR AE007476.1 SS  ------.<<<<<--------->>>>>.-->>>>>>>>---------------
//"""

952 gbk_example = \
953 """LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999
954 DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
955 (AXL2) and Rev7p (REV7) genes, complete cds.
956 ACCESSION U49845
957 VERSION U49845.1 GI:1293613
958 KEYWORDS .
959 SOURCE Saccharomyces cerevisiae (baker's yeast)
960 ORGANISM Saccharomyces cerevisiae
961 Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
962 Saccharomycetales; Saccharomycetaceae; Saccharomyces.
963 REFERENCE 1 (bases 1 to 5028)
964 AUTHORS Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
965 TITLE Cloning and sequence of REV7, a gene whose function is required for
966 DNA damage-induced mutagenesis in Saccharomyces cerevisiae
967 JOURNAL Yeast 10 (11), 1503-1509 (1994)
968 PUBMED 7871890
969 REFERENCE 2 (bases 1 to 5028)
970 AUTHORS Roemer,T., Madden,K., Chang,J. and Snyder,M.
971 TITLE Selection of axial growth sites in yeast requires Axl2p, a novel
972 plasma membrane glycoprotein
973 JOURNAL Genes Dev. 10 (7), 777-793 (1996)
974 PUBMED 8846915
975 REFERENCE 3 (bases 1 to 5028)
976 AUTHORS Roemer,T.
977 TITLE Direct Submission
978 JOURNAL Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
979 Haven, CT, USA
980 FEATURES Location/Qualifiers
981 source 1..5028
982 /organism="Saccharomyces cerevisiae"
983 /db_xref="taxon:4932"
984 /chromosome="IX"
985 /map="9"
986 CDS <1..206
987 /codon_start=3
988 /product="TCP1-beta"
989 /protein_id="AAA98665.1"
990 /db_xref="GI:1293614"
991 /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
992 AEVLLRVDNIIRARPRTANRQHM"
993 gene 687..3158
994 /gene="AXL2"
995 CDS 687..3158
996 /gene="AXL2"
997 /note="plasma membrane glycoprotein"
998 /codon_start=1
999 /function="required for axial budding pattern of S.
1000 cerevisiae"
1001 /product="Axl2p"
1002 /protein_id="AAA98666.1"
1003 /db_xref="GI:1293615"
1004 /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
1005 TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
1006 VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
1007 VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
1008 TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
1009 YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
1010 DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
1011 DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
1012 NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
1013 CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
1014 NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
1015 SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
1016 YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
1017 HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
1018 VDFSNKSNVNVGQVKDIHGRIPEML"
1019 gene complement(3300..4037)
1020 /gene="REV7"
1021 CDS complement(3300..4037)
1022 /gene="REV7"
1023 /codon_start=1
1024 /product="Rev7p"
1025 /protein_id="AAA98667.1"
1026 /db_xref="GI:1293616"
1027 /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ
1028 FVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVD
1029 KDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNR
1030 RVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEK
1031 LISGDDKILNGVYSQYEEGESIFGSLF"
1032 ORIGIN
1033 1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
1034 61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
1035 121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
1036 181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
1037 241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa
1038 301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa
1039 361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat
1040 421 aatacccatc gtaggtatgg ttaaagatag catctccaca acctcaaagc tccttgccga
1041 481 gagtcgccct cctttgtcga gtaattttca cttttcatat gagaacttat tttcttattc
1042 541 tttactctca catcctgtag tgattgacac tgcaacagcc accatcacta gaagaacaga
1043 601 acaattactt aatagaaaaa ttatatcttc ctcgaaacga tttcctgctt ccaacatcta
1044 661 cgtatatcaa gaagcattca cttaccatga cacagcttca gatttcatta ttgctgacag
1045 721 ctactatatc actactccat ctagtagtgg ccacgcccta tgaggcatat cctatcggaa
1046 781 aacaataccc cccagtggca agagtcaatg aatcgtttac atttcaaatt tccaatgata
1047 841 cctataaatc gtctgtagac aagacagctc aaataacata caattgcttc gacttaccga
1048 901 gctggctttc gtttgactct agttctagaa cgttctcagg tgaaccttct tctgacttac
1049 961 tatctgatgc gaacaccacg ttgtatttca atgtaatact cgagggtacg gactctgccg
1050 1021 acagcacgtc tttgaacaat acataccaat ttgttgttac aaaccgtcca tccatctcgc
1051 1081 tatcgtcaga tttcaatcta ttggcgttgt taaaaaacta tggttatact aacggcaaaa
1052 1141 acgctctgaa actagatcct aatgaagtct tcaacgtgac ttttgaccgt tcaatgttca
1053 1201 ctaacgaaga atccattgtg tcgtattacg gacgttctca gttgtataat gcgccgttac
1054 1261 ccaattggct gttcttcgat tctggcgagt tgaagtttac tgggacggca ccggtgataa
1055 1321 actcggcgat tgctccagaa acaagctaca gttttgtcat catcgctaca gacattgaag
1056 1381 gattttctgc cgttgaggta gaattcgaat tagtcatcgg ggctcaccag ttaactacct
1057 1441 ctattcaaaa tagtttgata atcaacgtta ctgacacagg taacgtttca tatgacttac
1058 1501 ctctaaacta tgtttatctc gatgacgatc ctatttcttc tgataaattg ggttctataa
1059 1561 acttattgga tgctccagac tgggtggcat tagataatgc taccatttcc gggtctgtcc
1060 1621 cagatgaatt actcggtaag aactccaatc ctgccaattt ttctgtgtcc atttatgata
1061 1681 cttatggtga tgtgatttat ttcaacttcg aagttgtctc cacaacggat ttgtttgcca
1062 1741 ttagttctct tcccaatatt aacgctacaa ggggtgaatg gttctcctac tattttttgc
1063 1801 cttctcagtt tacagactac gtgaatacaa acgtttcatt agagtttact aattcaagcc
1064 1861 aagaccatga ctgggtgaaa ttccaatcat ctaatttaac attagctgga gaagtgccca
1065 1921 agaatttcga caagctttca ttaggtttga aagcgaacca aggttcacaa tctcaagagc
1066 1981 tatattttaa catcattggc atggattcaa agataactca ctcaaaccac agtgcgaatg
1067 2041 caacgtccac aagaagttct caccactcca cctcaacaag ttcttacaca tcttctactt
1068 2101 acactgcaaa aatttcttct acctccgctg ctgctacttc ttctgctcca gcagcgctgc
1069 2161 cagcagccaa taaaacttca tctcacaata aaaaagcagt agcaattgcg tgcggtgttg
1070 2221 ctatcccatt aggcgttatc ctagtagctc tcatttgctt cctaatattc tggagacgca
1071 2281 gaagggaaaa tccagacgat gaaaacttac cgcatgctat tagtggacct gatttgaata
1072 2341 atcctgcaaa taaaccaaat caagaaaacg ctacaccttt gaacaacccc tttgatgatg
1073 2401 atgcttcctc gtacgatgat acttcaatag caagaagatt ggctgctttg aacactttga
1074 2461 aattggataa ccactctgcc actgaatctg atatttccag cgtggatgaa aagagagatt
1075 2521 ctctatcagg tatgaataca tacaatgatc agttccaatc ccaaagtaaa gaagaattat
1076 2581 tagcaaaacc cccagtacag cctccagaga gcccgttctt tgacccacag aataggtctt
1077 2641 cttctgtgta tatggatagt gaaccagcag taaataaatc ctggcgatat actggcaacc
1078 2701 tgtcaccagt ctctgatatt gtcagagaca gttacggatc acaaaaaact gttgatacag
1079 2761 aaaaactttt cgatttagaa gcaccagaga aggaaaaacg tacgtcaagg gatgtcacta
1080 2821 tgtcttcact ggacccttgg aacagcaata ttagcccttc tcccgtaaga aaatcagtaa
1081 2881 caccatcacc atataacgta acgaagcatc gtaaccgcca cttacaaaat attcaagact
1082 2941 ctcaaagcgg taaaaacgga atcactccca caacaatgtc aacttcatct tctgacgatt
1083 3001 ttgttccggt taaagatggt gaaaattttt gctgggtcca tagcatggaa ccagacagaa
1084 3061 gaccaagtaa gaaaaggtta gtagattttt caaataagag taatgtcaat gttggtcaag
1085 3121 ttaaggacat tcacggacgc atcccagaaa tgctgtgatt atacgcaacg atattttgct
1086 3181 taattttatt ttcctgtttt attttttatt agtggtttac agatacccta tattttattt
1087 3241 agtttttata cttagagaca tttaatttta attccattct tcaaatttca tttttgcact
1088 3301 taaaacaaag atccaaaaat gctctcgccc tcttcatatt gagaatacac tccattcaaa
1089 3361 attttgtcgt caccgctgat taatttttca ctaaactgat gaataatcaa aggccccacg
1090 3421 tcagaaccga ctaaagaagt gagttttatt ttaggaggtt gaaaaccatt attgtctggt
1091 3481 aaattttcat cttcttgaca tttaacccag tttgaatccc tttcaatttc tgctttttcc
1092 3541 tccaaactat cgaccctcct gtttctgtcc aacttatgtc ctagttccaa ttcgatcgca
1093 3601 ttaataactg cttcaaatgt tattgtgtca tcgttgactt taggtaattt ctccaaatgc
1094 3661 ataatcaaac tatttaagga agatcggaat tcgtcgaaca cttcagtttc cgtaatgatc
1095 3721 tgatcgtctt tatccacatg ttgtaattca ctaaaatcta aaacgtattt ttcaatgcat
1096 3781 aaatcgttct ttttattaat aatgcagatg gaaaatctgt aaacgtgcgt taatttagaa
1097 3841 agaacatcca gtataagttc ttctatatag tcaattaaag caggatgcct attaatggga
1098 3901 acgaactgcg gcaagttgaa tgactggtaa gtagtgtagt cgaatgactg aggtgggtat
1099 3961 acatttctat aaaataaaat caaattaatg tagcatttta agtataccct cagccacttc
1100 4021 tctacccatc tattcataaa gctgacgcaa cgattactat tttttttttc ttcttggatc
1101 4081 tcagtcgtcg caaaaacgta taccttcttt ttccgacctt ttttttagct ttctggaaaa
1102 4141 gtttatatta gttaaacagg gtctagtctt agtgtgaaag ctagtggttt cgattgactg
1103 4201 atattaagaa agtggaaatt aaattagtag tgtagacgta tatgcatatg tatttctcgc
1104 4261 ctgtttatgt ttctacgtac ttttgattta tagcaagggg aaaagaaata catactattt
1105 4321 tttggtaaag gtgaaagcat aatgtaaaag ctagaataaa atggacgaaa taaagagagg
1106 4381 cttagttcat cttttttcca aaaagcaccc aatgataata actaaaatga aaaggatttg
1107 4441 ccatctgtca gcaacatcag ttgtgtgagc aataataaaa tcatcacctc cgttgccttt
1108 4501 agcgcgtttg tcgtttgtat cttccgtaat tttagtctta tcaatgggaa tcataaattt
1109 4561 tccaatgaat tagcaatttc gtccaattct ttttgagctt cttcatattt gctttggaat
1110 4621 tcttcgcact tcttttccca ttcatctctt tcttcttcca aagcaacgat ccttctaccc
1111 4681 atttgctcag agttcaaatc ggcctctttc agtttatcca ttgcttcctt cagtttggct
1112 4741 tcactgtctt ctagctgttg ttctagatcc tggtttttct tggtgtagtt ctcattatta
1113 4801 gatctcaagt tattggagtc ttcagccaat tgctttgtat cagacaattg actctctaac
1114 4861 ttctccactt cactgtcgag ttgctcgttt ttagcggaca aagatttaat ctcgttttct
1115 4921 ttttcagtgt tagattgctc taattctttg agctgttctc tcagctcctc atatttttct
1116 4981 tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc
1117 //"""
1118
1119
1120
1121 gbk_example2 = \
1122 """LOCUS AAD51968 143 aa linear BCT 21-AUG-2001
1123 DEFINITION transcriptional regulator RovA [Yersinia enterocolitica].
1124 ACCESSION AAD51968
1125 VERSION AAD51968.1 GI:5805369
1126 DBSOURCE locus AF171097 accession AF171097.1
1127 KEYWORDS .
1128 SOURCE Yersinia enterocolitica
1129 ORGANISM Yersinia enterocolitica
1130 Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
1131 Enterobacteriaceae; Yersinia.
1132 REFERENCE 1 (residues 1 to 143)
1133 AUTHORS Revell,P.A. and Miller,V.L.
1134 TITLE A chromosomally encoded regulator is required for expression of the
1135 Yersinia enterocolitica inv gene and for virulence
1136 JOURNAL Mol. Microbiol. 35 (3), 677-685 (2000)
1137 MEDLINE 20138369
1138 PUBMED 10672189
1139 REFERENCE 2 (residues 1 to 143)
1140 AUTHORS Revell,P.A. and Miller,V.L.
1141 TITLE Direct Submission
1142 JOURNAL Submitted (22-JUL-1999) Molecular Microbiology, Washington
1143 University School of Medicine, Campus Box 8230, 660 South Euclid,
1144 St. Louis, MO 63110, USA
1145 COMMENT Method: conceptual translation.
1146 FEATURES Location/Qualifiers
1147 source 1..143
1148 /organism="Yersinia enterocolitica"
1149 /mol_type="unassigned DNA"
1150 /strain="JB580v"
1151 /serotype="O:8"
1152 /db_xref="taxon:630"
1153 Protein 1..143
1154 /product="transcriptional regulator RovA"
1155 /name="regulates inv expression"
1156 CDS 1..143
1157 /gene="rovA"
1158 /coded_by="AF171097.1:380..811"
1159 /note="regulator of virulence"
1160 /transl_table=11
1161 ORIGIN
1162 1 mestlgsdla rlvrvwrali dhrlkplelt qthwvtlhni nrlppeqsqi qlakaigieq
1163 61 pslvrtldql eekglitrht candrrakri klteqsspii eqvdgvicst rkeilggisp
1164 121 deiellsgli dklerniiql qsk
1165 //"""
1166
1167
1168 swiss_example = \
1169 """ID 104K_THEAN Reviewed; 893 AA.
1170 AC Q4U9M9;
1171 DT 18-APR-2006, integrated into UniProtKB/Swiss-Prot.
1172 DT 05-JUL-2005, sequence version 1.
1173 DT 31-OCT-2006, entry version 8.
1174 DE 104 kDa microneme-rhoptry antigen precursor (p104).
1175 GN ORFNames=TA08425;
1176 OS Theileria annulata.
1177 OC Eukaryota; Alveolata; Apicomplexa; Piroplasmida; Theileriidae;
1178 OC Theileria.
1179 OX NCBI_TaxID=5874;
1180 RN [1]
1181 RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
1182 RC STRAIN=Ankara;
1183 RX PubMed=15994557; DOI=10.1126/science.1110418;
1184 RA Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W.,
1185 RA Kerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M.,
1186 RA Coulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N.,
1187 RA Gardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F.,
1188 RA Larke N., Lord A., Maser P., McKellar S., Mooney P., Morton F.,
1189 RA Nene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E.,
1190 RA Rawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R.,
1191 RA Squares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E.,
1192 RA Langsley G., Rajandream M.A., McKeever D., Shiels B., Tait A.,
1193 RA Barrell B.G., Hall N.;
1194 RT "Genome of the host-cell transforming parasite Theileria annulata
1195 RT compared with T. parva.";
1196 RL Science 309:131-133(2005).
1197 CC -!- SUBCELLULAR LOCATION: Cell membrane; lipid-anchor; GPI-anchor
1198 CC (Potential). In microneme/rhoptry complexes (By similarity).
1199 DR EMBL; CR940353; CAI76474.1; -; Genomic_DNA.
1200 DR InterPro; IPR007480; DUF529.
1201 DR Pfam; PF04385; FAINT; 4.
1202 KW Complete proteome; GPI-anchor; Lipoprotein; Membrane; Repeat; Signal;
1203 KW Sporozoite.
1204 FT SIGNAL 1 19 Potential.
1205 FT CHAIN 20 873 104 kDa microneme-rhoptry antigen.
1206 FT /FTId=PRO_0000232680.
1207 FT PROPEP 874 893 Removed in mature form (Potential).
1208 FT /FTId=PRO_0000232681.
1209 FT COMPBIAS 215 220 Poly-Leu.
1210 FT COMPBIAS 486 683 Lys-rich.
1211 FT COMPBIAS 854 859 Poly-Arg.
1212 FT LIPID 873 873 GPI-anchor amidated aspartate
1213 FT (Potential).
1214 SQ SEQUENCE 893 AA; 101921 MW; 2F67CEB3B02E7AC1 CRC64;
1215 MKFLVLLFNI LCLFPILGAD ELVMSPIPTT DVQPKVTFDI NSEVSSGPLY LNPVEMAGVK
1216 YLQLQRQPGV QVHKVVEGDI VIWENEEMPL YTCAIVTQNE VPYMAYVELL EDPDLIFFLK
1217 EGDQWAPIPE DQYLARLQQL RQQIHTESFF SLNLSFQHEN YKYEMVSSFQ HSIKMVVFTP
1218 KNGHICKMVY DKNIRIFKAL YNEYVTSVIG FFRGLKLLLL NIFVIDDRGM IGNKYFQLLD
1219 DKYAPISVQG YVATIPKLKD FAEPYHPIIL DISDIDYVNF YLGDATYHDP GFKIVPKTPQ
1220 CITKVVDGNE VIYESSNPSV ECVYKVTYYD KKNESMLRLD LNHSPPSYTS YYAKREGVWV
1221 TSTYIDLEEK IEELQDHRST ELDVMFMSDK DLNVVPLTNG NLEYFMVTPK PHRDIIIVFD
1222 GSEVLWYYEG LENHLVCTWI YVTEGAPRLV HLRVKDRIPQ NTDIYMVKFG EYWVRISKTQ
1223 YTQEIKKLIK KSKKKLPSIE EEDSDKHGGP PKGPEPPTGP GHSSSESKEH EDSKESKEPK
1224 EHGSPKETKE GEVTKKPGPA KEHKPSKIPV YTKRPEFPKK SKSPKRPESP KSPKRPVSPQ
1225 RPVSPKSPKR PESLDIPKSP KRPESPKSPK RPVSPQRPVS PRRPESPKSP KSPKSPKSPK
1226 VPFDPKFKEK LYDSYLDKAA KTKETVTLPP VLPTDESFTH TPIGEPTAEQ PDDIEPIEES
1227 VFIKETGILT EEVKTEDIHS ETGEPEEPKR PDSPTKHSPK PTGTHPSMPK KRRRSDGLAL
1228 STTDLESEAG RILRDPTGKI VTMKRSKSFD DLTTVREKEH MGAEIRKIVV DDDGTEADDE
1229 DTHPSKEKHL STVRRRRPRP KKSSKSSKPR KPDSAFVPSI IFIFLVSLIV GIL
1230 //
1231 ID 104K_THEPA Reviewed; 924 AA.
1232 AC P15711; Q4N2B5;
1233 DT 01-APR-1990, integrated into UniProtKB/Swiss-Prot.
1234 DT 01-APR-1990, sequence version 1.
1235 DT 31-OCT-2006, entry version 31.
1236 DE 104 kDa microneme-rhoptry antigen precursor (p104).
1237 GN OrderedLocusNames=TP04_0437;
1238 OS Theileria parva.
1239 OC Eukaryota; Alveolata; Apicomplexa; Piroplasmida; Theileriidae;
1240 OC Theileria.
1241 OX NCBI_TaxID=5875;
1242 RN [1]
1243 RP NUCLEOTIDE SEQUENCE [GENOMIC DNA].
1244 RC STRAIN=Muguga;
1245 RX MEDLINE=90158697; PubMed=1689460; DOI=10.1016/0166-6851(90)90007-9;
1246 RA Iams K.P., Young J.R., Nene V., Desai J., Webster P., Ole-Moiyoi O.K.,
1247 RA Musoke A.J.;
1248 RT "Characterisation of the gene encoding a 104-kilodalton microneme-
1249 RT rhoptry protein of Theileria parva.";
1250 RL Mol. Biochem. Parasitol. 39:47-60(1990).
1251 RN [2]
1252 RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
1253 RC STRAIN=Muguga;
1254 RX PubMed=15994558; DOI=10.1126/science.1110439;
1255 RA Gardner M.J., Bishop R., Shah T., de Villiers E.P., Carlton J.M.,
1256 RA Hall N., Ren Q., Paulsen I.T., Pain A., Berriman M., Wilson R.J.M.,
1257 RA Sato S., Ralph S.A., Mann D.J., Xiong Z., Shallom S.J., Weidman J.,
1258 RA Jiang L., Lynn J., Weaver B., Shoaibi A., Domingo A.R., Wasawo D.,
1259 RA Crabtree J., Wortman J.R., Haas B., Angiuoli S.V., Creasy T.H., Lu C.,
1260 RA Suh B., Silva J.C., Utterback T.R., Feldblyum T.V., Pertea M.,
1261 RA Allen J., Nierman W.C., Taracha E.L.N., Salzberg S.L., White O.R.,
1262 RA Fitzhugh H.A., Morzaria S., Venter J.C., Fraser C.M., Nene V.;
1263 RT "Genome sequence of Theileria parva, a bovine pathogen that transforms
1264 RT lymphocytes.";
1265 RL Science 309:134-137(2005).
1266 CC -!- SUBCELLULAR LOCATION: Cell membrane; lipid-anchor; GPI-anchor
1267 CC (Potential). In microneme/rhoptry complexes.
1268 CC -!- DEVELOPMENTAL STAGE: Sporozoite antigen.
1269 DR EMBL; M29954; AAA18217.1; -; Unassigned_DNA.
1270 DR EMBL; AAGK01000004; EAN31789.1; -; Genomic_DNA.
1271 DR PIR; A44945; A44945.
1272 DR InterPro; IPR007480; DUF529.
1273 DR Pfam; PF04385; FAINT; 4.
1274 KW Complete proteome; GPI-anchor; Lipoprotein; Membrane; Repeat; Signal;
1275 KW Sporozoite.
1276 FT SIGNAL 1 19 Potential.
1277 FT CHAIN 20 904 104 kDa microneme-rhoptry antigen.
1278 FT /FTId=PRO_0000046081.
1279 FT PROPEP 905 924 Removed in mature form (Potential).
1280 FT /FTId=PRO_0000232679.
1281 FT COMPBIAS 508 753 Pro-rich.
1282 FT COMPBIAS 880 883 Poly-Arg.
1283 FT LIPID 904 904 GPI-anchor amidated aspartate
1284 FT (Potential).
1285 SQ SEQUENCE 924 AA; 103626 MW; 289B4B554A61870E CRC64;
1286 MKFLILLFNI LCLFPVLAAD NHGVGPQGAS GVDPITFDIN SNQTGPAFLT AVEMAGVKYL
1287 QVQHGSNVNI HRLVEGNVVI WENASTPLYT GAIVTNNDGP YMAYVEVLGD PNLQFFIKSG
1288 DAWVTLSEHE YLAKLQEIRQ AVHIESVFSL NMAFQLENNK YEVETHAKNG ANMVTFIPRN
1289 GHICKMVYHK NVRIYKATGN DTVTSVVGFF RGLRLLLINV FSIDDNGMMS NRYFQHVDDK
1290 YVPISQKNYE TGIVKLKDYK HAYHPVDLDI KDIDYTMFHL ADATYHEPCF KIIPNTGFCI
1291 TKLFDGDQVL YESFNPLIHC INEVHIYDRN NGSIICLHLN YSPPSYKAYL VLKDTGWEAT
1292 THPLLEEKIE ELQDQRACEL DVNFISDKDL YVAALTNADL NYTMVTPRPH RDVIRVSDGS
1293 EVLWYYEGLD NFLVCAWIYV SDGVASLVHL RIKDRIPANN DIYVLKGDLY WTRITKIQFT
1294 QEIKRLVKKS KKKLAPITEE DSDKHDEPPE GPGASGLPPK APGDKEGSEG HKGPSKGSDS
1295 SKEGKKPGSG KKPGPAREHK PSKIPTLSKK PSGPKDPKHP RDPKEPRKSK SPRTASPTRR
1296 PSPKLPQLSK LPKSTSPRSP PPPTRPSSPE RPEGTKIIKT SKPPSPKPPF DPSFKEKFYD
1297 DYSKAASRSK ETKTTVVLDE SFESILKETL PETPGTPFTT PRPVPPKRPR TPESPFEPPK
1298 DPDSPSTSPS EFFTPPESKR TRFHETPADT PLPDVTAELF KEPDVTAETK SPDEAMKRPR
1299 SPSEYEDTSP GDYPSLPMKR HRLERLRLTT TEMETDPGRM AKDASGKPVK LKRSKSFDDL
1300 TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP
1301 KKPDSAYIPS ILAILVVSLI VGIL
1302 //
1303 ID 108_SOLLC Reviewed; 102 AA.
1304 AC Q43495;
1305 DT 15-JUL-1999, integrated into UniProtKB/Swiss-Prot.
1306 DT 01-NOV-1996, sequence version 1.
1307 DT 31-OCT-2006, entry version 37.
1308 DE Protein 108 precursor.
1309 OS Solanum lycopersicum (Tomato) (Lycopersicon esculentum).
1310 OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
1311 OC Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
1312 OC asterids; lamiids; Solanales; Solanaceae; Solanum; Lycopersicon.
1313 OX NCBI_TaxID=4081;
1314 RN [1]
1315 RP NUCLEOTIDE SEQUENCE [MRNA].
1316 RC STRAIN=cv. VF36; TISSUE=Anther;
1317 RX MEDLINE=94143497; PubMed=8310077; DOI=10.1104/pp.101.4.1413;
1318 RA Chen R., Smith A.G.;
1319 RT "Nucleotide sequence of a stamen- and tapetum-specific gene from
1320 RT Lycopersicon esculentum.";
1321 RL Plant Physiol. 101:1413-1413(1993).
1322 CC -!- TISSUE SPECIFICITY: Stamen- and tapetum-specific.
1323 CC -!- SIMILARITY: Belongs to the A9/FIL1 family.
1324 DR EMBL; Z14088; CAA78466.1; -; mRNA.
1325 DR PIR; S26409; S26409.
1326 DR InterPro; IPR013770; LPT_helical.
1327 DR InterPro; IPR003612; LTP/seed_store/tryp_amyl_inhib.
1328 DR Pfam; PF00234; Tryp_alpha_amyl; 1.
1329 DR SMART; SM00499; AAI; 1.
1330 KW Signal.
1331 FT SIGNAL 1 30 Potential.
1332 FT CHAIN 31 102 Protein 108.
1333 FT /FTId=PRO_0000000238.
1334 FT DISULFID 41 77 By similarity.
1335 FT DISULFID 51 66 By similarity.
1336 FT DISULFID 67 92 By similarity.
1337 FT DISULFID 79 99 By similarity.
1338 SQ SEQUENCE 102 AA; 10576 MW; CFBAA1231C3A5E92 CRC64;
1339 MASVKSSSSS SSSSFISLLL LILLVIVLQS QVIECQPQQS CTASLTGLNV CAPFLVPGSP
1340 TASTECCNAV QSINHDCMCN TMRIAAQIPA QCNLPPLSCS AN
1341 //
1342 """
1343
print "#########################################################"
print "# Sequence Input Tests                                  #"
print "#########################################################"

tests = [
    (aln_example, "clustal", 8, "HISJ_E_COLI",
     "MKKLVLSLSLVLAFSSATAAF-------------------AAIPQNIRIG" + \
     "TDPTYAPFESKNS-QGELVGFDIDLAKELCKRINTQCTFVENPLDALIPS" + \
     "LKAKKIDAIMSSLSITEKRQQEIAFTDKLYAADSRLVVAKNSDIQP-TVE" + \
     "SLKGKRVGVLQGTTQETFGNEHWAPKGIEIVSYQGQDNIYSDLTAGRIDA" + \
     "AFQDEVAASEGFLKQPVGKDYKFGGPSVKDEKLFGVGTGMGLRKED--NE" + \
     "LREALNKAFAEMRADGTYEKLAKKYFDFDVYGG---", True),
    (phy_example, "phylip", 8, "HISJ_E_COL", None, False),
    (nxs_example, "nexus", 8, "HISJ_E_COLI", None, True),
    (nxs_example2, "nexus", 10, "Frog",
     "ATGGCACACCCATCACAATTAGGTTTTCAAGACGCAGCCTCTCCAATTATAGAAGAATTA" + \
     "CTTCACTTCCACGACCATACCCTCATAGCCGTTTTTCTTATTAGTACGCTAGTTCTTTAC" + \
     "ATTATTACTATTATAATAACTACTAAACTAACTAATACAAACCTAATGGACGCACAAGAG" + \
     "ATCGAAATAGTGTGAACTATTATACCAGCTATTAGCCTCATCATAATTGCCCTTCCATCC" + \
     "CTTCGTATCCTATATTTAATAGATGAAGTTAATGATCCACACTTAACAATTAAAGCAATC" + \
     "GGCCACCAATGATACTGAAGCTACGAATATACTAACTATGAGGATCTCTCATTTGACTCT" + \
     "TATATAATTCCAACTAATGACCTTACCCCTGGACAATTCCGGCTGCTAGAAGTTGATAAT" + \
     "CGAATAGTAGTCCCAATAGAATCTCCAACCCGACTTTTAGTTACAGCCGAAGACGTCCTC" + \
     "CACTCGTGAGCTGTACCCTCCTTGGGTGTCAAAACAGATGCAATCCCAGGACGACTTCAT" + \
     "CAAACATCATTTATTGCTACTCGTCCGGGAGTATTTTACGGACAATGTTCAGAAATTTGC" + \
     "GGAGCAAACCACAGCTTTATACCAATTGTAGTTGAAGCAGTACCGCTAACCGACTTTGAA" + \
     "AACTGATCTTCATCAATACTA---GAAGCATCACTA------AGA", True),
    (nxs_example3, "nexus", 10, "Frog",
     'MAHPSQLGFQDAASPIMEELLHFHDHTLMAVFLISTLVLYIITIMMTTKLTNTNLMDAQE' + \
     'IEMVWTIMPAISLIMIALPSLRILYLMDEVNDPHLTIKAIGHQWYWSYEYTNYEDLSFDS' + \
     'YMIPTNDLTPGQFRLLEVDNRMVVPMESPTRLLVTAEDVLHSWAVPSLGVKTDAIPGRLH' + \
     'QTSFIATRPGVFYGQCSEICGANHSFMPIVVEAVPLTDFENWSSSML-EASL--', True),
    (faa_example, "fasta", 8, "HISJ_E_COLI",
     'mkklvlslslvlafssataafaaipqnirigtdptyapfesknsqgelvgfdidlakelc' + \
     'krintqctfvenpldalipslkakkidaimsslsitekrqqeiaftdklyaadsrlvvak' + \
     'nsdiqptveslkgkrvgvlqgttqetfgnehwapkgieivsyqgqdniysdltagridaa' + \
     'fqdevaasegflkqpvgkdykfggpsvkdeklfgvgtgmglrkednelrealnkafaemr' + \
     'adgtyeklakkyfdfdvygg', True),
    (sth_example, "stockholm", 5, "O31699/88-139",
     'EVMLTDIPRLHINDPIMK--GFGMVINN------GFVCVENDE', True),
    (sth_example2, "stockholm", 2, "AE007476.1",
     'AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGU' + \
     'UUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU', True),
    (gbk_example, "genbank", 1, "U49845.1", None, True),
    (gbk_example2, "genbank", 1, 'AAD51968.1',
     "MESTLGSDLARLVRVWRALIDHRLKPLELTQTHWVTLHNINRLPPEQSQIQLAKAIGIEQ" + \
     "PSLVRTLDQLEEKGLITRHTCANDRRAKRIKLTEQSSPIIEQVDGVICSTRKEILGGISP" + \
     "DEIELLSGLIDKLERNIIQLQSK", True),
    (gbk_example, "genbank-cds", 3, "AAA98667.1",
     'MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQFVPINRHPALIDYIEE' + \
     'LILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVDKDDQIITETEVFDEFRSS' + \
     'LNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNRRVDSLEEKAEIERDSNWVKC' + \
     'QEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEKLISGDDKILNGVYSQYEEGESI' + \
     'FGSLF', True),
    (swiss_example, "swiss", 3, "Q43495",
     "MASVKSSSSSSSSSFISLLLLILLVIVLQSQVIECQPQQSCTASLTGLNVCAPFLVPGSP" + \
     "TASTECCNAVQSINHDCMCNTMRIAAQIPAQCNLPPLSCSAN", True),
    ]

for (data, format, rec_count, last_id, last_seq, dict_check) in tests:

    print "%s file with %i records" % (format, rec_count)

    print "Bio.SeqIO.parse(handle)"

    # Basic check: turn the iterator into a list and inspect the last record.
    iterator = parse(StringIO(data), format=format)
    as_list = list(iterator)
    assert len(as_list) == rec_count, \
        "Expected %i records, found %i" \
        % (rec_count, len(as_list))
    assert as_list[-1].id == last_id, \
        "Expected '%s' as last record ID, found '%s'" \
        % (last_id, as_list[-1].id)
    if last_seq :
        assert as_list[-1].seq.tostring() == last_seq

    # Take the first record with next(), then loop over the remainder.
    iterator = parse(StringIO(data), format=format)
    count = 1
    record = iterator.next()
    assert record is not None
    assert isinstance(record, SeqRecord)

    for record in iterator :
        assert record.id == as_list[count].id
        assert record.seq.tostring() == as_list[count].seq.tostring()
        count = count + 1
    assert count == rec_count
    assert record is not None
    assert record.id == last_id

    # Iterate using only the next() method.
    iterator = parse(StringIO(data), format=format)
    count = 0
    while True :
        try :
            record = iterator.next()
        except StopIteration :
            break
        if record is None :
            break
        assert record.id == as_list[count].id
        assert record.seq.tostring() == as_list[count].seq.tostring()
        count = count + 1
    assert count == rec_count

    print "parse(...)"
    iterator = parse(StringIO(data), format=format)
    for (i, record) in enumerate(iterator) :
        assert record.id == as_list[i].id
        assert record.seq.tostring() == as_list[i].seq.tostring()
    assert i+1 == rec_count

    print "parse(handle to empty file)"
    iterator = parse(StringIO(""), format=format)
    assert len(list(iterator)) == 0

    if dict_check :
        print "to_dict(parse(...))"
        seq_dict = to_dict(parse(StringIO(data), format=format))
        assert Set(seq_dict.keys()) == Set([r.id for r in as_list])
        assert last_id in seq_dict
        assert seq_dict[last_id].seq.tostring() == as_list[-1].seq.tostring()

    if len(Set([len(r.seq) for r in as_list])) == 1 :
        # Every sequence has the same length, so the records can also be
        # loaded as an alignment.
        print "to_alignment(parse(handle))"
        alignment = to_alignment(parse(handle=StringIO(data), format=format))
        assert len(alignment._records) == rec_count
        assert alignment.get_alignment_length() == len(as_list[0].seq)
        for i in range(0, rec_count) :
            assert as_list[i].id == alignment._records[i].id
            assert as_list[i].id == alignment.get_all_seqs()[i].id
            assert as_list[i].seq.tostring() == alignment._records[i].seq.tostring()
            assert as_list[i].seq.tostring() == alignment.get_all_seqs()[i].seq.tostring()

    print "read(...)"
    if rec_count == 1 :
        record = read(StringIO(data), format)
        assert isinstance(record, SeqRecord)
    else :
        try :
            record = read(StringIO(data), format)
            assert False, "Should have failed"
        except ValueError :
            # read() is expected to reject anything but exactly one record.
            pass

    print

print "Checking phy <-> aln examples agree using list(parse(...))"

# Compare the records pairwise. The Clustal identifiers are cut to their
# first ten characters before being compared with the PHYLIP ones.
aln_list = list(parse(StringIO(aln_example), format="clustal"))
phy_list = list(parse(StringIO(phy_example), format="phylip"))
assert len(aln_list) == len(phy_list)
assert Set([r.id[0:10] for r in aln_list]) == Set([r.id for r in phy_list])
for i in range(0, len(aln_list)) :
    assert aln_list[i].id[0:10] == phy_list[i].id
    assert aln_list[i].seq.tostring() == phy_list[i].seq.tostring()

print "Checking nxs <-> aln examples agree using parse"

# Step both iterators in lockstep; they must run out at the same time.
aln_iter = parse(StringIO(aln_example), format="clustal")
nxs_iter = parse(StringIO(nxs_example), format="nexus")
while True :
    try :
        aln_record = aln_iter.next()
    except StopIteration :
        aln_record = None
    try :
        nxs_record = nxs_iter.next()
    except StopIteration :
        nxs_record = None
    if aln_record is None or nxs_record is None :
        assert aln_record is None
        assert nxs_record is None
        break
    assert aln_record.id == nxs_record.id
    assert aln_record.seq.tostring() == nxs_record.seq.tostring()

print "Checking faa <-> aln examples agree using to_dict(parse(...))"

aln_dict = to_dict(parse(StringIO(aln_example), format="clustal"))
faa_dict = to_dict(parse(StringIO(faa_example), format="fasta"))

ids = Set(aln_dict.keys())
assert ids == Set(faa_dict.keys())

for id in ids :
    # Strip gap characters from the aligned sequence and ignore case
    # before comparing it with the FASTA sequence.
    assert aln_dict[id].seq.tostring().upper().replace("-","") == \
           faa_dict[id].seq.tostring().upper()

print
print "#########################################################"
print "# Sequence Output Tests                                 #"
print "#########################################################"
print

general_output_formats = _FormatToWriter.keys()
alignment_formats = ["phylip","stockholm","clustal"]
for (in_data, in_format, rec_count, last_id, last_seq, unique_ids) in tests:
    if unique_ids :
        in_list = list(parse(StringIO(in_data), format=in_format))
        seq_lengths = [len(r.seq) for r in in_list]
        output_formats = general_output_formats[:]
        if min(seq_lengths) == max(seq_lengths) :
            output_formats.extend(alignment_formats)
            print "Checking conversion from %s (including to alignment formats)" % in_format
        else :
            print "Checking conversion from %s (excluding alignment formats)" % in_format
        for out_format in output_formats :
            print "Converting %s iterator -> %s" % (in_format, out_format)
            output = open("temp.txt","w")
            # Hand write() a fresh iterator (not a list) over the input records.
            iterator = parse(StringIO(in_data), format=in_format)
            try :
                write(iterator, output, out_format)
            except ValueError, e:
                print "FAILED: %s" % str(e)
                # Close the output handle, then try the next format instead.
                output.close()
                continue

            output.close()

            print "Checking %s <-> %s" % (in_format, out_format)
            handle = open("temp.txt","rU")
            out_list = list(parse(handle, format=out_format))
            handle.close()

            assert rec_count == len(out_list)
            if last_seq :
                assert last_seq == out_list[-1].seq.tostring()
            if out_format == "phylip" :
                assert last_id[0:10] == out_list[-1].id
            else :
                assert last_id == out_list[-1].id

            # Compare every record, not just the last one.
            for i in range(0, rec_count) :
                assert in_list[i].seq.tostring() == out_list[i].seq.tostring()
                if out_format == "phylip" :
                    assert in_list[i].id[0:10] == out_list[i].id
                else :
                    assert in_list[i].id == out_list[i].id
        print

print "#########################################################"
print "# SeqIO Tests finished                                  #"
print "#########################################################"
