Characterization of genes predicted by Glimmer

In the last post I used Glimmer to predict ORFs. Glimmer generated a list of possible ORFs, together with the start and end position of each ORF and the containing gene. I ran homology searches with BLASTX for all of the genes. The only setting changed from its default was the genetic code, which was set to Bacteria and Archaea (11). This post summarizes the findings of those homology searches.
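BLASTX translates the nucleotide query in all six reading frames with the chosen genetic code before comparing it to a protein database. The snippet below is a minimal sketch of that translation step only, not of BLASTX itself. It relies on the fact that genetic code 11 assigns the same amino acids to codons as the standard code and differs only in which codons may act as start codons. The function names and the short test sequence are made up for illustration.

```python
from itertools import product

# Codon-to-amino-acid mapping shared by the standard code and code 11.
# Codons are enumerated with the third base varying fastest, in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Translate a DNA sequence in three forward and three reverse frames."""
    seq = seq.upper()
    frames = {}
    for label, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


# Made-up toy sequence, not taken from ETEC p7.
for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```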

Glimmer predicted 132 ORFs. Judging by the start and end positions alone, one of them is most likely wrongly predicted: it starts at position 1090 and ends at 76526. Since this spans almost the entire genome, it is, with my basic knowledge of ORFs, genes and coding regions, almost certainly incorrect. There may be more incorrectly predicted genes, but these are not obvious at the level of knowledge we have gained so far. Some hypothetical proteins are, however, predicted more than once in almost the same positions, or run in opposite directions and overlap each other. These doublets could most likely be resolved by looking more closely at the gene sequences and comparing them to close relatives of ETEC p7 to work out which ones are correct, but due to time constraints this has not been done.
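The two checks described above (an ORF that spans most of the contig, and ORFs that overlap on opposite strands) can be sketched in a few lines of Python. This is only an illustration under assumptions: the 50% cutoff is arbitrary, the largest coordinate seen is used as a stand-in for the contig length, and "long" is a placeholder label because the number of the oversized ORF is not given in this post. The coordinates themselves come from this post.

```python
from itertools import combinations

# (orf_id, start, end); start > end means the ORF is on the reverse strand.
ORFS = [
    ("1", 3776, 1134),
    ("4", 5222, 5746),
    ("5", 6220, 5165),
    ("long", 1090, 76526),   # placeholder label for the near-genome-length ORF
    ("132", 76529, 75516),
]


def span(start, end):
    """Return the (low, high) coordinates regardless of strand."""
    return min(start, end), max(start, end)


def strand(start, end):
    """'+' if the ORF runs forward, '-' if start lies after end."""
    return "+" if start <= end else "-"


def flag_oversized(orfs, max_fraction=0.5):
    """IDs of ORFs covering more than max_fraction of the observed coordinate range."""
    contig_span = max(max(s, e) for _, s, e in orfs)  # lower bound on contig length
    return [oid for oid, s, e in orfs if abs(e - s) / contig_span > max_fraction]


def opposite_strand_overlaps(orfs):
    """Pairs of ORFs on opposite strands that overlap, with the overlap size."""
    hits = []
    for (id_a, s_a, e_a), (id_b, s_b, e_b) in combinations(orfs, 2):
        lo_a, hi_a = span(s_a, e_a)
        lo_b, hi_b = span(s_b, e_b)
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap > 0 and strand(s_a, e_a) != strand(s_b, e_b):
            hits.append((id_a, id_b, overlap))
    return hits


oversized = flag_oversized(ORFS)
kept = [orf for orf in ORFS if orf[0] not in oversized]
print("oversized:", oversized)
print("opposite-strand overlaps:", opposite_strand_overlaps(kept))
```

On this subset the oversized ORF is flagged first and removed, after which ORFs 4 and 5 are reported as an overlapping pair on opposite strands. A check like this only points out candidates; deciding which member of a pair is real still needs the sequence comparison described above.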

Most of the predicted genes were classified as hypothetical proteins, and are thus of unknown function, but most of these hypothetical proteins have homologs in bacteriophages such as ECBP2, ECBP3, phi32 and others.

Out of the 131 ORFs (not counting the ORF that covered almost the entire genome), 34 (about 26%) encoded proteins of known function. These are listed in the table below. For three of the 131 ORFs no matches could be found with BLASTX. Since these may also be worth a closer look, they are listed in the table as well. An image of the annotated genome can be seen further down in this post.

ORF/Gene | Position (start-stop) | Length (bp) | Predicted function
1 | 3776-1134 | 2642 | putative tail fiber
3 | 5153-4641 | 512 | bacterial Ig-like domain
4 | 5222-5746 | 524 | major head protein forward
5 | 6220-5165 | 1055 | major head protein reverse
8 | 9835-7592 | 2243 | putative portal protein
9 | 11454-9901 | 1553 | Bacteriophage terminase large (ATPase) subunit
14 | 14950-15063 | 113 | Rossmann fold nucleotide-binding protein (very low score on BLASTX)
15 | 15066-15338 | 272 | Putative protein (no matches with BLASTX)
17 | 16124-16318 | 194 | Putative protein (no matches with BLASTX)
51 | 29229-29699 | 470 | HNH homing endonuclease # Phage intron
61 | 36552-36983 | 431 | Gamma-glutamyl cyclotransferase
63 | 37227-39002 | 1775 | primase/helicase
64 | 38996-39544 | 548 | DNA polymerase
66 | 41016-41549 | 533 | deoxycytidine triphosphate deaminase
70 | 42403-42930 | 527 | putative integrase
77 | 44316-44963 | 647 | putative thymidylate synthase protein
78 | 44930-45097 | 167 | putative NAD+ diphosphatase
81 | 45617-45835 | 218 | NAD-dependent DNA ligase
83 | 46004-46744 | 740 | putative PhoH-like protein
84 | 46753-46908 | 155 | Putative protein (no matches with BLASTX)
89 | 48562-50277 | 1715 | DNA polymerase
90 | 50268-50744 | 476 | homing endonuclease
91 | 50861-51016 | 155 | DNA_pol_A_pol_I_B
105 | 53464-53967 | 503 | putative serine/threonine protein
109 | 55750-56136 | 386 | HNH endonuclease
113 | 56596-57246 | 650 | RNA polymerase sigma factor SigX
115 | 57811-58653 | 842 | exonuclease
118 | 59212-59511 | 299 | WYL domain
119 | 64185-59740 | 4445 | chromosome segregation protein
122 | 67968-66916 | 1052 | hemagglutinin protein
123 | 68757-67978 | 779 | internal virion protein
126 | 70789-69821 | 968 | putative tail fiber protein
127 | 73857-70837 | 3020 | bacterial surface protein
128 | 74670-73867 | 803 | putative baseplate protein
129 | 75174-74683 | 491 | Phage-related lysozyme (muramidase)
130 | 75426-75208 | 218 | putative holin
132 | 76529-75516 | 1013 | tail fiber protein
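Reading the table takes a little care, because the strand is only implied by the order of the coordinates, and the Length column is the difference between the two positions rather than the inclusive number of bases. The sketch below is my own illustration, with made-up function and field names, of how to derive both for a few rows copied from the table.

```python
# (orf, start, stop, listed_length) for three rows copied from the table above.
ROWS = [
    (1, 3776, 1134, 2642),
    (4, 5222, 5746, 524),
    (119, 64185, 59740, 4445),
]


def annotate(start, stop):
    """Derive strand and size information from a start/stop coordinate pair."""
    difference = abs(stop - start)   # the value listed in the Length column
    inclusive_bp = difference + 1    # both end positions count as bases
    return {
        "strand": "-" if start > stop else "+",
        "difference": difference,
        "inclusive_bp": inclusive_bp,
        "codons": inclusive_bp // 3,            # includes the stop codon
        "whole_codons": inclusive_bp % 3 == 0,  # expected for a complete ORF
    }


for orf, start, stop, listed in ROWS:
    info = annotate(start, stop)
    assert info["difference"] == listed
    print(orf, info)
```

For these three rows the inclusive length is a multiple of three, as expected for a complete ORF, so the listed lengths are one base short of the true ORF size.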

The contig starts and ends with putative tail proteins. Since the start and end of the genome are at the terminal repeats, it is probable that the two tail protein sequences are either part of the same sequence or two different tail proteins that are located directly after each other. This can be investigated further by finding the correct arrangement of the genome. PhageTerm is software that can predict the correct arrangement based on the terminal repeats. The question can also be investigated by studying homologs of ETEC p7.

These sequences of known protein function can be divided into four categories: structural proteins (e.g. tail fiber protein, major head protein, portal protein), replication-assisting proteins (e.g. DNA polymerase, NAD-dependent DNA ligase), nucleotide metabolism proteins (e.g. deoxycytidine triphosphate deaminase) and other proteins (e.g. PhoH-like protein).
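A grouping like this can be roughed out automatically with a keyword lookup on the predicted function. The sketch below is only an illustration: the four category names come from the paragraph above, but the keyword lists are my own assumption and would need manual review, since a simple substring match can easily misplace an annotation.

```python
# Keyword lists are assumptions for illustration, not a curated classification.
CATEGORY_KEYWORDS = {
    "structural": ("tail", "head", "portal", "baseplate", "virion"),
    "replication": ("dna polymerase", "ligase", "primase", "helicase"),
    "nucleotide metabolism": ("deaminase", "thymidylate"),
}


def categorize(function):
    """Return the first category whose keyword occurs in the function text."""
    text = function.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"


examples = [
    "putative tail fiber",
    "NAD-dependent DNA ligase",
    "deoxycytidine triphosphate deaminase",
    "putative PhoH-like protein",
]
for function in examples:
    print(f"{function}: {categorize(function)}")
```

Anything that matches no keyword falls into "other", so that group doubles as a to-do list for annotations that still need a manual decision.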

The large number of proteins that play a part in replication and recombination suggests that ETEC p7 is a virulent bacteriophage. In this respect it is of particular interest that the genome contains a gene encoding a lysozyme. Since these enzymes lyse bacterial cell walls, they are of specific interest in the research of new antibiotics. With more time, this gene should have been compared in more detail with potential homologs to determine its novelty.

Annotation of bacteriophage ETEC p7
Annotation of the genome of ETEC p7. Sequences with proteins of known function are in orange. Hypothetical proteins are in green. The terminal repeat is in red. The three sequences with no matches with BLASTX are in purple. Regions of low coverage are in dark red (ends of the contig) and regions of high coverage are in yellow. Graph generated with Geneious version 2019.0 created by Biomatters. Available from