In the last post I used Glimmer to predict ORFs. Glimmer generated a list of possible ORFs together with the start and end position of each ORF and the containing gene. I then ran homology searches with BLASTX for all of the genes. The only setting I changed was the genetic code, which was set to Bacteria and Archaea (11); all other settings were left at their defaults. In this post I summarize the findings of the homology searches.
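For anyone who wants to repeat the search from a script rather than a web form, the single non-default setting could be expressed roughly as below. This is only a sketch: the post does not say how BLASTX was run, and the file names `glimmer_orfs.fasta` and `blastx_hits.tsv`, the `nr` database and the tabular output format are my assumptions. The flags themselves (`-query_gencode`, `-outfmt`, `-remote`) are standard BLAST+ `blastx` options.

```python
import shlex


def blastx_command(query_fasta, db="nr", genetic_code=11,
                   out="blastx_hits.tsv", remote=True):
    """Build a BLAST+ blastx command line as a list of arguments.

    Only the genetic code differs from the defaults, mirroring the post.
    File names and database are illustrative assumptions.
    """
    cmd = [
        "blastx",
        "-query", query_fasta,                # nucleotide ORFs to translate
        "-db", db,                            # protein database to search
        "-query_gencode", str(genetic_code),  # 11 = Bacteria and Archaea
        "-outfmt", "6",                       # tab-separated hit table
        "-out", out,
    ]
    if remote:
        cmd.append("-remote")                 # search on NCBI's servers
    return cmd


# Print the command instead of running it, so nothing is submitted by accident.
print(shlex.join(blastx_command("glimmer_orfs.fasta")))
```

Passing the returned list to `subprocess.run` would execute it, provided BLAST+ is installed.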
Glimmer predicted 132 ORFs. Judging by the start and end positions alone, one of them is most likely wrongly predicted: it starts at position 1090 and ends at 76526. Since this spans almost the entire genome, it is, even with my basic knowledge of ORFs, genes and coding regions, almost certainly an incorrect prediction. More genes may be incorrectly predicted, but these are not obvious to point out with the level of knowledge we have gained so far. For example, some hypothetical proteins are predicted more than once at almost the same positions, or run in opposite directions and overlap each other. These cases could most likely be resolved by looking more closely at the gene sequences and comparing them to close relatives of ETEC p7 to work out which member of each doublet is the correct one, but due to time constraints this has not been done.
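The two checks described above, an ORF covering most of the genome and pairs of ORFs sitting on nearly the same positions, can be sketched in a few lines. This is not what was done for this post, where I only inspected the list by eye. The thresholds (`max_fraction`, `min_overlap`), the label `"whole"` for the long ORF and the genome length of 77,000 bp are placeholders I made up for illustration.

```python
from itertools import combinations


def flag_suspicious_orfs(orfs, genome_length, max_fraction=0.5, min_overlap=0.8):
    """Flag ORFs that are implausibly long or that largely overlap another ORF.

    orfs maps an ORF id to (start, stop); start > stop means reverse strand.
    Returns (too_long_ids, overlapping_id_pairs).
    """
    def span(start, stop):
        return min(start, stop), max(start, stop)

    too_long = [
        orf_id for orf_id, (start, stop) in orfs.items()
        if (abs(stop - start) + 1) / genome_length > max_fraction
    ]

    overlapping = []
    for (id_a, pos_a), (id_b, pos_b) in combinations(orfs.items(), 2):
        if id_a in too_long or id_b in too_long:
            continue  # the genome-spanning ORF overlaps everything
        lo_a, hi_a = span(*pos_a)
        lo_b, hi_b = span(*pos_b)
        shared = min(hi_a, hi_b) - max(lo_a, lo_b) + 1
        shorter = min(hi_a - lo_a, hi_b - lo_b) + 1
        if shared > 0 and shared / shorter >= min_overlap:
            overlapping.append((id_a, id_b))
    return too_long, overlapping


# Coordinates for ORFs 3, 4 and 5 come from the table in this post.
# "whole" is a made-up label; genome_length is a placeholder, not a measured value.
example = {
    "whole": (1090, 76526),
    "3": (5153, 4641),
    "4": (5222, 5746),
    "5": (6220, 5165),
}
too_long, overlapping = flag_suspicious_orfs(example, genome_length=77_000)
print("too long:", too_long)
print("overlapping pairs:", overlapping)
```

With these inputs the long ORF is flagged, and ORFs 4 and 5 (the forward and reverse major head protein predictions) come out as an overlapping pair.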
On the topic of hypothetical proteins: most of the predicted genes were classified as hypothetical proteins, and are thus of unknown function, but most of them have homologs in bacteriophages such as ECBP2, ECBP3 and phi32.
Of the remaining 131 ORFs (not counting the ORF that covered almost the entire genome), 34 (about 26%) encode proteins of known function. These are listed in the table below. For three of the 131 ORFs, no matches could be found with BLASTX. Since these might also be worth a closer look, they are listed in the table as well. A graph image of the annotated genome can be seen further down in this post.
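The percentages follow directly from the counts given above. As a quick check of the arithmetic (the variable names are mine; the "remaining" count is simply what is left over, which I assume corresponds to the hypothetical-protein hits):

```python
total_predicted = 132   # ORFs predicted by Glimmer
spanning_orf = 1        # the ORF covering almost the whole genome
known_function = 34     # BLASTX hit with a known protein function
no_blastx_match = 3     # no BLASTX hit at all

considered = total_predicted - spanning_orf
known_pct = 100 * known_function / considered
no_match_pct = 100 * no_blastx_match / considered

# Assumption: everything else was a hit to a hypothetical protein.
remaining = considered - known_function - no_blastx_match

print(f"ORFs considered: {considered}")
print(f"known function:  {known_function} ({known_pct:.1f}%)")
print(f"no BLASTX match: {no_blastx_match} ({no_match_pct:.1f}%)")
print(f"remaining:       {remaining}")
```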
|ORF/Gene||Position (start-stop)||Length (bp)||Predicted function|
|1||3776-1134||2642||putative tail fiber|
|3||5153-4641||512||bacterial Ig-like domain|
|4||5222-5746||524||major head protein forward|
|5||6220-5165||1055||major head protein reverse|
|8||9835-7592||2243||putative portal protein|
|9||11454-9901||1553||Bacteriophage terminase large (ATPase) subunit|
|14||14950-15063||113||Rossmann fold nucleotide-binding protein (very low score on BLASTX)|
|15||15066-15338||272||Putative protein (no matches with BLASTX)|
|17||16124-16318||194||Putative protein (no matches with BLASTX)|
|51||29229-29699||470||HNH homing endonuclease # Phage intron|
|66||41016-41549||533||deoxycytidine triphosphate deaminase|
|77||44316-44963||647||putative thymidylate synthase protein|
|78||44930-45097||167||putative NAD+ diphosphatase|
|81||45617-45835||218||NAD-dependent DNA ligase|
|83||46004-46744||740||putative PhoH-like protein|
|84||46753-46908||155||Putative protein (no matches with BLASTX)|
|105||53464-53967||503||putative serine/threonine protein|
|113||56596-57246||650||RNA polymerase sigma factor SigX|
|119||64185-59740||4445||chromosome segregation protein|
|123||68757-67978||779||internal virion protein|
|126||70789-69821||968||putative tail fiber protein|
|127||73857-70837||3020||bacterial surface protein|
|128||74670-73867||803||putative baseplate protein|
|129||75174-74683||491||Phage-related lysozyme (muramidase)|
|132||76529-75516||1013||tail fiber protein|
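A note on reading the table: when the start position is larger than the stop position, the gene lies on the reverse strand, and the length column is the difference between the two positions. A small sketch of how a row could be parsed to make that explicit is shown below. The function name `parse_row` and the returned field names are my own; the three rows are copied from the table.

```python
def parse_row(row):
    """Parse one '|id||start-stop||length||function|' row of the table."""
    orf_id, position, length, function = row.strip().strip("|").split("||")
    start, stop = (int(x) for x in position.split("-"))
    return {
        "orf": int(orf_id),
        "start": start,
        "stop": stop,
        "strand": "+" if start < stop else "-",  # start > stop: reverse strand
        "length": int(length),
        "function": function,
    }


rows = [
    "|1||3776-1134||2642||putative tail fiber|",
    "|4||5222-5746||524||major head protein forward|",
    "|5||6220-5165||1055||major head protein reverse|",
]

for row in rows:
    gene = parse_row(row)
    # The table's length is |stop - start|; counting both end positions adds one.
    assert gene["length"] == abs(gene["stop"] - gene["start"])
    inclusive = gene["length"] + 1
    print(gene["orf"], gene["strand"], inclusive, "bp =", inclusive // 3, "codons")
```

Counting both end positions, each of these ORFs has a length that is a multiple of three, as expected for a coding region.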
The contig starts and ends with putative tail proteins. Since the start and end of the genome are at the terminal repeats, it is probable that the two tail protein sequences are either part of the same sequence or two different tail proteins that are located directly after each other. This can be investigated further by finding the correct arrangement of the genome. PhageTerm is software that can predict the correct arrangement based on the terminal repeats. Researching homologs of ETEC p7 is another way to investigate this.
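A very rough first check, before reaching for a dedicated tool, is to see whether the beginning of the contig is literally repeated at its end. The sketch below does only that, on a toy sequence I made up. It is not how PhageTerm works (PhageTerm analyses the sequencing reads), and it only catches exact repeats.

```python
def terminal_repeat_length(contig, min_len=10):
    """Length of the longest sequence present at both the very start and
    the very end of the contig (exact matches only); 0 if none is found."""
    best = 0
    for k in range(min_len, len(contig) // 2 + 1):
        if contig[:k] == contig[-k:]:
            best = k
    return best


# Toy contig: a 10-base repeat at both ends around an unrelated middle part.
repeat = "ACGTACGTAC"
toy_contig = repeat + "TTTTGGGGCC" + repeat
print(terminal_repeat_length(toy_contig))
```

If a real contig shows a long repeat here, the two tail fiber predictions at its ends could be pieces of the same gene.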
These sequences of known protein function can be divided into four categories: structural proteins (e.g. tail fiber protein, major head protein, portal protein), replication-assisting proteins (e.g. DNA polymerase, NAD-dependent DNA ligase), nucleotide metabolism proteins (e.g. deoxycytidine triphosphate deaminase) and other proteins (e.g. PhoH-like protein).
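I sorted the genes into these categories by hand. If the list were longer, a simple keyword lookup on the predicted function could give a first pass, as sketched below. The keyword lists are my own guesses based on the examples above, not an established classification, and anything that matches no keyword ends up in "other".

```python
# Keyword lists are illustrative assumptions, not a curated vocabulary.
CATEGORY_KEYWORDS = {
    "structural": ["tail", "head", "portal", "baseplate", "virion"],
    "replication": ["dna polymerase", "dna ligase"],
    "nucleotide metabolism": ["deaminase", "thymidylate synthase"],
}


def categorize(predicted_function):
    """Return the first category whose keyword occurs in the description."""
    text = predicted_function.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"


for function in [
    "putative tail fiber",
    "NAD-dependent DNA ligase",
    "deoxycytidine triphosphate deaminase",
    "putative PhoH-like protein",
]:
    print(f"{function}: {categorize(function)}")
```

A lookup like this will misplace some proteins, so the result would still need to be checked by hand.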
The large number of proteins that play a part in the replication and recombination processes suggests that ETEC p7 is a virulent bacteriophage. In this context it is of particular interest that the genome contains a gene encoding a lysozyme. Since these enzymes break down bacterial cell walls and thereby cause lysis, they are of specific interest in the research of new antibiotics. With more time, this gene should have been compared with potential homologs in more detail to determine its novelty.