Discussion about future direction for this project

Because the project deadline is approaching, there is no time left for additional or deeper analysis of the assembled genome of the ETEC p7 bacteriophage. This post summarizes the analyses that remain and recommends approaches for further characterization of the genome.

The genes that were found with Glimmer and characterized through homology searches can be characterized further, for example by using BLASTP to compare their protein sequences with those of other phages. This would give a sense of how similar or novel the ETEC p7 proteins are relative to other phage species.

The genome itself also needs further characterization. We made a preliminary prediction of the positions of the ORFs and genes, but Glimmer and ORFfinder made somewhat different predictions, so the exact position of every gene and ORF remains to be determined. One way of doing this would be to find the promoters, Shine-Dalgarno sequences and transcriptional terminators. Locating these elements should make it possible to assess whether the predicted gene positions are correct or need to be adjusted.
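
As an illustration of how a predicted start position could be checked, here is a minimal Python sketch that looks for a Shine-Dalgarno-like motif just upstream of a start codon. The function names, the `AGGAGG` consensus as the search motif, the 20-base window and the one-mismatch tolerance are all assumptions chosen for this example, not settings we used in the project. It only handles genes on the forward strand.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_sd_upstream(genome: str, start: int, motif: str = "AGGAGG",
                     window: int = 20, max_mismatches: int = 1):
    """Return (index, matched_sequence, mismatches) for every motif hit that
    lies entirely within `window` bases upstream of the start codon.

    `start` is the 0-based index of the first base of a forward-strand
    start codon. All default values are illustrative assumptions.
    """
    genome = genome.upper()
    lo = max(0, start - window)
    hits = []
    # The last valid window ends right before the start codon.
    for i in range(lo, start - len(motif) + 1):
        candidate = genome[i:i + len(motif)]
        d = hamming(candidate, motif)
        if d <= max_mismatches:
            hits.append((i, candidate, d))
    return hits


# Toy sequence (not ETEC p7 data): motif at index 4, start codon at index 17.
toy = "CCCCAGGAGGTTTTTTTATGAAA"
print(find_sd_upstream(toy, start=17))
```

If Glimmer and ORFfinder disagree on a start codon, running a check like this on both candidate positions is one way to see which of them has a plausible ribosome binding site in front of it.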

We made an attempt to search for promoters in the genome, but three different software tools gave three completely different results. Searching for promoters turned out to be difficult and can be time consuming. Phage promoters are either identical or very closely related to the host promoters (so that the host RNA polymerase will bind to them), or they are specific to the phage. The first approach is therefore to search the genome for the promoter sequences of the bacterial host; if no exact match is found, mismatches can be allowed. If the promoters are specific to the phage they become very difficult to find, but one approach is to look in the untranslated regions (UTRs) of the genome.
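
The idea of searching for host promoter sequences while allowing mismatches can be sketched as follows. This is a minimal Python example, assuming the E. coli sigma-70 consensus hexamers (`TTGACA` at -35 and `TATAAT` at -10) separated by a spacer of 16 to 18 bases. The function names, the spacer range and the mismatch limit are assumptions for illustration, and only the forward strand is scanned.

```python
def mismatches(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def scan_sigma70(genome: str, minus35: str = "TTGACA", minus10: str = "TATAAT",
                 spacers=(16, 17, 18), max_mismatches: int = 2):
    """Return (pos35, pos10, spacer, total_mismatches) for candidate promoters.

    Positions are 0-based indices of the first base of each hexamer.
    `max_mismatches` is the total allowed over both hexamers combined.
    """
    genome = genome.upper()
    n = len(genome)
    hits = []
    for i in range(n - len(minus35) + 1):
        d35 = mismatches(genome[i:i + len(minus35)], minus35)
        if d35 > max_mismatches:
            continue  # the -35 box alone already exceeds the budget
        for s in spacers:
            j = i + len(minus35) + s
            if j + len(minus10) > n:
                continue  # the -10 box would run past the end of the sequence
            d10 = mismatches(genome[j:j + len(minus10)], minus10)
            if d35 + d10 <= max_mismatches:
                hits.append((i, j, s, d35 + d10))
    return hits


# Toy sequence (not ETEC p7 data) with a perfect promoter and a 17-base spacer.
toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
print(scan_sigma70(toy, max_mismatches=0))
```

A simple consensus scan like this will produce many false positives on a real genome as the mismatch limit is raised, which may be part of the reason why different tools gave us such different results. Running it on the reverse complement as well would be needed to cover both strands.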

We also made an attempt to find the transcriptional terminators, but the number of predicted promoters was not even close to matching the number of predicted terminators. A lot more time and effort is therefore needed to elucidate the promoter, Shine-Dalgarno and terminator sequences of this genome.

The larger intergenic gaps should be investigated further for ORFs that Glimmer might have missed, for example through homology searches with BLASTX or by searching databases of unfinished microbial genomes. There is a large gap from about position 16300 to about 17600 that could potentially hold more ORFs.
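
Listing the larger gaps systematically is straightforward once the predicted gene coordinates are available. The sketch below is a minimal Python example; the function name, the 300-base minimum gap size and the coordinates in the usage example are hypothetical and are not the actual ETEC p7 predictions. It treats the genome as linear and expects 1-based, inclusive coordinates.

```python
def intergenic_gaps(genes, genome_length: int, min_gap: int = 300):
    """Return (start, end) of every gene-free region of at least `min_gap` bases.

    `genes` is a list of (start, end) pairs, 1-based and inclusive, in any
    order and on either strand. Overlapping genes are handled by tracking
    the furthest gene end seen so far.
    """
    # Normalize so that start <= end, then sort by start position.
    intervals = sorted((min(s, e), max(s, e)) for s, e in genes)
    gaps = []
    prev_end = 0
    for start, end in intervals:
        if start - prev_end - 1 >= min_gap:
            gaps.append((prev_end + 1, start - 1))
        prev_end = max(prev_end, end)
    # Region between the last gene and the end of the genome.
    if genome_length - prev_end >= min_gap:
        gaps.append((prev_end + 1, genome_length))
    return gaps


# Hypothetical coordinates, for illustration only.
example_genes = [(1, 100), (500, 900), (950, 1200)]
print(intergenic_gaps(example_genes, genome_length=2000))
```

Each reported region could then be extracted from the genome and submitted to BLASTX to see whether it encodes something that the gene finder missed.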

As ETEC p7 has a genome consisting of double-stranded DNA, it belongs to the order Caudovirales, but we have not been able to obtain any definitive information about which family it belongs to. Our best guess at the moment is the family Podoviridae, since some of the apparently closest relatives of ETEC p7, such as SU10 and phiEco32, are Podoviridae. For the same reason we suspect that ETEC p7 has C3 morphology. However, bacteriophages evolve fast and can acquire DNA horizontally from both other phages and their hosts, so phage genomes are mosaics and it is not possible to rely only on the close relationships suggested by homology searches. A definitive answer requires studying the structural proteins of the virion with different types of electron microscopy, so that visual assessments can be made. Furthermore, predicting the secondary structure of the scaffolding proteins can also give clues to the morphology of the bacteriophage, as described in the paper by Mirzaei et al. Secondary structure can be predicted from protein sequences with, for example, PSIPRED and JPred.

Lastly, a phylogenetic analysis needs to be conducted. This requires knowing which features researchers use to build phylogenetic trees of phages. With our very basic knowledge of the topic, the most important features appear to be the scaffolding proteins and head proteins. A study is therefore needed in which these structural proteins of ETEC p7 are compared with the corresponding proteins of other bacteriophages.

Introduction to P7 phages and determining nature of chromosome ends

Research and background

The P7 bacteriophage belongs to the order Caudovirales, whose members contain a single linear double-stranded DNA (dsDNA) molecule and have a tail. This order has three known families: Siphoviridae, Myoviridae and Podoviridae. The families differ in the type of tail they have. The P7 bacteriophage belongs to the Myoviridae, which have a complex contractile tail. The mechanisms for DNA replication and packaging into the procapsid can differ between species of Caudovirales. Analyzing and determining the nature of the chromosome ends can shed light on the replication strategy of the bacteriophage.

Caudovirales have six known types of termini. Phages use these different termini to recognize their own DNA rather than the DNA of their host. Most phages of this order package DNA into a procapsid from concatemeric (repeating) DNA molecules that are frequently the result of rolling-circle replication. For P7 bacteriophages (which belong to the species of P1 bacteriophages) the packaging mechanism is called headful packaging and uses a pac site. The pac site is where the terminase initiates packaging. This leads to phage chromosomes that are terminally redundant and circularly permuted. An analysis of the termini should confirm this.

After some research it seems there are two approaches to characterizing the termini of phages. The first one, which was also recommended by Professor Nilsson, is to use the software Geneious to look for regions of higher coverage. Since the terminal ends are repeats, these regions are expected to have higher coverage than the rest of the genome. This should be combined with a comparison of the phage genome to a similar bacteriophage that has already been characterized, in order to pinpoint the terminal repeats.
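
The coverage idea can be sketched in a few lines of Python. This minimal example assumes that a per-base read depth is already available as a list of numbers (for example exported from a read mapping) and flags runs where the depth is well above the genome-wide median. The function name, the factor of 1.8 and the minimum run length of 50 bases are assumptions for illustration and would need tuning on real data. This is a naive threshold, not the statistical method that a dedicated tool would use.

```python
from statistics import median


def high_coverage_regions(coverage, factor: float = 1.8, min_length: int = 50):
    """Return (start, end) of runs where depth >= factor * median depth.

    `coverage` holds one depth value per base; positions in the result are
    1-based and inclusive. Runs shorter than `min_length` are ignored.
    """
    if not coverage:
        return []
    threshold = factor * median(coverage)
    regions = []
    run_start = None
    for pos, depth in enumerate(coverage, start=1):
        if depth >= threshold:
            if run_start is None:
                run_start = pos  # a new high-coverage run begins here
        elif run_start is not None:
            if pos - run_start >= min_length:
                regions.append((run_start, pos - 1))
            run_start = None
    # Close a run that extends to the last base.
    if run_start is not None and len(coverage) - run_start + 1 >= min_length:
        regions.append((run_start, len(coverage)))
    return regions


# Synthetic depth profile (not real data): a 200-base region at about 2x depth.
toy_depth = [100] * 1000 + [210] * 200 + [100] * 1000
print(high_coverage_regions(toy_depth))
```

On real data the depth is noisy, so the boundaries found this way are only approximate and still need to be checked by eye or against a characterized relative, as described above.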

The second approach is to use the software PhageTerm. This software is freely available and uses the same principle as described above: it looks for regions with a significantly higher number of reads than the rest of the genome. The advantage is that, unlike Geneious, which requires experience to determine the termini, PhageTerm uses a theoretical and statistical framework to determine the terminal repeats. Other advantages of PhageTerm are that it has been specifically investigated with Illumina technologies, tested with a range of de novo assembled bacteriophages, and developed for dsDNA bacteriophages.


PhageTerm is developed by researchers at the Pasteur Institute, and the institute also hosts PhageTerm as a Galaxy wrapper. This instance of PhageTerm was used to analyze the termini of the assembled phage genome. The paired-end data and the assembled genome were given as inputs, with the default settings (seed length = 20, peak surrounding region = 20, limit coverage = 250). This resulted in a report [PDF] that put the starting position of the terminal repeats at 13344 and the ending position at 13592, which makes the terminal repeats 248 bp long. PhageTerm also classifies the ends as redundant and non-permuted. If I understand the report correctly, it identifies the genome as belonging to a T7 bacteriophage, but this needs to be discussed with Professor Nilsson, since the information we were given was that the genome should belong to a P7 bacteriophage. The difference between P1/P7 bacteriophages and T7 bacteriophages is that the chromosome ends of P1/P7 phages are permuted, while the chromosome ends of T7 phages are not.

PhageTerm also generates a file containing the phage genome sequence reorganized according to the termini positions. It is unclear whether we should proceed with this reorganized genome file or continue with the genome as assembled by SPAdes. This also needs to be discussed with Professor Nilsson.

The starting and ending positions of the terminal repeats, as identified by the increased number of reads between these positions.
Summary of the analysis performed by PhageTerm.