Characterization of genes predicted by Glimmer

In the last post I used Glimmer to predict ORFs. Glimmer generated a list of possible ORFs, together with the start and end positions of each ORF and of the gene it contains. I did homology searches with BLASTX for all of the genes. The only setting that was changed was the genetic code, which was set to Bacteria and Archaea (11); the rest of the settings were left at their defaults. In this post I will summarize the findings of the homology searches.

Glimmer predicted 132 ORFs. Just by looking at the start and end positions there is one that is most likely wrongly predicted: it starts at position 1090 and ends at 76526. Since this is almost the entire genome it should, with my basic knowledge about ORFs, genes and coding regions, be an incorrect prediction. There may be more genes that are incorrectly predicted, but these are not obvious to point out with the level of knowledge we have gained so far, even though there are some hypothetical proteins that are predicted more than once in almost the same positions, or running in opposite directions and overlapping each other. These could most likely be resolved by looking more closely at the sequences of the genes and comparing them to close relatives of ETEC p7 to figure out which member of each doublet is the correct one, but due to time constraints this has not been done.
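
A quick way to spot such outliers is to compute the length of every prediction and flag anything implausibly large. The sketch below uses two made-up lines in the format of Glimmer's output (id, start, end, frame, score); the coordinates are the ones mentioned above.

```shell
cat > predict.sample <<'EOF'
orf00001     1090    76526  +1    10.50
orf00002     3776     1134  -1    12.30
EOF
# flag any ORF spanning more than half of the ~76.5 kb genome
awk -v genome=76529 '{
  len = ($3 > $2 ? $3 - $2 : $2 - $3) + 1
  if (len > genome / 2) print $1, "suspicious length:", len " bp"
}' predict.sample    # prints: orf00001 suspicious length: 75437 bp
```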

On the topic of hypothetical proteins, most of the predicted genes were classified as hypothetical proteins, and thus of unknown protein function, but most of these hypothetical proteins have homologs in bacteriophages like ECBP2, ECBP3, phi32 and others.

Out of the 131 ORFs (not counting the ORF that covered almost the entire genome), 34 (about 26%) encode proteins of known function. These have been listed in the table below. For three of the 131 ORFs no matches could be found with BLASTX. Since these might also be of interest for further study they have also been listed in the table. A graph image of the annotated genome can be seen further down in this post.

ORF/Gene Position (start-stop) Length (bp) Predicted function
1 3776-1134 2642 putative tail fiber
3 5153-4641 512 bacterial Ig-like domain
4 5222-5746 524 major head protein forward
5 6220-5165 1055 major head protein reverse
8 9835-7592 2243 putative portal protein
9 11454-9901 1553 Bacteriophage terminase large (ATPase) subunit
14 14950-15063 113 Rossmann fold nucleotide-binding protein (very low score on BLASTX)
15 15066-15338 272 Putative protein (no matches with BLASTX)
17 16124-16318 194 Putative protein (no matches with BLASTX)
51 29229-29699 470 HNH homing endonuclease # Phage intron
61 36552-36983 431 Gamma-glutamyl cyclotransferase
63 37227-39002 1775 primase/helicase
64 38996-39544 548 DNA polymerase
66 41016-41549 533 deoxycytidine triphosphate deaminase
70 42403-42930 527 putative integrase
77 44316-44963 647 putative thymidylate synthase protein
78 44930-45097 167 putative NAD+ diphosphatase
81 45617-45835 218 NAD-dependent DNA ligase
83 46004-46744 740 putative PhoH-like protein
84 46753-46908 155 Putative protein (no matches with BLASTX)
89 48562-50277 1715 DNA polymerase
90 50268-50744 476 homing endonuclease
91 50861-51016 155 DNA_pol_A_pol_I_B
105 53464-53967 503 putative serine/threonine protein
109 55750-56136 386 HNH endonuclease
113 56596-57246 650 RNA polymerase sigma factor SigX
115 57811-58653 842 exonuclease
118 59212-59511 299 WYL domain
119 64185-59740 4445 chromosome segregation protein
122 67968-66916 1052 hemagglutinin protein
123 68757-67978 779 internal virion protein
126 70789-69821 968 putative tail fiber protein
127 73857-70837 3020 bacterial surface protein
128 74670-73867 803 putative baseplate protein
129 75174-74683 491 Phage-related lysozyme (muramidase)
130 75426-75208 218 putative holin
132 76529-75516 1013 tail fiber protein

The contig starts and ends with putative tail proteins, and as the start and end of the genome are at the terminal repeats it is probable that the two tail protein sequences are either parts of the same sequence or two different tail proteins that are located directly after each other. This can be investigated further by finding the correct arrangement of the genome. PhageTerm is a software tool that can predict the correct arrangement based on the terminal repeats. This can also be investigated by researching homologs of ETEC p7.

These sequences of known protein function can be divided into four categories: structural proteins (e.g. tail fiber protein, major head protein, portal protein), replication-assisting proteins (e.g. DNA polymerase, NAD-dependent DNA ligase), nucleotide metabolism proteins (e.g. deoxycytidine triphosphate deaminase) and other proteins (e.g. PhoH-like protein).

The large number of proteins that play a part in the replication and recombination processes suggests that ETEC p7 is a virulent bacteriophage. To this end it is of particular interest that the genome contains a gene encoding a lysozyme. Since these enzymes cause the lysis of bacterial cell walls they are of specific interest in the research of new antibiotics. With more time this gene should have been investigated in more detail against potential homologs to determine its novelty.

Annotation of bacteriophage ETEC p7
Annotation of genome of ETEC p7. Sequences with proteins of known function are in orange. Hypothetical proteins are in green. The terminal repeat is in red. The three sequences with no matches with BLASTX are in purple. Parts of low coverage are in dark red (ends of the contig) and regions of high coverage are in yellow. Graph generated with Geneious version 2019.0 created by Biomatters. Available from

Prediction of coding regions with Glimmer

Background and research

The past few days I have been reading up on and trying to understand how the software package Glimmer works, both theoretically and practically. This has been done by reading the paper accompanying the software [pdf], reading the manual, and testing.

Glimmer is a software package used to find and predict genes in bacteria, archaea and viruses. There are three major versions of the software, and for the prediction of the genes of ETEC p7 the current version, 3.02, was used.

According to the paper accompanying the software, microbial genomes, including those of viruses, are very dense in genes, and thus the accuracy of gene-finding software is very high compared to finding genes in eukaryotes. Glimmer detects genes of prokaryotes and viruses with a sensitivity of over 99%. The third version improved sensitivity by improving the prediction of start sites. The developers also made improvements in detecting false positives. Since prokaryotic genomes have a high gene density, it is difficult to say that a predicted gene is a false positive. In previous versions of Glimmer a source of false positives was also the prediction of too many overlapping genes; overlapping genes are rare in bacterial and viral genomes. The solution the developers of Glimmer chose to reduce the rate of false positives is to check the homology of the predicted genes against (close) relatives of the genome being analyzed.

At its core Glimmer makes predictions of genes in genomes of bacteria and viruses by creating a variable-length Markov model from a training set of genes and then uses the model to predict the genes in a DNA sequence.

Glimmer works in two steps. First an interpolated context model (ICM) has to be built from a training set of genes with the program build-icm. The training set can be obtained in several ways depending on how much data or knowledge there is about the genome being analyzed. If the genes of the genome are known, for example from homology studies, then using them is the best option. Other options are to use the program long-orfs (shipped with the Glimmer package) to generate long, non-overlapping ORFs from the genome, or to use genes from a closely related species. After the ICM is generated, the program glimmer3 is run to make the gene predictions.
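
The two-step workflow can be sketched roughly as follows. The file names are placeholders, and the option values are the ones I recall from the manual's walkthrough, so check them against your own copy:

```shell
# Step 1: build a training set and an ICM from long non-overlapping ORFs
long-orfs -n -t 1.15 genom.seq genom.longorfs   # find long ORFs
extract -t genom.seq genom.longorfs > genom.train
build-icm -r genom.icm < genom.train            # train the model
# Step 2: predict genes with the trained model
glimmer3 -o50 -g110 -t30 genom.seq genom.icm run1
```

In practice the pre-written scripts shipped with Glimmer wrap these steps (plus a second, refined iteration) so they do not have to be run by hand.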

Previously my group mates have done homology searches, BLASTing the genome of ETEC p7 against the genome of SU10, which has the highest identity to the ETEC p7 genome. This resulted in a list of 26 genes that were matched on the SU10 genome. They have also searched for ORFs using NCBI ORFfinder, a tool from NCBI that searches for ORFs in DNA sequences and returns the range of each ORF together with its protein translation. This list was much larger, containing 108 ORFs. Comparing the list of genes and the list of ORFs, there were only five items that probably were the same coding sequences. There are some limitations with ORFfinder. First of all, the online version of ORFfinder limits the query sequence to 50 kb. Second, it uses the start codon “ATG” and alternative initiation codons, or any sense codon, to identify ORFs. Third, although there is an option to choose the genetic code specific to the organism, and even though ORFfinder works better with prokaryotic and viral genomes, there is no option for viral genomes, so the standard genetic code has to be chosen. For all of these reasons I decided not to train Glimmer on the list of genes generated through alignment with SU10, since it cannot be trusted that those genes belong to ETEC p7.


I used the option described in the Glimmer manual where long-orfs is used to find a training set of putative genes from the ETEC p7 genome, a training set is created from this information, and the genes are predicted with Glimmer. The predicted genes are then used as the training set in a second run with Glimmer. This approach is the best one when there are no known genes for the genome. Also, Glimmer ships with a set of pre-written C-shell scripts that minimize the need to write scripts for this approach.

There were a lot of issues getting Glimmer to run the analysis. It took a lot of time and googling to get it to work. I downloaded Glimmer (which is open source) and compiled the binaries. The manual says that in addition to Glimmer the general-purpose Gibbs sampler Elph is also needed, which I also downloaded. To run the analysis I used the script named g3-iterated.csh that can be found in the scripts folder of Glimmer. The paths inside this script need to be adjusted. The script was run from the terminal with default settings, as instructed in the manual.

g3-iterated.csh genom.seq run3

The g3-iterated.csh script is a C shell script that uses long-orfs to make a first prediction of the ORFs, and then uses this first prediction to do a second, more accurate prediction. genom.seq is just the ETEC p7 genome in FASTA format, and run3 is the name that will be prefixed to the files that Glimmer generates.

This script generates a number of different text files, and the one that contains the information about the predicted ORFs is called run3.detail (but I had to rename it to run3.txt to be able to upload it). In this file a summary of the analysis can be found together with all the predicted ORFs. Glimmer predicted 132 potential ORFs in the ETEC p7 genome. For each ORF Glimmer gives the start position of the ORF, the start position of its gene and the stop position, together with a score.

Each gene now has to be extracted, based on the positions given in the run3.detail file, and then homology searches have to be conducted for each sequence. I will do this and post the results of the homology searches in the next post.

Tips to get Glimmer to work

Before ending this post I want to add this section where I explain what issues I had getting Glimmer running; maybe this will help someone in the future googling how to get Glimmer working. As mentioned earlier it took some time and trial and error to get Glimmer to work. Neither Glimmer nor Elph seems to have been updated since 2006. Some things are outdated, and in some cases their documentation does not explain things in much detail. These are the issues I encountered and how I fixed them.

Install tcsh

The script files are C shell scripts and need a C shell to run. I installed the tcsh shell. It is probably not strictly needed and the default shell of the Linux system may work, but since it is easy to install the tcsh shell I used it to be on the safe side. On Ubuntu use:

sudo apt install tcsh

Then write tcsh in the terminal to enter the C shell you just installed.

Edit paths in the script files

In the g3-iterated.csh file you need to adjust three paths to where you have saved the scripts and Elph. This is straightforward, but note that for awkpath you have to set the path to the scripts folder in the Glimmer folder, for glimmerpath you have to set the path to the bin folder in the Glimmer folder, and for elphbin you have to set the path to the Elph binary file.
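
After editing, the three variables near the top of g3-iterated.csh should look something like this (the paths below are examples from my setup; adjust them to yours):

```shell
set awkpath = /home/user/glimmer3.02/scripts
set glimmerpath = /home/user/glimmer3.02/bin
set elphbin = /home/user/ELPH/bin/elph
```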

Add awk -f in the script file

If you try to run the g3-iterated.csh script now you will most likely get an error saying that certain files are not commands. To fix two of these errors, open the g3-iterated.csh file and add awk -f in front of:

$awkpath/upstream-coords.awk 25 0 $tag.coords \

and in front of

$awkpath/get-motif-counts.awk > $tag.motif

Both of these lines are in step six of the script. Save and close the file.
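
For reference, this is roughly how the two lines look after the fix (they are fragments from the middle of the pipelines in step 6, which is why the first one ends with a backslash):

```shell
# first fixed line (part of a longer pipeline):
awk -f $awkpath/upstream-coords.awk 25 0 $tag.coords \
# second fixed line, further down in the same step:
awk -f $awkpath/get-motif-counts.awk > $tag.motif
```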

Compile Elph for 64 bit systems

If you run the script now you will still get a third error, which says that Elph is not a command. This is because Elph is compiled for an i386 (32-bit) system, and virtually all computers nowadays are 64-bit systems. There are several ways of solving this, but since Elph is open source the easiest way is probably to just compile the source code on your computer. Navigate to the sources folder in the Elph folder from the terminal and run make. Some new files are generated. If you want to keep your files tidy, move the newly generated files (elph, elph.o, GAargs.o, Gbase.o, GString.o and motif.o) to a new folder (preferably a bin folder). Don't forget to update the path to the Elph binary in the g3-iterated.csh file.
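
The recompilation boils down to the following (folder names as in the Elph distribution I downloaded):

```shell
cd ELPH/sources
make                   # compiles elph for the local (64-bit) system
mkdir -p ../bin
mv elph *.o ../bin/    # optional tidying; remember to update elphbin
```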

Now when you run the g3-iterated.csh script it should work.

Some minor things that are good to know:

  • Don’t forget to put ./ before the script file name to be able to run it. So write ./g3-iterated.csh sequence.seq prefix-name to run the script
  • Don’t forget to make sure the files are executable, either with the command chmod +x filename or in the graphical interface by right-clicking on the file, choosing Properties and then checking the executable option.

Assessment of the coverage

This post is just some description and reflection on the coverage of the data, since it is good to know if there are any “weak spots” of the assembly.

I downloaded and installed Geneious Prime with a 14-day trial license. I aligned the paired-end reads to the assembled contig. Then I used the built-in tool for coverage calculation in Geneious Prime (with default settings). The output showed regions of high coverage and regions of low coverage. As seen in the image below, the high-coverage region (yellow part) includes the region of the terminal repeats that was found with PhageTerm (and now confirmed in Geneious Prime). The regions of low coverage (red part) are just at the ends of the assembly, where lower coverage is expected due to the lower quality of the reads. Overall the coverage seems to be good, with no gaps.
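
Geneious is graphical, but the same per-position coverage can be computed on the command line. This is a sketch assuming bwa and samtools are installed; the file names are placeholders:

```shell
bwa index contig.fasta
bwa mem contig.fasta reads_R1.fastq.gz reads_R2.fastq.gz |
  samtools sort -o aligned.bam -
samtools depth -a aligned.bam > coverage.txt   # position<TAB>depth per base
```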

The assembled contig with annotation of the terminal repeats (orange part) and coverage. Yellow part for high coverage and red for low coverage.

Introduction to P7 phages and determining nature of chromosome ends

Research and background

The P7 bacteriophage belongs to the order Caudovirales, whose members contain a single linear double-stranded DNA (dsDNA) molecule and have a tail. This order has three known families: Siphoviridae, Myoviridae and Podoviridae. The difference between these families is that they have different types of tails. The P7 bacteriophage belongs to the Myoviridae, which have a complex contractile tail. The mechanisms for DNA replication and packaging into the procapsid can differ between different species of Caudovirales. By analyzing and determining the nature of the chromosome ends, light can be shed on the replication strategy of the bacteriophage.

Caudovirales have six known types of terminal ends. Phages use these different terminal ends to recognize their own DNA, rather than the DNA of their host. Most phages from this order package the DNA into a procapsid from concatemeric (repeating) DNA molecules that are frequently the result of rolling-circle replication. For P7 bacteriophages (which are closely related to P1 bacteriophages) the packaging mechanism is called headful packaging, using a pac site. The pac site is where the terminase can initiate packaging. This leads to phages whose chromosomes are terminally redundant and circularly permuted. An analysis of the termini should confirm this.

After some research it seems there are two approaches to characterizing the termini of phages. The first one, which was also recommended by Professor Nilsson, is to use the software Geneious to look for regions of higher coverage. Since the terminal ends are repeats, it is expected that these regions also have higher coverage. This should be combined with comparing the phage genome to a similar bacteriophage that has already been characterized, to be able to pinpoint the terminal repeats.

The second approach is to use the software PhageTerm. This software is freely available and uses the same principle as described above, looking at regions of the data with a significantly higher number of reads compared to the rest of the genome. The advantage is that, unlike Geneious, which requires experience to determine the termini, PhageTerm uses a theoretical and statistical framework to determine the terminal repeats. Other advantages of PhageTerm are that it has been specifically validated with Illumina technologies, tested with a range of de novo assembled bacteriophages and developed for dsDNA bacteriophages.


PhageTerm was developed by researchers at the Pasteur Institute, which also hosts PhageTerm in a Galaxy wrapper. This instance of PhageTerm was used to analyze the termini of the assembled phage genome. The paired-end data and the assembled genome were given as inputs, with the default settings (seed length = 20, peak surrounding region = 20, limit coverage = 250). This resulted in a report [PDF] that put the starting position of the terminal repeats at 13344 and the ending position at 13592, which makes the terminal repeats 248 bp long. PhageTerm also classifies the ends as redundant and non-permuted. If I understand the report correctly it identifies the genome as belonging to a T7-like bacteriophage, but this needs to be discussed with Professor Nilsson, since the information we were given was that the genome should belong to a P7 bacteriophage. The difference is that the chromosome ends of P1/P7 phages are permuted, while the chromosome ends of T7 phages are not.
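
For reference, PhageTerm can also be run locally instead of on the Galaxy instance. This is a sketch based on my reading of its usage message, so the exact option names may differ between versions; all file names are placeholders:

```shell
# -f/-p: paired-end read files, -r: assembled genome,
# -s: seed length, -n: name prefixed to the output files
python PhageTerm.py -f reads_R1.fastq -p reads_R2.fastq \
    -r assembly.fasta -n ETECp7 -s 20
```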

PhageTerm also generates a file containing the phage genome sequence reorganized according to the termini positions. It is unclear whether we should proceed with this reorganized genome or continue with the genome that was assembled with SPAdes. This also needs to be discussed with Professor Nilsson.

The starting and ending positions of the terminal repeats, as identified by the increased number of reads between these positions.
Summary of the analysis performed by PhageTerm.

Assembly and more quality controls

Last week we focused on doing more quality controls, and we started with alignments. After the adapters had been removed from the reads, a quality control with FastQC clearly showed an increase in the quality of the data.

Then we proceeded to trim and filter the reads with FastX. Trimming is when bases of poor quality are removed, and filtering is when entire reads of poor quality are removed from the dataset, either due to poor average quality, ambiguous base calls or short length. FastX is a collection of command-line tools for pre-processing short reads. Using FastX to filter the data did not result in any changes to the quality of the data. The file sizes of the FASTQ files before and after remained the same, and analyzing the files with FastQC confirmed this, since there were no changes in the plots compared to before filtering. The per base quality plot below is given as an example of there being no changes compared to before the filtering.
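
For reference, the FastX trimming and filtering steps look roughly like this. The cut-off values shown are illustrative, not necessarily the ones we used, and -Q33 tells the tools that the reads use Phred+33 quality encoding:

```shell
# trim low-quality tails (below Q20), drop reads shorter than 36 bp
fastq_quality_trimmer -Q33 -t 20 -l 36 -i reads.fastq -o reads_trimmed.fastq
# drop reads where fewer than 80% of bases reach Q20
fastq_quality_filter -Q33 -q 20 -p 80 -i reads_trimmed.fastq -o reads_filtered.fastq
```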

The per base quality plot after filtering the reads.

Since the filtering made no difference, we decided to continue with the unfiltered reads.

Next we tried assembling the reads into contigs with two different assemblers: Velvet and SPAdes. After discussion with Anders Nilsson and consulting documentation from Illumina, it was decided to assemble the reads with Velvet with three different k-mer sizes: 21, 41 and 61. SPAdes has an algorithm for choosing the k-mer sizes itself. The literature also recommends assembling with a lower coverage than is common when assembling phage genomes. This is because phage genomes are small, and with a high coverage there is a risk that systematic errors would be treated as natural variation. For this reason the reads were also assembled with only 10% of the reads, both for Velvet and for SPAdes. In addition, the reads were assembled with SPAdes with the “careful” setting on and off. In total ten assemblies were made using SPAdes and Velvet.
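
The assembly runs look roughly like this. seqtk is one possible way to take the 10% subsample (not necessarily what we used), and the Velvet/SPAdes flags may vary between versions; all file names are placeholders:

```shell
# 10% subsample of the paired reads (same seed keeps the pairing intact)
seqtk sample -s100 reads_R1.fq.gz 0.1 > sub_R1.fq
seqtk sample -s100 reads_R2.fq.gz 0.1 > sub_R2.fq
# Velvet with one of the three k-mer sizes
velveth velvet_k21 21 -shortPaired -fastq -separate sub_R1.fq sub_R2.fq
velvetg velvet_k21
# SPAdes picks its own k-mers; run once with and once without --careful
spades.py --careful -1 sub_R1.fq -2 sub_R2.fq -o spades_careful
# assembly metrics, reporting only contigs >= 500 bp
quast.py spades_careful/contigs.fasta -m 500 -o quast_report
```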

The software Quast was used to analyze and acquire metrics for all assemblies. The results reported by Quast were (only contigs > 500 bp reported):

Velvet 21 k-mer:
Number of contigs: 32
Largest contig: 2926 bp
N50: 749 bp
Total length: 24391 bp

Velvet 41 k-mer:
Number of contigs: 43
Largest contig: 1730 bp
N50: 711 bp
Total length: 32159 bp

Velvet 61 k-mer:
Number of contigs: 63
Largest contig: 1241 bp
N50: 660 bp
Total length: 42473 bp

Velvet 21 k-mer (10% coverage):
Number of contigs: 63
Largest contig: 5390 bp
N50: 1862 bp
Total length: 85004 bp

Velvet 41 k-mer (10% coverage):
Number of contigs: 64
Largest contig: 7011 bp
N50: 2777 bp
Total length: 100054 bp

Velvet 61 k-mer (10% coverage):
Number of contigs: 60
Largest contig: 14272 bp
N50: 2194 bp
Total length: 103048 bp

SPAdes (careful):
Number of contigs: 34
Largest contig: 90035 bp
N50: 76572 bp
Total length: 193763 bp

SPAdes (uncareful):
Number of contigs: 34
Largest contig: 90035 bp
N50: 76700 bp
Total length: 193891 bp

SPAdes (10% coverage and careful):
Number of contigs: 8
Largest contig: 90035 bp
N50: 90035 bp
Total length: 169704 bp

SPAdes (10% coverage and uncareful):
Number of contigs: 7
Largest contig: 90035 bp
N50: 90035 bp
Total length: 169705 bp

We are unsure how to judge which of these assemblies is the best based on these metrics, but the genome of the phage is expected to be 80-90 kb in total. None of the SPAdes assemblies fall within this range. Out of the Velvet assemblies, only the 21 k-mer with 10% coverage falls within the expected genome size. The best assembly still remains to be discussed.

Quality control: FastQC before and after adapter removal

Since we realized that we need to quality control the reads, remove adapters, and trim and filter them, we have found the software FastQC, which is a quality control tool for high-throughput data. FastQC does not modify the reads; it just gives different kinds of graphs that report the quality of the reads.

We analyzed the original dataset of reads with FastQC, and the quality of the data is reported as good even though some categories give failures and warnings. The per base sequence quality shows that the general quality of the bases is good, even though it starts to drop towards the end of the reads.

The per base quality before the adapter removal.

The category for sequence duplication levels is the only category that gives a failure. This may indicate some kind of enrichment.

Sequence duplication levels of the raw reads.

The category for over-represented sequences gives a warning. A sequence is regarded as over-represented, and the software will raise a warning, if the sequence makes up more than 0.1% of the total number of sequences. An over-represented sequence may be due to biological importance or to contamination. In the table it can be seen that the over-representation of two of the sequences is due to the Illumina adapter sequence.

Overrepresented reads in the dataset.

The category for adapter content also gives a warning. The graph shows, though, that the source of the warning is a significant amount of Illumina adapter sequences.

The graph for adapter content shows a significant amount of Illumina adapter sequences.

The analysis of the raw reads shows that there is a significant amount of Illumina adapter sequences in the dataset and thus adapter removal should be performed. This was previously done with Trimmomatic and the resulting reads were again analyzed with FastQC.

The per base quality improved drastically compared to before the removal of the adapters as seen in the image below.

The per base quality improved after the adapters were removed.

Before the adapter removal the distribution of sequence lengths had a perfect score, since all the sequences were 300 nucleotides. After the adapters were removed the distribution of sequence lengths changed, as expected. Most of the sequences are still very close to 300 nucleotides. According to the FastQC manual, the software raises a warning if the sequences are not all the same length. But this should not be a big issue in this case.

The sequence length distribution changed from a pass to a warning after the adapter removal.

The sequence duplication levels did not change much after the removal of the adapters and still raise a failure, indicating an enrichment bias. The software raises a failure if more than 50% of the total number of sequences are non-unique. I’m not sure if this will cause an issue with the assembly, but we decided to continue with the next steps without looking closer into this issue.

The sequence duplication levels did not change compared to before the adapters were removed.

In the table of overrepresented sequences it can be seen that the adapter sequences have been removed, as expected. The rest of the sequences from before the removal are still present and their sources are still unknown; they should probably be BLASTed to find out more about their importance.

In the overrepresented table adapter sequences have been removed.

Finally, as expected the graph for adapter content shows that all the adapters were removed. This category turned from a warning to a pass.

The adapter content shows that all the adapters were removed.

In conclusion we decided that the adapter removal improved the data quality and decided to continue with the adapter removed reads.

Quality control: Removing adapters from reads in dataset

The group met today and we added quality control of the reads to the project plan. We looked at using either Trimmomatic or Cutadapt. Trimmomatic would be the preferred option since it is a trimming tool for Illumina NGS data. The adapter sequences to be removed are also distributed with the software, unlike Cutadapt, where the user has to specify the adapter sequences that should be removed.

According to the manual of Trimmomatic, a FASTA file should be specified (in addition to the dataset) that contains the adapter sequences (and PCR sequences etc.). This file is distributed with Trimmomatic and contains the Illumina adapter sequences. It does not really make sense to me that the path to this file needs to be specified, since it is distributed with the software, and since the software only works with data from Illumina sequencing machines there are not many different options for the user to specify. Finding this file on a distributed system like Uppmax is what took the most time when trying to use this software. The solution was instead to find this FASTA file with the Illumina sequences on the internet and upload it to the same folder as the files with the reads.

The options used for Trimmomatic were the default options that are specified in the example of the webpage of Trimmomatic (for paired end data):

module add bioinfo-tools

module add trimmomatic

java -jar $TRIMMOMATIC_HOME/trimmomatic.jar PE -phred33 ETECp7_TCCGCGAA-CAGGACGT_L001_R1_001.fastq.gz ETECp7_TCCGCGAA-CAGGACGT_L001_R2_001.fastq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

And the output was:

TrimmomaticPE: Started with arguments:
-phred33 ETECp7_TCCGCGAA-CAGGACGT_L001_R1_001.fastq.gz ETECp7_TCCGCGAA-CAGGACGT_L001_R2_001.fastq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 799754 Both Surviving: 718152 (89.80%) Forward Only Surviving: 77273 (9.66%) Reverse Only Surviving: 631 (0.08%) Dropped: 3698 (0.46%)
TrimmomaticPE: Completed successfully

I can’t really interpret the summary of the output yet. Are the results good or bad? I will look into it tomorrow.
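
One way to start making sense of the summary is to recompute the percentages from the counts it reports: 89.8% of pairs survive intact, and counting the forward-only and reverse-only survivors as well, about 94.7% of all individual reads are retained in some form, which looks reasonable for an adapter-removal step.

```shell
awk 'BEGIN {
  input = 799754; both = 718152; fwd = 77273; rev = 631
  printf "pairs surviving intact: %.2f%%\n", 100 * both / input
  printf "reads retained overall: %.2f%%\n", 100 * (2*both + fwd + rev) / (2*input)
}'
```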