Characterization of genes predicted by Glimmer

In the last post I used Glimmer to predict ORFs. Glimmer generated a list of possible ORFs together with the start and end position of each ORF and of the gene it contains. I ran homology searches with BLASTX for all of the genes. The only setting I changed was the genetic code, which was set to Bacteria and Archaea (11); all other settings were left at their defaults. In this post I summarize the findings of the homology searches.

Glimmer predicted 132 ORFs. Just by looking at the start and end positions, one of them is most likely wrongly predicted: it starts at position 1090 and ends at 76526. Since this is almost the entire genome, my basic knowledge of ORFs, genes and coding regions tells me it must be an incorrect prediction. There might be more incorrectly predicted genes, but these are not obvious to point out with the level of knowledge we have gained so far. For example, some hypothetical proteins are predicted more than once in almost the same positions, or run in opposite directions and overlap each other. These cases can most likely be resolved by looking more closely at the gene sequences and comparing them to close relatives of ETEC p7 to figure out which member of each doublet is correct, but due to time constraints this has not been done.
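The two checks described above (an ORF that spans nearly the whole contig, and ORFs that overlap each other) can be automated. The following is a minimal Python sketch and not part of Glimmer. The function names, the 50% span threshold, the 100-position overlap threshold and the label "whole_genome_orf" are assumptions made for illustration. The coordinates are taken from this post, and the contig length is approximated by the largest coordinate seen.

```python
def span(orf):
    """Return (low, high) coordinates of an (id, start, stop) tuple, ignoring strand."""
    _, start, stop = orf
    return min(start, stop), max(start, stop)

def flag_oversized(orfs, max_fraction=0.5):
    """Return ids of ORFs covering more than max_fraction of the observed extent."""
    extent = max(span(o)[1] for o in orfs)  # rough stand-in for the contig length
    return [o[0] for o in orfs if (span(o)[1] - span(o)[0]) / extent > max_fraction]

def overlapping_pairs(orfs, min_overlap=100):
    """Return pairs of ids whose spans share at least min_overlap positions."""
    pairs = []
    ordered = sorted(orfs, key=lambda o: span(o)[0])
    for i, a in enumerate(ordered):
        a_low, a_high = span(a)
        for b in ordered[i + 1:]:
            b_low, b_high = span(b)
            if b_low > a_high:
                break  # sorted by low coordinate, so no later ORF can overlap a
            if min(a_high, b_high) - b_low + 1 >= min_overlap:
                pairs.append((a[0], b[0]))
    return pairs

# A few predictions from this post: (id, start, stop); start > stop means reverse strand.
orfs = [
    ("1", 3776, 1134),
    ("3", 5153, 4641),
    ("4", 5222, 5746),
    ("5", 6220, 5165),
    ("whole_genome_orf", 1090, 76526),  # hypothetical label for the suspicious ORF
]

oversized = flag_oversized(orfs)
remaining = [o for o in orfs if o[0] not in oversized]
pairs = overlapping_pairs(remaining)
print(oversized)  # ['whole_genome_orf']
print(pairs)      # [('5', '4')]
```

The flagged pair corresponds to the two "major head protein" predictions that run in opposite directions over nearly the same region. Which of the two is correct still has to be decided by hand.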

On the topic of hypothetical proteins: most of the predicted genes were classified as hypothetical proteins, and are thus of unknown function, but most of them have homologs in bacteriophages such as ECBP2, ECBP3 and phi32.

Out of the 131 ORFs (not counting the ORF that covered almost the entire genome), 34 (about 26%) encoded proteins of known function. These are listed in the table below. For three of the 131 ORFs no matches could be found with BLASTX. Since these might also be interesting to examine further, they are listed in the table as well. A graphical view of the annotated genome can be seen further down in this post.

ORF/Gene  Position (start-stop)  Length  Predicted function
1         3776-1134              2642    putative tail fiber
3         5153-4641              512     bacterial Ig-like domain
4         5222-5746              524     major head protein forward
5         6220-5165              1055    major head protein reverse
8         9835-7592              2243    putative portal protein
9         11454-9901             1553    Bacteriophage terminase large (ATPase) subunit
14        14950-15063            113     Rossmann fold nucleotide-binding protein (very low score on BLASTX)
15        15066-15338            272     Putative protein (no matches with BLASTX)
17        16124-16318            194     Putative protein (no matches with BLASTX)
51        29229-29699            470     HNH homing endonuclease # Phage intron
61        36552-36983            431     Gamma-glutamyl cyclotransferase
63        37227-39002            1775    primase/helicase
64        38996-39544            548     DNA polymerase
66        41016-41549            533     deoxycytidine triphosphate deaminase
70        42403-42930            527     putative integrase
77        44316-44963            647     putative thymidylate synthase protein
78        44930-45097            167     putative NAD+ diphosphatase
81        45617-45835            218     NAD-dependent DNA ligase
83        46004-46744            740     putative PhoH-like protein
84        46753-46908            155     Putative protein (no matches with BLASTX)
89        48562-50277            1715    DNA polymerase
90        50268-50744            476     homing endonuclease
91        50861-51016            155     DNA_pol_A_pol_I_B
105       53464-53967            503     putative serine/threonine protein
109       55750-56136            386     HNH endonuclease
113       56596-57246            650     RNA polymerase sigma factor SigX
115       57811-58653            842     exonuclease
118       59212-59511            299     WYL domain
119       64185-59740            4445    chromosome segregation protein
122       67968-66916            1052    hemagglutinin protein
123       68757-67978            779     internal virion protein
126       70789-69821            968     putative tail fiber protein
127       73857-70837            3020    bacterial surface protein
128       74670-73867            803     putative baseplate protein
129       75174-74683            491     Phage-related lysozyme (muramidase)
130       75426-75208            218     putative holin
132       76529-75516            1013    tail fiber protein
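To read the table, it helps to see how strand and length follow from the coordinates. The sketch below is an illustration under two assumptions: that a start larger than the stop means the gene lies on the reverse strand (as the "forward" and "reverse" major head protein rows suggest), and that the coordinates are inclusive. The function name describe_orf is made up for this example.

```python
def describe_orf(start, stop):
    """Return (strand, coordinate difference, inclusive length in nucleotides)."""
    strand = "+" if start < stop else "-"
    difference = abs(stop - start)  # the value shown in the Length column
    inclusive_length = difference + 1  # counts both end positions
    return strand, difference, inclusive_length

print(describe_orf(3776, 1134))  # ('-', 2642, 2643)
print(describe_orf(5222, 5746))  # ('+', 524, 525)
```

Note that the Length column equals the plain difference between the two coordinates. Counting both end positions gives one nucleotide more, and for these rows that inclusive count is a multiple of three, as expected for a coding sequence.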

The contig starts and ends with putative tail proteins. Since the start and end of the genome lie at the terminal repeats, the two tail protein sequences are probably either parts of the same sequence or two different tail proteins that are located directly after each other. This can be investigated further by finding the correct arrangement of the genome. PhageTerm is software that can predict the correct arrangement based on the terminal repeats. Studying homologs of ETEC p7 is another way to investigate this.

These sequences of known protein function can be divided into four categories: structural proteins (e.g. tail fiber protein, major head protein, portal protein), replication-assisting proteins (e.g. DNA polymerase, NAD-dependent DNA ligase), nucleotide metabolism proteins (e.g. deoxycytidine triphosphate deaminase) and other proteins (e.g. PhoH-like protein).

The large number of proteins that play a part in replication and recombination suggests that ETEC p7 is a virulent bacteriophage. In this context it is of particular interest that the genome contains a gene encoding a lysozyme. Since these enzymes lyse bacterial cell walls, they are of specific interest in the search for new antibiotics. With more time, this gene would have been compared in more detail with potential homologs to determine its novelty.

Annotation of bacteriophage ETEC p7
Annotation of genome of ETEC p7. Sequences with proteins of known function are in orange. Hypothetical proteins are in green. The terminal repeat is in red. The three sequences with no matches with BLASTX are in purple. Parts of low coverage are in dark red (ends of the contig) and regions of high coverage are in yellow. Graph generated with Geneious version 2019.0 created by Biomatters. Available from https://www.geneious.com.

Prediction of coding regions with Glimmer

Background and research

The past few days I have been reading up on and trying to understand how the software package Glimmer works, both theoretically and practically. This has been done by reading the paper accompanying the software [pdf], reading the manual, and testing.

Glimmer is a software package used to find and predict genes in bacteria, archaea and viruses. There are three major versions of the software, and for the gene predictions of ETEC p7 the current version, 3.02, was used.

According to the paper accompanying the software, the genomes of microbes, including viruses, are very dense in genes, and thus gene-finding software is much more accurate on them than on eukaryotic genomes. Glimmer is reported to have a sensitivity of over 99% in detecting genes of prokaryotes and viruses. The third version improved this sensitivity by improving the prediction of start sites. The developers also worked on reducing false positives. Since prokaryotes have such a high gene density, it is difficult to say that a predicted gene is a false positive. In previous versions of Glimmer, one source of false positives was the prediction of too many overlapping genes, and overlapping genes are rare in bacterial and viral genomes. As I understand the paper, the approach the developers chose for the false-positive rate is to check the homology of the predicted genes against (close) relatives of the genome being analyzed.

At its core, Glimmer predicts genes in the genomes of bacteria and viruses by creating a variable-length Markov model from a training set of genes and then using that model to predict the genes in a DNA sequence.
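To give a feeling for what "train a Markov model on known genes, then score new sequence" means, here is a deliberately simplified Python sketch. It is not Glimmer's algorithm: it uses a single fixed context length with add-one smoothing, whereas Glimmer's models combine several context lengths. All names (train_markov, log_likelihood) and the toy training sequence are assumptions for illustration.

```python
import math
from collections import defaultdict

def train_markov(sequences, order=2):
    """Count how often each base follows each context of `order` bases."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts

def log_likelihood(seq, counts, order=2, alphabet="ACGT"):
    """Sum of log probabilities of each base given its context (add-one smoothing)."""
    total = 0.0
    for i in range(order, len(seq)):
        context = counts.get(seq[i - order:i], {})
        numerator = context.get(seq[i], 0) + 1
        denominator = sum(context.values()) + len(alphabet)
        total += math.log(numerator / denominator)
    return total

# Toy training set standing in for a set of trusted genes.
model = train_markov(["ATGATGATGATG"])

gene_like = log_likelihood("ATGATG", model)
unrelated = log_likelihood("CCCCCC", model)
print(gene_like > unrelated)  # True: the first sequence resembles the training data
```

A sequence that looks like the training genes gets a higher (less negative) score than one that does not, which is the basic idea behind scoring candidate ORFs with a trained model.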

Glimmer works in two steps. First, a model (an interpolated context model, ICM) has to be built from a training set of genes with the program build-icm. The training set can be obtained in several ways, depending on how much data or knowledge there is about the genome being analyzed. If genes of the genome are already known, for example from homology studies, using them is the best option. Other options are to use the program long-orfs (shipped with the Glimmer package) to generate long, non-overlapping ORFs from the genome, or to use genes from a highly similar species. Second, once the ICM is built, the program glimmer3 is used to run the analysis and make the gene predictions.

Previously my group mates did homology searches by BLASTing the genome of ETEC p7 against the genome of SU10, which has the highest identity to the ETEC p7 genome. This resulted in a list of 26 genes that matched the SU10 genome. They also searched for ORFs using NCBI ORFfinder, a tool from NCBI that searches for ORFs in DNA sequences and returns the range of each ORF together with its protein translation. This list was much larger, containing 108 ORFs. Comparing the list of genes with the list of ORFs, only five items were probably the same coding sequences. There are some limitations with ORFfinder. First, the online version limits the query sequence to 50 kb. Second, it identifies ORFs using “ATG”, alternative initiation codons or any sense codon as the start codon. Third, although there is an option to choose the genetic code specific to the organism, and even though ORFfinder works better with prokaryotic and viral genomes, there is no option for viral genomes, so the standard genetic code has to be chosen. For all of the reasons presented in this paragraph, I decided not to use Glimmer with the list of genes generated through alignment with SU10, since it can’t be trusted that the genes belong to ETEC p7.
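For readers unfamiliar with what an ORF search does, the following Python sketch scans the three forward reading frames for stretches that begin with a start codon and end at the first in-frame stop codon. It is a simplified illustration and not how ORFfinder is implemented: it ignores the reverse strand, and the function name, the minimum length and the toy sequence are assumptions. The start_codons parameter mimics the choice between “ATG” only and additional initiation codons (for example GTG and TTG).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, start_codons=("ATG",), min_codons=3):
    """Return (start, stop, frame) tuples with 1-based inclusive coordinates.

    Only the forward strand is scanned; the stop codon is included in the ORF.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in start_codons:
                start = i  # first start codon after the previous stop
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3, frame + 1))
                start = None
    return orfs

toy = "CCATGAAATTTTAGCC"
print(find_orfs(toy))  # [(3, 14, 3)]
```

Allowing more start codons, or any sense codon, makes such a scan report more and longer candidates, which is one reason an ORF list tends to be much longer than a list of homology-supported genes.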

Experiment

I used the option described in the Glimmer manual where long-orfs is used to find a set of putative genes in the ETEC p7 genome, a training set is created from this information, and the genes are predicted with Glimmer. The predicted genes are then used as the training set in a second run with Glimmer. This approach is the best one when there are no known genes for the genome. Also, Glimmer ships with a set of pre-written C shell scripts that minimize the need to write your own scripts for this approach.

There were a lot of issues getting Glimmer to run the analysis. It took a lot of time and googling to get it to work. I downloaded Glimmer (which is open source) and compiled the binaries. The manual says that in addition to Glimmer the general-purpose Gibbs sampler Elph is needed, so I downloaded that as well. To run the analysis I used the script named g3-iterated.csh, which can be found in the scripts folder of Glimmer. The paths inside this script need to be adjusted. The script was run from the terminal with default settings, as instructed in the manual.

g3-iterated.csh genom.seq run3

The g3-iterated.csh script is a C shell script that uses long-orfs to make a first prediction of the ORFs, and then uses this first prediction to make a second, more accurate prediction. genom.seq is just the ETEC p7 genome in FASTA format, and run3 is the name that will be used as the prefix for the files Glimmer generates.

This script generates a number of text files, and the one that contains the information about the predicted ORFs is called run3.detail (I had to rename it to run3.txt to be able to upload it). This file contains a summary of the analysis together with all the predicted ORFs. Glimmer predicted 132 potential ORFs in the ETEC p7 genome. For each ORF, Glimmer gives the start position of the ORF, the start position of its gene and the stop position, together with a score.

Each gene now has to be extracted based on the positions given in the run3.detail file, and then a homology search has to be run for each sequence. I will do this and post the results of the homology searches in the next post.
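The extraction step can be sketched in Python as follows. This is a minimal example under stated assumptions: coordinates are 1-based and inclusive, a start larger than the stop means the gene is on the reverse strand (so the reverse complement is returned), and the genome file holds a single FASTA record. The function names and the toy genome are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def extract_orf(genome, start, stop):
    """Cut out a gene given 1-based inclusive coordinates.

    If start > stop the gene is assumed to lie on the reverse strand.
    """
    if start <= stop:
        return genome[start - 1:stop]
    return reverse_complement(genome[stop - 1:start])

def read_single_fasta(path):
    """Read a FASTA file with one record and return its sequence as one string."""
    with open(path) as handle:
        return "".join(line.strip() for line in handle if not line.startswith(">"))

# Toy genome with a small gene at positions 5-13 on the forward strand.
toy_genome = "AACCATGAAATAGGG"
forward_gene = extract_orf(toy_genome, 5, 13)

# The same gene seen from the other strand: coordinates are mirrored.
flipped_genome = reverse_complement(toy_genome)
reverse_gene = extract_orf(flipped_genome, 11, 3)

print(forward_gene)  # ATGAAATAG
print(reverse_gene)  # ATGAAATAG
```

For the real data, the genome would be loaded with read_single_fasta("genom.seq") and extract_orf called once per predicted gene, after which each sequence can be submitted to BLASTX.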

Tips to get Glimmer to work

Before ending this post I want to add a section explaining the issues I had getting Glimmer running. Maybe it will help someone in the future who is googling how to get Glimmer working. As mentioned earlier, it took some time and trial and error. Neither Glimmer nor Elph seems to have been updated since 2006. Some things are outdated, and in some cases the documentation does not explain things in much detail. These are the issues I encountered and how I fixed them.

Install tcsh

The script files are C shell scripts, so they are meant to be run by a C shell. I installed the tcsh shell. It may not be strictly necessary if the default shell of your Linux system can run the scripts, but since tcsh is easy to install I used it to be on the safe side. On Ubuntu use:

sudo apt install tcsh

Then type tcsh in the terminal to enter the C shell you just installed.

Edit paths in the script files

In the g3-iterated.csh file you need to adjust three paths so that they point to where you have saved the scripts and Elph. This is straightforward, but note that awkpath has to be set to the scripts folder in the Glimmer folder, glimmerpath to the bin folder in the Glimmer folder, and elphbin to the Elph binary file itself.
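Before running the script, it can save time to confirm that the three locations really contain what the script expects. The Python sketch below is a hypothetical helper, not something shipped with Glimmer. It assumes that the scripts folder contains upstream-coords.awk and that the bin folder contains a glimmer3 binary, both of which are mentioned in this post. The example paths are placeholders.

```python
import os

def check_glimmer_paths(awkpath, glimmerpath, elphbin):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    awk_script = os.path.join(awkpath, "upstream-coords.awk")
    if not os.path.isfile(awk_script):
        problems.append(f"awkpath: {awk_script} not found")
    glimmer_binary = os.path.join(glimmerpath, "glimmer3")
    if not (os.path.isfile(glimmer_binary) and os.access(glimmer_binary, os.X_OK)):
        problems.append(f"glimmerpath: {glimmer_binary} missing or not executable")
    if not (os.path.isfile(elphbin) and os.access(elphbin, os.X_OK)):
        problems.append(f"elphbin: {elphbin} missing or not executable")
    return problems

# Placeholder locations; replace them with where you unpacked Glimmer and Elph.
for problem in check_glimmer_paths("/path/to/glimmer/scripts",
                                   "/path/to/glimmer/bin",
                                   "/path/to/elph/bin/elph"):
    print(problem)
```

Each reported line names the variable in g3-iterated.csh that needs to be corrected.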

Add awk -f in the script file

If you try to run the g3-iterated.csh script now, you will most likely get errors saying that certain files are not commands. To fix two of these errors, open the g3-iterated.csh file and add awk -f in front of:

$awkpath/upstream-coords.awk 25 0 $tag.coords \

and in front of

$awkpath/get-motif-counts.awk > $tag.motif

Both of these lines of code are in step six. Save and close the file.

Compile Elph for 64-bit systems

If you run the script now you still get the third error, which says that Elph is not a command. This is because Elph is compiled for an i386 (32-bit) system, and most computers nowadays are 64-bit systems. There are several ways of solving this, but since Elph is open source the easiest way is probably to compile the source code on your own computer. Navigate to the sources folder in the Elph folder from the terminal and type make. Some new files are generated. If you want to keep your files tidy, move the newly generated files (elph, elph.o, GAargs.o, Gbase.o, GString.o and motif.o) to a new folder (preferably in the bin folder). Don’t forget to update the path in the g3-iterated.csh file so that it points to the elph binary in this folder.

Now when you run the g3-iterated.csh script it should work.

Miscellaneous
Some minor things that are good to know.

  • Don’t forget to put ./ before the script file name to be able to run it. So write ./g3-iterated.csh sequence.seq prefix-name to run the script.
  • Don’t forget to make sure the files are executable, either with the command chmod +x filename or in the graphical interface by right-clicking on the file, choosing Properties and then choosing the executable option.