Prediction of coding regions with Glimmer

Background and research

Over the past few days I have been reading up on how the software package Glimmer works, both in theory and in practice. I did this by reading the paper accompanying the software [pdf], reading the manual, and testing the software.

Glimmer is a software package used to find and predict genes in bacteria, archaea and viruses. There are three major versions of the software, and for the gene predictions of ETEC p7 the current version, 3.02, was used.

According to the paper accompanying the software, the genomes of microbes, including viruses, are very dense in genes, and thus the accuracy of gene-finding software is very high compared to finding genes in eukaryotes. Glimmer has a sensitivity of over 99% for detecting genes in prokaryotes and viruses. The third version improved the sensitivity of gene detection by improving the prediction of start sites. The developers also worked on reducing false positives. Since prokaryotes have a high density of genes, it is difficult to say that a predicted gene is a false positive. In previous versions of Glimmer, one source of false positives was the prediction of too many overlapping genes, even though overlapping genes are rare in bacterial and viral genomes. The solution the developers of Glimmer have chosen to reduce the rate of false positives is to check the homology of the predicted genes with (close) relatives of the genome being analyzed.

At its core, Glimmer predicts genes in the genomes of bacteria and viruses by building a variable-length Markov model from a training set of genes and then using that model to predict the genes in a DNA sequence.
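To make the idea of scoring a sequence with a Markov model more concrete, here is a minimal sketch in Python. It is a simplification and not Glimmer's actual implementation: it uses one fixed context length (`order`) instead of interpolating between contexts of different lengths, and it compares against a uniform background of 0.25 per base. The function names `train_markov` and `log_odds` are made up for this illustration.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences.

    Returns a function prob(context, base). A pseudocount is added so that
    unseen combinations never get probability zero (assumes a 4-letter alphabet).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + 4 * pseudocount
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, prob, order=2):
    """Sum of log(P_model / 0.25) over all positions that have a full context."""
    seq = seq.upper()
    return sum(
        math.log(prob(seq[i - order:i], seq[i]) / 0.25)
        for i in range(order, len(seq))
    )
```

A sequence that resembles the training genes gets a positive score and one that does not gets a negative score. That is the basic principle of scoring candidate ORFs with a trained model, although Glimmer's own scoring is more elaborate.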

Glimmer works in two steps. First, the model (an interpolated context model, ICM) has to be built from a training set of genes with the program build-icm. The training set can be obtained in several ways, depending on how much data or knowledge there is about the genome being analyzed. If genes of the genome are already known, for example from homology studies, then using them is the best option. Other options are to use the program long-orfs (shipped with the Glimmer package) to generate long, non-overlapping ORFs from the genome, or to use genes from a highly similar species. After the ICM is generated, the program glimmer3 is used to run the analysis and make the gene predictions.
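To illustrate what "finding long ORFs" means, here is a small Python sketch of an ORF scan. It is only an illustration under simplifying assumptions and not how long-orfs itself works: it looks at the forward strand only, accepts a single start codon, and does not remove overlapping ORFs. The name `find_orfs` and the default `min_len` are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=90, start="ATG"):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Coordinates are 1-based and inclusive, and the end includes the stop codon.
    Only ORFs of at least `min_len` bases (start to stop) are reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon == start:
                orf_start = i  # first start codon after the previous stop
            elif orf_start is not None and codon in STOPS:
                if i + 3 - orf_start >= min_len:
                    orfs.append((orf_start + 1, i + 3, frame + 1))
                orf_start = None
    return sorted(orfs)
```

A real gene finder also scans the reverse complement and has to decide between alternative start codons, which is exactly where the trained model comes in.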

Previously my group mates did homology searches by BLASTing the ETEC p7 genome against the genome of SU10, which has the highest identity to the ETEC p7 genome. This resulted in a list of 26 genes that matched the SU10 genome. They also searched for ORFs using NCBI ORFfinder, a tool from NCBI that searches for ORFs in DNA sequences and returns the range of each ORF together with its protein translation. This list was much larger, containing 108 ORFs. Comparing the list of genes with the list of ORFs, there were only five items that were probably the same coding sequences. There are some limitations with ORFfinder. First, the online version of ORFfinder limits the query sequence to 50 kb. Second, it identifies ORFs using “ATG”, alternative initiation codons, or any sense codon as the start codon. Third, there is an option to choose the genetic code that is specific to the organism being analyzed, but even though ORFfinder works better with prokaryotic and viral genomes, there is no option for viral genomes, so the standard genetic code has to be chosen. For all of the reasons presented in this paragraph, I decided not to use Glimmer with the list of genes that was generated through alignment with SU10, since it can’t be trusted that those genes belong to ETEC p7.

Experiment

I used the option described in the Glimmer manual where long-orfs is used to find a training set of putative genes in the ETEC p7 genome, a model is built from this training set, and the genes are then predicted with Glimmer. The predicted genes are then used as the training set in a second run with Glimmer. This approach is the best one when there are no known genes for the genome. Also, Glimmer ships with a set of pre-written C shell scripts that minimize the need to write your own scripts for this approach.

I had a lot of issues getting Glimmer to run the analysis, and it took a lot of time and googling to get it to work. I downloaded Glimmer (which is open source) and compiled the binaries. The manual says that in addition to Glimmer the general-purpose Gibbs sampler Elph is also needed, so I downloaded that as well. To run the analysis I used the script named g3-iterated.csh, which can be found in the scripts folder of Glimmer. The paths inside this script need to be adjusted. The script was run from the terminal with default settings, as instructed in the manual.

g3-iterated.csh genom.seq run3

The g3-iterated.csh script is a C shell script that uses long-orfs to make a first prediction of the ORFs, and then uses this first prediction to make a second, more accurate prediction of the ORFs. genom.seq is just the ETEC p7 genome in FASTA format, and run3 is the name that will be used as the prefix for the files that Glimmer generates.

This script generates a bunch of different text files, and the one that contains the information about the predicted ORFs is called run3.detail (but I had to rename it to run3.txt to be able to upload it). This file contains a summary of the analysis together with all the predicted ORFs. Glimmer predicted 132 potential ORFs in the ETEC p7 genome. For each predicted ORF, Glimmer gives the start position of the ORF, the start position of its gene, the stop position, and the score.

Each gene now has to be extracted based on the positions given in the run3.detail file, and then a homology search has to be conducted for each sequence. I will do this and post the results of the homology searches in the next post.
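The extraction step could look roughly like the Python sketch below. It rests on an assumption: that the coordinates are available as whitespace-separated rows of the form `orf_id start end frame score`, with 1-based inclusive positions and a negative frame for genes on the reverse strand. The exact layout of the Glimmer output files should be checked before relying on this. Genes that wrap around the end of a circular genome are not handled, and the function names are made up for this example.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_predictions(lines):
    """Parse rows of the assumed form 'orf_id start end frame score'.

    Header lines (starting with '>') and rows that do not fit are skipped.
    """
    records = []
    for line in lines:
        fields = line.split()
        if line.startswith(">") or len(fields) < 5:
            continue
        try:
            orf_id = fields[0]
            start, end, frame = int(fields[1]), int(fields[2]), int(fields[3])
            score = float(fields[4])
        except ValueError:
            continue  # not a coordinate row
        records.append((orf_id, start, end, frame, score))
    return records


def extract_gene(genome, start, end, frame):
    """Cut one gene out of `genome` using 1-based inclusive coordinates."""
    lo, hi = min(start, end), max(start, end)
    fragment = genome[lo - 1:hi]
    return reverse_complement(fragment) if frame < 0 else fragment
```

Here `genome` is the plain sequence string without the FASTA header line. Each extracted sequence can then be written to its own FASTA record and used as a query in the homology search.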

Tips to get Glimmer to work

Before ending this post I want to add a section explaining the issues I had with getting Glimmer running. Maybe this will help someone in the future who is googling how to get Glimmer working. As mentioned earlier, it took some time and trial and error to get Glimmer to work. Neither Glimmer nor Elph seems to have been updated since 2006. Some things are outdated, and in some cases the documentation does not explain things in much detail. These are the issues I encountered and how I fixed them.

Install tcsh

The script files are C shell scripts, so a C shell is needed to run them. I installed the tcsh shell. It is probably not needed and the default shell of the Linux system will work, but since tcsh is easy to install I used it to be on the safe side. On Ubuntu use:

sudo apt install tcsh

Then write tcsh in the terminal to enter the C shell you just installed.

Edit paths in the script files

In the g3-iterated.csh file you need to adjust three paths so that they point to where you have saved the scripts and Elph. This is straightforward, but notice that awkpath has to be set to the scripts folder in the Glimmer folder, glimmerpath has to be set to the bin folder in the Glimmer folder, and elphbin has to be set to the Elph binary file.

Add awk -f in the script file

If you try to run the g3-iterated.csh script now, you will most likely get errors saying that certain files are not commands. To fix two of these errors, open the g3-iterated.csh file and add awk -f in front of:

$awkpath/upstream-coords.awk 25 0 $tag.coords \

and in front of

$awkpath/get-motif-counts.awk > $tag.motif

Both of these lines of code are in step six. Save and close the file.
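If you would rather not edit the file by hand, the same change can be sketched in Python. This is only a sketch, and it assumes that every `$awkpath/...` reference ending in `.awk` is a call to an awk script. The helper name `add_awk_prefix` is made up, and it works on the text of the script, so you still have to read the file and write it back yourself.

```python
import re

# Match "$awkpath/<name>.awk" unless it is already preceded by "awk -f ".
AWK_CALL = re.compile(r"(?<!awk -f )\$awkpath/(\S+\.awk)")


def add_awk_prefix(script_text):
    """Insert 'awk -f ' in front of each awk script call in the script text."""
    return AWK_CALL.sub(r"awk -f $awkpath/\1", script_text)


# Small demonstration on the two lines from step six.
example = (
    "$awkpath/upstream-coords.awk 25 0 $tag.coords \\\n"
    "$awkpath/get-motif-counts.awk > $tag.motif\n"
)
patched = add_awk_prefix(example)
```

Running the function twice gives the same result as running it once, so it is safe to apply to a script that has already been patched. Keep a backup copy of the original script before overwriting it.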

Compile Elph for 64 bit systems

If you run the script now, you still get the third error, which says that Elph is not a command. This is because Elph is compiled for an i386 system (32-bit system), and probably all computers nowadays are 64-bit systems. There are several ways of solving this, but since Elph is open source the easiest way is probably to just compile the source code on your computer. Navigate to the sources folder in the Elph folder from the terminal and write make. Some new files are generated. If you want to keep your files tidy, move the newly generated files (elph, elph.o, GAargs.o, Gbase.o, GString.o and motif.o) to a new folder (preferably in the bin folder). Don’t forget to update the path to the elph binary file in the g3-iterated.csh file so that it points to this folder.

Now when you run the g3-iterated.csh script it should work.

Miscellaneous
Some minor things that are good to know.

  • Don’t forget to put ./ before the script file name to be able to run it. So write ./g3-iterated.csh sequence.seq prefix-name to run the script.
  • Don’t forget to make sure the files are executable, either with the command chmod +x filename or in the graphical interface by right-clicking on the file, choosing Properties and then choosing the executable option.