Prediction of coding regions with Glimmer

Background and research

Over the past few days I have been reading up on and trying to understand how the software package Glimmer works, both theoretically and practically. This has involved reading the paper accompanying the software [pdf], reading the manual, and testing.

Glimmer is a software package used to find and predict genes in bacteria, archaea and viruses. There are three major versions of the software, and for the prediction of the genes of ETEC p7 the current version, 3.02, was used.

According to the paper accompanying the software, the genomes of microbes, including viruses, are very dense in genes, and the accuracy of gene-finding software is therefore very high compared to finding genes in eukaryotes. Glimmer has a sensitivity of over 99% for detecting genes in prokaryotes and viruses. The third version improved sensitivity by improving the prediction of start sites. The developers also made improvements in reducing false positives. Since prokaryotic genomes have a high gene density it is difficult to say whether a predicted gene is a false positive. In previous versions of Glimmer one source of false positives was also the prediction of too many overlapping genes; overlapping genes are rare in bacterial and viral genomes. The solution the developers of Glimmer have chosen to reduce the false-positive rate is to check the homology of the predicted genes against (close) relatives of the genome being analyzed.

At its core Glimmer predicts genes in the genomes of bacteria and viruses by creating a variable-length Markov model from a training set of genes and then using the model to predict the genes in a DNA sequence.

Glimmer works in two steps. First an interpolated context model (ICM) has to be built from a training set of genes with the program build-icm. The training set can be obtained in several ways depending on how much data or knowledge there is about the genome being analyzed. If the genes of the genome are known, for example from homology studies, then that is the best option. Other options are to use the program long-orfs (shipped with the Glimmer package) to generate long, non-overlapping ORFs from the genome, or to use genes from a closely related species. After the ICM is generated the program glimmer3 is used to make the gene predictions, roughly as sketched below.
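For reference, the from-scratch workflow described in the manual boils down to roughly the following commands. This is a minimal sketch: the file names are placeholders and the option values are the ones suggested in the Glimmer documentation, so check them there before running.

# find long, non-overlapping ORFs to use as training data
long-orfs -n -t 1.15 genom.seq run.longorfs
# pull the training sequences out of the genome
extract genom.seq run.longorfs > run.train
# build the interpolated context model (ICM) from the training set
build-icm -r run.icm < run.train
# predict genes with the ICM
glimmer3 -o50 -g110 -t30 genom.seq run.icm run

The g3-iterated.csh script used later in this post automates these steps and then repeats the model building and prediction with the first round of predictions as the training set.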

Previously my group mates have done homology searches: they BLASTed the genome of ETEC p7 against the genome of SU10, which has the highest identity to the ETEC p7 genome. This resulted in a list of 26 genes that were matched on the SU10 genome. They have also searched for ORFs using NCBI ORFfinder, a tool from NCBI that searches for ORFs in DNA sequences and returns the range of each ORF together with its protein translation. This list was much larger, containing 108 ORFs. Comparing the list of genes and the list of ORFs, there were only five items that were probably the same coding sequences. There are some limitations with ORFfinder. First of all, the online version of ORFfinder limits the query sequence to 50 kb. Second, it uses “ATG”, alternative initiation codons or any sense codon as start codons to identify ORFs. Third, there is an option to choose a genetic code specific to the organism being analyzed, and even though ORFfinder works better with prokaryotic and viral genomes there is no option for viral genomes, so the standard genetic code has to be chosen. For all of the reasons presented in this paragraph I decided not to use Glimmer with the list of genes that was generated through alignment with SU10, since it cannot be trusted that those genes belong to ETEC p7.

Experiment

I used the option described in the Glimmer manual where long-orfs is used to find a training set of putative genes from the ETEC p7 genome, a training set is created from this information, and the genes are predicted with Glimmer. The predicted genes are then used as the training set in a second run with Glimmer. This approach is the best one when there are no known genes of the genome. Also, Glimmer comes shipped with a set of pre-written C shell scripts that minimize the need to write your own scripts for this approach.

There were a lot of issues with getting Glimmer to run the analysis. It took a lot of time and googling to get it to work. I downloaded Glimmer (which is open source) and compiled the binaries. The manual says that in addition to Glimmer the general-purpose Gibbs sampler Elph is also needed, so I downloaded that as well. To run the analysis I used the script named g3-iterated.csh, which can be found in the scripts folder of Glimmer. The paths inside this script need to be adjusted. The script was run from the terminal with default settings, as instructed in the manual:

g3-iterated.csh genom.seq run3

The g3-iterated.csh script is a C shell script that uses long-orfs to make a first prediction of the ORFs and then uses this first prediction to do a second, more accurate prediction. genom.seq is just the ETEC p7 genome in FASTA format and run3 is the name that will be the prefix of the files that Glimmer generates.

This script generates a bunch of different text files with information, and the one that contains the information about the predicted ORFs is called run3.detail (but I had to rename it to run3.txt to be able to upload it). In this file a summary of the analysis can be found together with all the predicted ORFs. Glimmer predicted 132 potential ORFs in the ETEC p7 genome. For each predicted ORF Glimmer gives the start position of the ORF, the start position of its gene and the stop position, together with the score of the prediction.

Each gene now has to be extracted, based on the positions given in the run3.detail file, and then homology searches have to be conducted for each sequence. I will do this and post the results of the homology searches in the next post.
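A minimal sketch of how this extraction could be done with the extract program that ships with Glimmer, assuming the final predictions are in run3.predict with the usual layout of one header line followed by five columns (ORF id, start, end, frame, score); the exact options for extract are in the manual:

# skip the header line and keep only id, start and end
tail -n +2 run3.predict | awk '{print $1, $2, $3}' > run3.coords
# pull the predicted gene sequences out of the genome
extract genom.seq run3.coords > run3.genes.fasta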

Tips to get Glimmer to work

Before ending this post I want to add a section explaining what issues I had with getting Glimmer running; maybe this will help someone in the future googling how to get Glimmer working. As mentioned earlier it took some time and trial and error to get Glimmer to work. Neither Glimmer nor Elph seems to have been updated since 2006. Some things are outdated, and in some cases their documentation does not explain things in much detail. These are the issues I encountered and how I fixed them.

Install tcsh

The script files are C shell scripts and need a C shell to run. I installed the tcsh shell. It is probably not strictly needed and the default shell of the Linux system may work, but since tcsh is easy to install I used it to be on the safe side. On Ubuntu use:

sudo apt install tcsh

Then write tcsh in the terminal to enter the C shell you just installed.

Edit paths in the script file

In the g3-iterated.csh file you need to adjust three paths to where you have saved the scripts and Elph. This is straightforward, but notice that for awkpath you have to set the path to the scripts folder in the Glimmer folder, for glimmerpath you have to set the path to the bin folder in the Glimmer folder, and for elphbin you have to set the path to the Elph binary file.
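As an example, the three lines in my copy of the script ended up looking roughly like this; the paths below are only placeholders for wherever Glimmer and Elph are unpacked on your system:

set awkpath = /home/user/glimmer3.02/scripts
set glimmerpath = /home/user/glimmer3.02/bin
set elphbin = /home/user/ELPH/bin/elph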

Add awk -f in the script file

If you try to run the g3-iterated.csh script now you will most likely get errors saying that certain files are not commands. To fix two of these errors, open the g3-iterated.csh file and add awk -f in front of:

$awkpath/upstream-coords.awk 25 0 $tag.coords \

and in front of

$awkpath/get-motif-counts.awk > $tag.motif

Both of these lines of code are in step 6 of the script. Save and close the file.
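After the edit the two fragments should read as follows; anything else on or around these lines in the script stays unchanged:

awk -f $awkpath/upstream-coords.awk 25 0 $tag.coords \
awk -f $awkpath/get-motif-counts.awk > $tag.motif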

Compile Elph for 64-bit systems

If you run the script now you will still get the third error, which says that Elph is not a command. This is because Elph is compiled for an i386 (32-bit) system, and nearly all computers nowadays are 64-bit systems. There are several ways of solving this, but since Elph is open source the easiest way is probably to just compile the source code on your computer. Navigate to the sources folder in the Elph folder from the terminal and write make. Some new files are generated. If you want to keep your files tidy, move the newly generated files (elph, elph.o, GAargs.o, Gbase.o, GString.o and motif.o) to a new folder (preferably the bin folder). Don't forget to update the path to the elph binary in the g3-iterated.csh file.
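In practice the commands were roughly these; the folder names depend on where Elph was extracted, and moving the generated files is optional:

cd ELPH/sources
# compile the elph binary for the current system
make
# keep the newly generated files together in the bin folder
mkdir -p ../bin
mv elph *.o ../bin/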

Now when you run the g3-iterated.csh script it should work.

Miscellaneous
Some minor stuff that is good to know.

  • Don’t forget to put ./ before the script file name to be able to run it. So write ./g3-iterated.csh sequence.seq prefix-name to run the script.
  • Don’t forget to make sure the files are executable, either with the command chmod +x filename or in the graphical interface by right-clicking on the file, choosing Properties and then ticking the executable option.

Introduction to P7 phages and determining the nature of chromosome ends

Research and background

The P7 bacteriophage belongs to the order Caudovirales, whose members contain a single linear double-stranded DNA (dsDNA) molecule and have a tail. This order has three known families: Siphoviridae, Myoviridae and Podoviridae. The difference between these families is that they have different types of tails. The P7 bacteriophage belongs to the Myoviridae, which have a complex contractile tail. The mechanisms for DNA replication and packaging into the procapsid can differ between species of Caudovirales. Analyzing and determining the nature of the chromosome ends can therefore shed light on the replication strategy of the bacteriophage.

Caudovirales have six known types of terminal ends. Phages use these different terminal ends to recognize their own DNA rather than the DNA of their host. Most phages in this order package the DNA into a procapsid from concatemeric (repeating) DNA molecules that are frequently the result of rolling-circle replication. For P7 bacteriophages (which belong to the same species as the P1 bacteriophage) the packaging mechanism is called headful packaging, using a pac site. The pac site is where the terminase initiates packaging. This leads to phages whose chromosomes are terminally redundant and circularly permuted. An analysis of the termini should confirm this.

After some research it seems there are two approaches to characterizing the termini of phages. The first one, which was also recommended by Professor Nilsson, is to use the software Geneious to look for regions of higher coverage. Since the terminal ends are repeats it is expected that these regions also have higher coverage. This should be combined with comparing the phage genome to a similar, already characterized bacteriophage, to be able to pinpoint the terminal repeats.

The second approach is to use the software PhageTerm. This software is freely available and uses the same principle as described above, looking for regions of the data with a significantly higher number of reads compared to the rest of the genome. The advantage is that, unlike Geneious, which requires experience to determine the termini, PhageTerm uses a theoretical and statistical framework to determine the terminal repeats. Other advantages of PhageTerm are that it has been specifically validated with Illumina technologies, tested with a range of de novo assembled bacteriophages, and developed for dsDNA bacteriophages.

Experiments

PhageTerm is developed by researchers at the Pasteur Institute, and the institute also hosts PhageTerm in a Galaxy wrapper. This instance of PhageTerm was used to analyze the termini of the assembled phage genome. The paired-end data and the assembled genome were given as inputs, with the default settings (seed length = 20, peak surrounding region = 20, limit coverage = 250). This resulted in a report [PDF] that put the starting position of the terminal repeats at 13344 and the ending position at 13592, which makes the terminal repeats 248 bp long. PhageTerm also classifies the ends as redundant and non-permuted. If I understand the report correctly it identifies the genome as belonging to a T7 bacteriophage, but this needs to be discussed with Professor Nilsson, since the information we were given was that the genome should belong to a P7 bacteriophage. The difference between P1/P7 bacteriophages and T7 bacteriophages is that the chromosome ends of P1/P7 phages are permuted while the chromosome ends of T7 phages are not.

PhageTerm also generates a file containing the phage genome sequence reorganized according to the termini positions. It is unclear whether we should proceed with this reorganized genome file or continue with the genome that was assembled with SPAdes. This also needs to be discussed with Professor Nilsson.

The starting and ending positions of the terminal repeats, as identified by the increased number of reads between these positions.
Summary of the analysis performed by PhageTerm.

Summary of the latest findings on the viral genome

This post will only be a summary of some of the knowledge that the team gained during the past week or so about the phage and the genome of the phage we were assigned.

First of all, out of the ten assemblies we decided to continue with the one that was assembled with SPAdes, where all reads were given as input and the careful setting was used. This resulted in an assembly with 34 contigs, but where two of the contigs make up the majority of the total size. One of these contigs is 90000 bp and the other one is 76000 bp. The rest of the contigs are about 3000 bp or shorter. This made us think that one of the two larger contigs might be the phage genome. Later it was confirmed by Anders Nilsson that both of these contigs are phage genomes, from two different phages. The smaller contig is the genome that the research group was looking for, and so in the continued work of this project we have started to characterize and annotate this genome. Anders also mentioned that this phage is virulent, so we expect, and have to some extent confirmed, that this phage has its own enzymes and mechanisms for replication. This was done by using BLAST against other phage genomes. We were also informed by Anders that this phage belongs to the family of P7 phages. Further annotation should thus proceed by using BLAST against genomes belonging to this family of phages, as far as possible.

We also performed BLAST searches against the human and E. coli genomes and found matches that could not be found in bacteriophages. We thus concluded that there is contamination from these species in the samples. This is nonetheless largely irrelevant, as we have been given confirmation that the 76000 bp contig is the genome of the phage of interest. But this contig should still be BLASTed against E. coli and human to assess whether reads from these species might have been incorporated into the contig.
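A sketch of how such a check could be done locally with the BLAST+ command-line tools, assuming a downloaded E. coli reference genome (the same would be repeated against a human reference; all file names here are placeholders):

# build a nucleotide BLAST database from the E. coli reference
makeblastdb -in ecoli_reference.fasta -dbtype nucl -out ecoli_db
# search the 76000 bp contig against it, tabular output
blastn -query phage_contig.fasta -db ecoli_db -outfmt 6 -out contig_vs_ecoli.tsv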

We decided to divide the work among us: one of us would do research on the biology of our phage of interest and compare it to other phages as one means of characterizing our phage genome, one would research the ORFs of the phage genome to try to predict genes unique to this phage, and one of us would do the research and testing necessary to find the terminal repeats of the genome. For my part I was given the task of finding the terminal repeats. This work will be conducted by searching the literature for phages of the same family where the termini have already been characterized, to find clues about the terminal repeats. In addition, the coverage of the reads mapped back to the genome should give clues about the position of the termini, since it can be assumed that the terminal repeats have higher coverage than the rest of the genome.

We have been looking for a tool to visualize the contigs and allow us to work with the genome in a visual manner, and Anders recommended the software Geneious. This is commercial software, but there is a possibility to get a 14-day free trial. This is the software I will use to explore the terminal repeats.

Literature research on methods and tools for assembly of viral genomes

I have been doing literature research to find out more about the general approach to assembly and the corresponding software tools used in each step. One recent paper gives an overview of the approaches to assembling viral genomes (R.J. Orton et al.).

The steps recommended for the de novo assembly and annotation of a viral genome according to R.J. Orton et al. are, first of all, to put the raw reads through quality control to remove primers/adapters. Cutadapt and Trimmomatic are two widely used tools to remove adapters. The reads are also usually trimmed to remove poor-quality bases from the ends of reads. In addition to trimming, the reads are also filtered, which means the complete removal of some reads because of low quality, short length or ambiguous base calling. For de novo assembly it is also recommended to remove exact read duplicates. Two widely used tools for filtering and trimming are Trim Galore! and PRINSEQ. Because phage samples are often contaminated with the host genome it is also recommended to “run a host sequence depletion step”. This means that the reads are first aligned to the host genome and only the unmapped reads are used for de novo assembly. But in the meeting with Anders Nilsson he said that phage genomes might contain sequences that are the same as in the host genome, so a host sequence depletion step can probably not be performed thoughtlessly.
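As an illustration, a typical paired-end Trimmomatic run could look something like the following; the file names, the adapter file and the parameter values are placeholders, not the settings used in this project:

java -jar trimmomatic.jar PE reads_R1.fastq.gz reads_R2.fastq.gz \
    R1_paired.fq.gz R1_unpaired.fq.gz R2_paired.fq.gz R2_unpaired.fq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36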

The next step is the assembly. For this step R.J. Orton et al. emphasize the importance of removing adapters and trimming low-quality bases, since only a very small fraction of the DNA will be viral and it is therefore important to get high-quality yields. The most common algorithms for de novo assembly are overlap layout consensus (OLC) and de Bruijn graphs. They mention the assemblers MIRA (OLC), Edena (OLC), ABySS (de Bruijn) and Velvet (de Bruijn). One big issue with de novo assemblies is that they consist of a multitude of contigs rather than the complete genome. This is because of “sequencing errors, repeat regions and areas with low coverage”. The recommended way of joining contigs is to align them to a related reference genome. This will probably not be possible in this case, though, since phages evolve too fast, which makes it impossible to use a reference genome. In discussions with Anders it was advised that this strategy might be possible for some of the genes, but not for any longer stretches of the phage genome. If a reference genome is not available, R.J. Orton et al. recommend using paired-end or mate-pair reads to scaffold the contigs into the correct linear order. This should be possible in this case since the data is paired-end. If the assembler does not do the scaffolding itself there are stand-alone scaffolders such as Bambus2 and BESST. For paired-end data, gap-filling software such as IMAGE and GapFiller may also be used to close some of the gaps.

After the genome assembly draft is completed it is recommended to inspect the draft genome, for example by mapping the reads back to it and looking for issues such as miscalled bases, indels and regions with no coverage. Tools exist to help in this inspection process, such as iCORN2.

SPAdes is a recommended tool that can perform most of the steps of de novo assembly and the following quality control steps and corrections.
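For reference, a basic paired-end SPAdes run with the careful option, which is roughly how the assembly described earlier was produced (file names are placeholders), looks like this:

spades.py -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz --careful -o spades_output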


Introduction to project

The group had a meeting on Wednesday (28th of November) with Anders Nilsson, a researcher at Stockholm University who studies phage genomes. Some information was provided about the phages and the project.

The most important takeaways were that phages are difficult to assemble and annotate because their evolution is very fast, so once a phage genome has been assembled and annotated it quickly becomes irrelevant as a reference for other phage genomes. For this project the genome of the phage has to be assembled de novo.

For the annotation part it is interesting to find the capsid ends and to find the terminal repeats. We were also suggested to look for promoters, ORFs, ribosome binding sites, structural genes and start and terminal regions.

The genome of the phage in this project belongs to a phage that infects E. coli, and the genome is 45-80 kb long.