Literature research on methods and tools for assembly of viral genomes

Did literature research to find out more about the general approach to assembly and the software tools used in each step. One recent paper gives an overview of the approaches to assembling viral genomes (R.J. Orton et al.).

According to R.J. Orton et al., the first recommended step in the de novo assembly and annotation of a viral genome is to put the raw reads through quality control to remove primers/adapters. Cutadapt and Trimmomatic are two widely used tools for adapter removal. The reads are usually also trimmed to remove poor-quality bases from their ends. In addition to trimming, the reads are filtered, which means that some reads are removed completely because of low quality, short length or ambiguous base calls. For de novo assembly it is also recommended to remove exact read duplicates. Two widely used tools for filtering and trimming are Trim Galore! and PRINSEQ. Because phage samples are often contaminated with the host genome, it is also recommended to “run a host sequence depletion step”. This means that the reads are first aligned to the host genome and only the unmapped reads are used for de novo assembly. However, in the meeting with Anders Nilsson he said that phage genomes might contain sequences that are identical to parts of the host genome, so a host sequence depletion step should not be applied without care, since it could also remove genuine phage reads.
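To make the trimming, filtering and duplicate removal steps concrete, here is a minimal Python sketch. It is not how Cutadapt, Trimmomatic, Trim Galore! or PRINSEQ work internally. The function names, the read representation (a base string plus a list of Phred scores) and all thresholds are assumptions chosen for illustration.

```python
from typing import Iterable, List, Tuple

# A read is represented here as (bases, per-base Phred quality scores).
# This representation is an assumption made for the sketch.
Read = Tuple[str, List[int]]


def trim_ends(read: Read, min_q: int = 20) -> Read:
    """Remove bases with quality below min_q from both ends of a read."""
    seq, quals = read
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def keep_read(read: Read, min_len: int = 30, max_n: int = 0) -> bool:
    """Filter: reject reads that are too short or have too many ambiguous bases."""
    seq, _ = read
    return len(seq) >= min_len and seq.upper().count("N") <= max_n


def clean_reads(reads: Iterable[Read], min_q: int = 20,
                min_len: int = 30, max_n: int = 0) -> List[Read]:
    """Trim, filter, then drop exact sequence duplicates (first copy is kept)."""
    seen = set()
    kept = []
    for read in reads:
        trimmed = trim_ends(read, min_q)
        if not keep_read(trimmed, min_len, max_n):
            continue
        if trimmed[0] in seen:
            continue
        seen.add(trimmed[0])
        kept.append(trimmed)
    return kept
```

Note that this sketch removes duplicates after trimming, which is a design choice of the example. Real tools may define duplicates on the untrimmed reads or on read pairs.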

The next step is the assembly. For this step R.J. Orton et al. emphasize the importance of removing adapters and trimming low-quality bases: since only a very small fraction of the DNA will be viral, it is important that the reads used are of high quality. The most common algorithms for de novo assembly are overlap layout consensus (OLC) and de Bruijn graphs. They mention the assemblers MIRA (OLC), Edena (OLC), ABySS (de Bruijn) and Velvet (de Bruijn). One big issue with de novo assemblies is that they consist of a multitude of contigs rather than the complete genome. This is because of “sequencing errors, repeat regions and areas with low coverage”. The recommended way of joining contigs is to align them to a related reference genome. This will probably not be possible in this case, though, since phages evolve too fast for a reference genome to be useful. In discussions with Anders it was advised that this strategy might work for some of the genes, but not for any longer stretches of the phage genome. If no reference genome is available for ordering the contigs, R.J. Orton et al. recommend using paired-end or mate-pair reads to scaffold the contigs into the correct linear order. This should be possible in this case since the data is paired-end. If the assembler does not do the scaffolding itself, there are stand-alone scaffolders such as Bambus2 and BESST. For paired-end data, gap-filling software such as IMAGE and GapFiller may also be used to close some of the gaps.
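The de Bruijn graph idea, and why repeats or errors split an assembly into several contigs, can be illustrated with a toy Python sketch. This is a strong simplification and not the algorithm of ABySS, Velvet or any other assembler. The function names are assumptions, and the walk only checks for branching on outgoing edges, whereas a real assembler also handles incoming branches, reverse complements and sequencing errors.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph: Dict[str, Set[str]], start: str) -> str:
    """Extend a contig from start while the current node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # stop on a cycle instead of looping forever
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Overlapping reads without branches give one contig.
clean = build_de_bruijn(["ACGTT", "GTTCA", "TTCAG"], k=3)
print(walk_unambiguous(clean, "AC"))  # ACGTTCAG

# Two reads that disagree after "ACGT" create a branch, so the walk stops there.
branched = build_de_bruijn(["ACGTA", "ACGTC"], k=3)
print(walk_unambiguous(branched, "AC"))  # ACGT
```

The second case shows in miniature why a single disagreement between reads is enough to end a contig.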

Once the draft assembly is complete, it is recommended to inspect it, for example by mapping the reads back to the draft genome and looking for issues such as miscalled bases, indels and regions with no coverage. Tools exist to help with this inspection, such as iCORN2.
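As a rough illustration of the coverage check, the Python sketch below computes per-base depth from reads that have already been mapped and reports regions below a threshold. It assumes the mapping was done by an external tool and that each alignment is reduced to a hypothetical (start, length) pair with 0-based start positions. It does not look at miscalled bases or indels.

```python
from typing import Iterable, List, Tuple


def coverage_depth(genome_len: int,
                   alignments: Iterable[Tuple[int, int]]) -> List[int]:
    """Per-base depth from (start, length) alignments, using a difference array."""
    diff = [0] * (genome_len + 1)
    for start, length in alignments:
        if start < 0 or start >= genome_len:
            continue  # ignore alignments that start outside the draft genome
        end = min(start + length, genome_len)
        diff[start] += 1
        diff[end] -= 1
    depth = []
    running = 0
    for change in diff[:genome_len]:
        running += change
        depth.append(running)
    return depth


def low_coverage_regions(depth: List[int],
                         min_depth: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where depth is below min_depth."""
    regions = []
    region_start = None
    for pos, d in enumerate(depth):
        if d < min_depth:
            if region_start is None:
                region_start = pos
        elif region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    if region_start is not None:
        regions.append((region_start, len(depth)))
    return regions
```

With min_depth=1 the function reports regions with no coverage at all. A higher threshold can be used to flag weakly supported regions as well.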

SPAdes is a recommended tool that can perform most of the de novo assembly steps as well as the subsequent quality control and correction steps.



Introduction to project

The group had a meeting on Wednesday (28th of November) with Anders Nilsson, a researcher at Stockholm University who studies phage genomes. He provided some information about the phages and the project.

The most important takeaway was that phages are difficult to assemble and annotate because they evolve very fast, so an assembled and annotated phage genome quickly becomes irrelevant as a reference for other phage genomes. For this project the genome of the phage therefore has to be assembled de novo.

For the annotation part it is of interest to find the capsid ends and the terminal repeats. It was also suggested that we look for promoters, ORFs, ribosome binding sites, structural genes and start and terminal regions.
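Of the features listed above, ORFs are the simplest to search for directly. The Python sketch below scans both strands in all three reading frames for stretches that run from ATG to the first in-frame stop codon. It is only an illustration, not a gene predictor. The function names and the min_codons threshold are assumptions, and it considers ATG only, although bacterial and phage genes can also use alternative start codons such as GTG and TTG.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_codons: int = 3) -> List[Tuple[int, int]]:
    """Half-open (start, end) of ATG..stop ORFs in the three frames of seq.

    The length in codons counts both the start and the stop codon.
    """
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    found.append((start, i + 3))
                start = None
    return found


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[str, int, int]]:
    """ORFs on both strands as (strand, start, end) in forward-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    result = [("+", s, e) for s, e in orfs_in_strand(seq, min_codons)]
    for s, e in orfs_in_strand(reverse_complement(seq), min_codons):
        result.append(("-", n - e, n - s))
    return result


print(find_orfs("CCATGAAATAGCC"))  # [('+', 2, 11)]
```

For a real genome min_codons would be set much higher to avoid reporting large numbers of short spurious ORFs.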

The phage in this project infects E. coli, and its genome is 45-80 kb long.