Quality control: FastQC before and after adapter removal

Since we realized that we need to quality control the reads, remove adapters, and trim and filter them, we have found the software FastQC, which is a quality control tool for high-throughput sequencing data. FastQC does not modify the reads; it just produces different kinds of graphs that report on the quality of the reads.
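For reference, a minimal sketch of how a FastQC run can look on Uppmax (the module name and the output directory are my assumptions; FastQC just takes one or more FASTQ files and an optional -o output directory):

module add bioinfo-tools
module add FastQC
mkdir -p fastqc_raw
fastqc -o fastqc_raw ETECp7_TCCGCGAA-CAGGACGT_L001_R1_001.fastq.gz ETECp7_TCCGCGAA-CAGGACGT_L001_R2_001.fastq.gz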

We analyzed the original dataset of reads with FastQC, and the overall quality of the data is reported as good, even though some categories raise failures and warnings. The per base sequence quality shows that the general quality of the bases is good, even though it starts to drop towards the end of the reads.

The per base quality before the adapter removal.

The category for sequence duplication levels is the only category that gives a failure. This may indicate some kind of enrichment bias, such as PCR over-amplification.

Sequence duplication levels of the raw reads.

The category for overrepresented sequences gives a warning. A sequence is regarded as overrepresented, and the software raises a warning, if it makes up more than 0.1% of the total number of sequences. An overrepresented sequence may be due to biological importance or to contamination. The table shows that the overrepresentation of two of the sequences is due to the Illumina adapter sequence.

Overrepresented reads in the dataset.

The category for adapter content also gives a warning. The graph shows that the source of the warning is a significant amount of Illumina adapter sequences.

The graph for adapter content shows a significant amount of Illumina adapter sequences.

The analysis of the raw reads shows that there is a significant amount of Illumina adapter sequence in the dataset and thus adapter removal should be performed. This was previously done with Trimmomatic (see the entry below) and the resulting reads were analyzed again with FastQC.

The per base quality improved drastically compared to before the removal of the adapters, as seen in the image below.

The per base quality improved after the adapters were removed.

Before the adapter removal the distribution of sequence lengths had a perfect score, since all the sequences were 300 nucleotides long. After the adapters were removed the distribution of sequence lengths changed, as expected, although most of the sequences are still very close to 300 nucleotides. According to the FastQC manual the software raises a warning if all the sequences are not the same length, but this should not be a big issue in this case.

The sequence length distribution changed from a pass to a warning after the adapter removal.

The sequence duplication levels did not change much after the removal of the adapters and still raise a failure, indicating an enrichment bias. The software raises a failure if more than 50% of the total number of sequences are non-unique. I’m not sure if this will cause an issue with the assembly, but we decided to continue with the next steps without looking closer into it.

The sequence duplication levels did not change compared to before the adapters were removed.

In the table of overrepresented sequences it can be seen that the adapter sequences have been removed, as expected. The rest of the sequences from before the removal are still present and their sources are still unknown; they should probably be BLASTed to find out more about their origin and importance (a sketch of how this could be done is given below the figure).

In the overrepresented table adapter sequences have been removed.
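If we get to it, a minimal hedged sketch of how such a BLAST lookup could be done from the command line (the blast module name, the query file name and the remote search against nt are my assumptions):

module add bioinfo-tools
module add blast
# paste one of the remaining overrepresented sequences from the FastQC table into overrep.fa (FASTA format)
blastn -query overrep.fa -db nt -remote -outfmt 6 -max_target_seqs 5 > overrep_hits.txt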

Finally, as expected, the graph for adapter content shows that all the adapters were removed. This category turned from a warning to a pass.

The adapter content shows that all the adapters were removed.

In conclusion, the adapter removal improved the data quality, and we decided to continue with the adapter-trimmed reads.

Quality control: Removing adapters from reads in dataset

The group met today and we added quality control of the reads to the project plan. We looked at using either Trimmomatic or Cutadapt. Trimmomatic would be the preferred option since it is a trimming tool made specifically for Illumina NGS data. The adapter sequences to be removed are also distributed with the software, unlike with Cutadapt, where the user has to supply the adapter sequences themselves.

According to the Trimmomatic manual, a FASTA file containing the adapter sequences (and PCR sequences etc.) should be specified in addition to the dataset. This file is distributed with Trimmomatic and contains the Illumina adapter sequences. It does not really make sense to me that the path to this file needs to be specified, since it is distributed with the software, and since the software only works with data from Illumina sequencing machines there are not a lot of different options for the user to specify. Finding this file on a distributed system like Uppmax is what took the most time in trying to use this software. The solution was instead to find this FASTA file with the Illumina sequences on the internet and upload it to the same folder as the files with the reads.
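In hindsight, the bundled adapter files can usually be listed directly from the installation directory. A minimal sketch, assuming the Uppmax module sets $TRIMMOMATIC_HOME (it is used that way in the command below) and that the adapter FASTA files live somewhere beneath it:

module add bioinfo-tools
module add trimmomatic
# list any adapter FASTA files shipped with the installation
find "$TRIMMOMATIC_HOME" -name "*.fa"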

The options used for Trimmomatic were the default options specified in the paired-end example on the Trimmomatic webpage. In brief: ILLUMINACLIP removes the adapters listed in TruSeq3-PE-2.fa, LEADING:3 and TRAILING:3 cut bases with quality below 3 from the start and end of each read, SLIDINGWINDOW:4:15 cuts the read once the average quality within a 4-base window falls below 15, and MINLEN:36 drops reads shorter than 36 bases:

module add bioinfo-tools

module add trimmomatic

java -jar $TRIMMOMATIC_HOME/trimmomatic.jar PE -phred33 ETECp7_TCCGCGAA-CAGGACGT_L001_R1_001.fastq.gz ETECp7_TCCGCGAA-CAGGACGT_L001_R2_001.fastq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

And the output was:

TrimmomaticPE: Started with arguments:
-phred33 ETECp7_TCCGCGAA-CAGGACGT_L001_R1_001.fastq.gz ETECp7_TCCGCGAA-CAGGACGT_L001_R2_001.fastq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 799754 Both Surviving: 718152 (89.80%) Forward Only Surviving: 77273 (9.66%) Reverse Only Surviving: 631 (0.08%) Dropped: 3698 (0.46%)
TrimmomaticPE: Completed successfully

I can’t really interpret the summary of the output yet. Are the results good or bad? I will look into it tomorrow.

Literature research on methods and tools for assembly of viral genomes

I have been doing literature research to find out more about the general approach to genome assembly and the corresponding software tools used in each step. One recent paper gives an overview of the approaches to assembling viral genomes (R.J. Orton et al.).

The steps recommended for the de novo assembly and annotation of a viral genome according to R.J. Orton et al. would be, first of all, to put the raw reads through a quality control step to remove primers/adapters from the reads. Cutadapt and Trimmomatic are two widely used tools to remove adapters. The reads are also usually trimmed to remove poor-quality bases from the ends of reads. In addition to trimming, the reads are also filtered, which means the complete removal of some reads because of low quality, short length or ambiguous base calling. For de novo assembly it is also recommended to remove exact read duplicates. Two widely used tools for filtering and trimming are Trim Galore! and PRINSEQ. Because phage samples often are contaminated with the host genome it is also recommended to “run a host sequence depletion step”. This means that the reads are first aligned to the host genome and only the unmapped reads are used for de novo assembly (a sketch of how this could look is given below). But, in the meeting with Anders Nilsson, he said that phage genomes might contain sequences that are identical to the host genome, so a host sequence depletion step can probably not be performed thoughtlessly.
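To make the depletion step concrete, here is a minimal sketch using bwa and samtools (the tool choices and all file names are my assumptions, not from the paper): the reads are mapped to the host genome and only pairs where neither mate maps are kept for assembly:

bwa index host_genome.fa
bwa mem host_genome.fa output_forward_paired.fq.gz output_reverse_paired.fq.gz > mapped_to_host.sam
# SAM flag 12 = read unmapped (4) + mate unmapped (8); keep only fully unmapped pairs, name-sorted for pairing
samtools view -b -f 12 mapped_to_host.sam | samtools sort -n -o host_depleted.bam -
# write the surviving pairs back to FASTQ for the assembler
samtools fastq -1 depleted_R1.fq -2 depleted_R2.fq host_depleted.bam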

The next step is the assembly. For this step R.J. Orton et al. emphasize the importance of removing adapters and trimming bases of low quality: since a very low amount of the DNA will be viral, it is important that the yield is of high quality. The most common algorithms for de novo assembly are overlap layout consensus (OLC) and de Bruijn graphs. They mention the assemblers MIRA (OLC), Edena (OLC), ABySS (de Bruijn) and Velvet (de Bruijn). One big issue with de novo assemblies is that they consist of a multitude of contigs and not the complete genome. This is because of “sequencing errors, repeat regions and areas with low coverage”. The recommended way of joining contigs is to align them to a related reference genome. This will probably not be possible in this case, though, since phages evolve too fast, which makes it impossible to use a reference genome. In discussions with Anders it was advised that this strategy might be possible for some of the genes, but not for any longer stretches of the phage genome. If a reference genome is not available, R.J. Orton et al. recommend using paired-end reads or mate-pair reads to scaffold the contigs into the correct linear order. This should be possible in this case since the data is paired-end. If the assembler does not do the scaffolding inherently there are stand-alone scaffolders such as Bambus2 and BESST. For paired-end data, gap filling software such as IMAGE and GapFiller may also be used to close some of the gaps.

After the genome assembly draft is completed it is recommended to inspect the draft genome, for example by mapping the reads back to it and looking for issues such as miscalled bases, indels and regions of no coverage. Tools exist to help in this inspection process, such as iCORN2.
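A minimal sketch of such an inspection, again assuming bwa and samtools and placeholder file names (not tools named by the paper): map the trimmed reads back onto the draft and list positions with no coverage:

bwa index draft_genome.fa
bwa mem draft_genome.fa output_forward_paired.fq.gz output_reverse_paired.fq.gz | samtools sort -o draft_mapped.bam -
samtools index draft_mapped.bam
# -a reports every position; keep those with zero depth (column 3)
samtools depth -a draft_mapped.bam | awk '$3 == 0' > zero_coverage_positions.txt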

SPAdes is a recommended tool that can perform most of the steps of the de novo assembly as well as the subsequent quality control and correction steps.
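For later reference, a minimal sketch of what a SPAdes run on the trimmed paired-end reads could look like (the module name, the --careful flag and the output directory are my assumptions):

module add bioinfo-tools
module add spades
spades.py -1 output_forward_paired.fq.gz -2 output_reverse_paired.fq.gz --careful -o spades_assembly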